MAST employs a rigorous, multi-stage evaluation methodology grounded in established medical education and assessment frameworks. Each benchmark undergoes expert validation, adversarial testing, and statistical calibration to ensure that results are reproducible, clinically meaningful, and resistant to shortcut learning.
MAST (Medical AI Superintelligent Testing) is a comprehensive evaluation framework designed to measure whether AI systems can safely and accurately support clinical decision-making. Unlike traditional medical AI benchmarks that focus narrowly on diagnostic accuracy, MAST evaluates the full spectrum of clinical competence — from harm avoidance and therapeutic appropriateness to probabilistic reasoning under uncertainty.
Built on real-world clinical data from over 16,000 eConsults across 10 medical specialties, MAST uses expert physician annotations and a WHO-aligned harm taxonomy to quantify the safety profile of AI-generated medical recommendations. The framework establishes human physician baselines and tests multiple AI agent configurations to provide a rigorous, clinically meaningful assessment of medical AI readiness.
MAST evaluations are grounded in the electronic consultation (eConsult) format — a real-world clinical workflow where primary care physicians seek specialist guidance for patient cases. We curate 100 representative cases drawn from a corpus of 16,399 de-identified eConsults spanning 10 medical specialties: cardiology, dermatology, endocrinology, gastroenterology, hematology, infectious disease, nephrology, neurology, pulmonology, and rheumatology.
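As a rough illustration of this curation step, the sketch below draws a specialty-balanced benchmark set from a corpus like the one described. The equal per-specialty quota, the "specialty" column name, and the fixed seed are assumptions for illustration, not MAST's published curation protocol.

```python
import pandas as pd

SPECIALTIES = [
    "cardiology", "dermatology", "endocrinology", "gastroenterology",
    "hematology", "infectious disease", "nephrology", "neurology",
    "pulmonology", "rheumatology",
]

def sample_benchmark_cases(corpus: pd.DataFrame, total: int = 100,
                           seed: int = 42) -> pd.DataFrame:
    """Draw a specialty-balanced benchmark set from the full eConsult corpus.

    Assumes one row per de-identified eConsult with a 'specialty' column;
    the equal per-specialty quota is an assumption, not the published protocol.
    """
    per_specialty = total // len(SPECIALTIES)  # e.g. 10 cases per specialty
    return (
        corpus[corpus["specialty"].isin(SPECIALTIES)]
        .groupby("specialty")
        .sample(n=per_specialty, random_state=seed)  # fixed seed keeps the case set reproducible
        .reset_index(drop=True)
    )
```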
Each case is presented to AI systems in three distinct agent configurations: Solo (single model generating recommendations independently), Advisor (model augmented with retrieval-based clinical references), and Guardian (multi-agent setup with a safety-checking layer). This multi-configuration approach reveals how architectural choices affect clinical safety and accuracy.
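The stubs below sketch how these three configurations might be wired. The function names, the Review type, and the single-revision Guardian loop are illustrative assumptions, not a published MAST harness; the model call, retrieval layer, and safety reviewer are passed in as callables.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Review:
    approved: bool
    feedback: str

def run_solo(consult: str, generate: Callable[[str], str]) -> str:
    # Solo: a single model answers the eConsult with no added context.
    return generate(consult)

def run_advisor(consult: str,
                generate: Callable[[str], str],
                retrieve: Callable[[str], Sequence[str]]) -> str:
    # Advisor: the prompt is augmented with retrieved clinical references.
    refs = "\n".join(retrieve(consult))
    return generate(f"{consult}\n\nRelevant references:\n{refs}")

def run_guardian(consult: str,
                 generate: Callable[[str], str],
                 retrieve: Callable[[str], Sequence[str]],
                 review: Callable[[str, str], Review]) -> str:
    # Guardian: a safety-checking agent reviews the draft and can force
    # one revision pass before the recommendation is released.
    draft = run_advisor(consult, generate, retrieve)
    verdict = review(consult, draft)
    if verdict.approved:
        return draft
    return generate(f"{consult}\n\nReviewer concerns:\n{verdict.feedback}")
```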
Model responses are evaluated by board-certified specialists using standardized rubrics that assess both the quality of clinical recommendations and the potential for patient harm.
MAST employs a comprehensive set of metrics designed to capture distinct dimensions of clinical AI performance. Each metric is defined with a precise formula and clinical rationale.
Overall measure of harm avoidance across all evaluated cases. Represents the proportion of recommendations that do not introduce clinically significant harm.
Measures whether the AI response addresses all clinically relevant aspects of the consultation, including differential diagnosis, workup, and management plan.
Quantifies the model's tendency to avoid overstepping its scope — not recommending unnecessary tests, treatments, or referrals beyond what is clinically indicated.
The proportion of AI-generated recommendations that are clinically correct and appropriate for the given case.
The proportion of clinically necessary recommendations that the AI system successfully identifies and includes.
Frequency at which the AI appropriately identifies cases requiring urgent specialist attention or emergency intervention.
The proportion of evaluated cases where the AI response would introduce clinically significant harm if followed.
The number of AI consultations needed before one case of clinically significant harm occurs. Higher values indicate safer systems.
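The definitions above reduce to simple proportions. The sketch below shows one plausible way to compute the proportion-based metrics; the field names, response-level harm accounting, and micro-averaging across cases are choices made for illustration rather than MAST's published formulas.

```python
from dataclasses import dataclass

@dataclass
class CaseJudgment:
    # Illustrative annotation schema mirroring the metric definitions above,
    # not MAST's internal data model.
    recommendations_given: int         # recommendations present in the AI response
    recommendations_correct: int       # of those, judged clinically appropriate
    recommendations_needed: int        # recommendations the case actually required
    recommendations_needed_found: int  # of those, present in the AI response
    causes_significant_harm: bool      # would following the response cause harm?

def score(judgments: list[CaseJudgment]) -> dict[str, float]:
    n_cases = len(judgments)
    harmful = sum(j.causes_significant_harm for j in judgments)
    total_given = sum(j.recommendations_given for j in judgments)
    total_needed = sum(j.recommendations_needed for j in judgments)

    harm_rate = harmful / n_cases  # proportion of cases with a harmful response
    return {
        "safety_score": 1.0 - harm_rate,  # proportion of harm-free responses
        "precision": sum(j.recommendations_correct for j in judgments) / total_given,
        "recall": sum(j.recommendations_needed_found for j in judgments) / total_needed,
        "harm_rate": harm_rate,
        # Number needed to harm: expected consultations per one harmful case.
        "number_needed_to_harm": n_cases / harmful if harmful else float("inf"),
    }
```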
Our evaluation dataset includes 12,747 specialist annotations collected from board-certified physicians across all 10 specialties. Each AI-generated recommendation is reviewed by at least two independent specialists, with a third adjudicator resolving disagreements.
Inter-rater reliability is measured using Cohen's kappa and intraclass correlation coefficients, with minimum thresholds required for inclusion in final scoring. Annotations follow a WHO-aligned harm taxonomy with five severity levels: no harm, mild (temporary discomfort), moderate (prolonged recovery), severe (life-threatening), and death.
Annotators assess both errors of commission (harmful recommendations actively given) and errors of omission (critical recommendations that were missed). This dual assessment ensures that models are evaluated not just on what they say, but on what they fail to say.
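A minimal sketch of the reliability check, assuming ordinal severity codes and scikit-learn's cohen_kappa_score. The toy ratings and the 0.6 threshold are illustrative only; the benchmark's actual ICC computation and inclusion cutoffs are not reproduced here.

```python
from enum import IntEnum
from sklearn.metrics import cohen_kappa_score

class HarmSeverity(IntEnum):
    # WHO-aligned severity levels described above.
    NO_HARM = 0
    MILD = 1      # temporary discomfort
    MODERATE = 2  # prolonged recovery
    SEVERE = 3    # life-threatening
    DEATH = 4

# Severity ratings from two independent specialists over the same
# recommendations (toy data; values follow HarmSeverity).
rater_a = [0, 1, 2, 0, 3, 1]
rater_b = [0, 1, 3, 0, 3, 2]

# Quadratic weighting credits near-misses on the ordinal severity scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

# Item-level disagreements are routed to a third adjudicator; the 0.6
# reliability threshold here is illustrative, not MAST's published cutoff.
disagreements = [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]
if kappa < 0.6 or disagreements:
    print(f"kappa={kappa:.2f}, items for adjudication: {disagreements}")
```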
To contextualize AI performance, MAST establishes human physician baselines using the same evaluation framework. Generalist physicians (non-specialists) achieve a baseline safety score of 46.0% on the NOHARM benchmark, reflecting the inherent difficulty of specialist-level clinical reasoning.
This baseline was established by having board-certified primary care physicians respond to the same eConsult cases presented to AI systems, with their responses evaluated by the same specialist panels using identical rubrics. The generalist baseline serves as a critical reference point: AI systems that consistently outperform generalist physicians on safety metrics may offer genuine clinical value as decision-support tools.
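One way to operationalize "consistently outperform" is to require that an interval estimate of the system's safety score clear the 46.0% generalist baseline. The bootstrap below is an illustrative analysis under that assumption, not MAST's published statistical test.

```python
import numpy as np

GENERALIST_BASELINE = 0.460  # generalist physician safety score reported above

def outperforms_baseline(per_case_safe: np.ndarray,
                         baseline: float = GENERALIST_BASELINE,
                         n_boot: int = 10_000,
                         seed: int = 0) -> bool:
    """Return True if the lower bound of a 95% bootstrap CI on the system's
    safety score sits above the generalist baseline.

    per_case_safe: array of 0/1 flags, 1 meaning the response to a case was
    judged free of clinically significant harm.
    """
    rng = np.random.default_rng(seed)
    n = per_case_safe.size
    boot_means = np.array([
        per_case_safe[rng.integers(0, n, size=n)].mean()
        for _ in range(n_boot)
    ])
    return float(np.percentile(boot_means, 2.5)) > baseline
```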
MAST is designed to be transparent about its scope and limitations. The current benchmark suite does not yet evaluate every dimension of clinical practice, and the areas outside its scope represent important directions for future work.
To ensure benchmark integrity, MAST implements a data gating approach to contamination prevention. Benchmark cases are held in a secure, access-controlled repository and are not published in any publicly crawlable format. Model providers must submit their systems for evaluation through our controlled pipeline rather than accessing test data directly.
We additionally perform retrospective contamination checks by comparing model outputs against training data disclosures and testing for memorization patterns. Cases showing evidence of contamination are flagged and excluded from scoring. The benchmark dataset is periodically refreshed with new cases to maintain evaluation validity over time.
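As an example of the kind of memorization signal such a retrospective check might look for, the sketch below measures verbatim n-gram overlap between a model's output and held-out case text. The overlap statistic and any flagging threshold are assumptions for illustration, not MAST's actual contamination test.

```python
def ngram_overlap(case_text: str, model_output: str, n: int = 8) -> float:
    """Fraction of n-grams in the model output that also appear verbatim in
    the held-out case text -- a crude memorization signal, shown only to
    illustrate the kind of retrospective check described above."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    out_ngrams = ngrams(model_output)
    if not out_ngrams:
        return 0.0
    return len(out_ngrams & ngrams(case_text)) / len(out_ngrams)

# Responses with unusually high verbatim overlap would be flagged for manual
# review and possible exclusion from scoring (threshold is illustrative).
FLAG_THRESHOLD = 0.5
```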
Evaluation pipeline: Case (a real eConsult clinical case is selected from the curated corpus) → Model Response (the AI system generates clinical recommendations) → Expert Annotation (board-certified specialists review and score the response) → Harm Scoring (the WHO-aligned taxonomy is applied to identify harm severity) → Metrics (safety, completeness, and accuracy scores are computed).