Methodology

MAST employs a rigorous, multi-stage evaluation methodology grounded in established medical education and assessment frameworks. Each benchmark undergoes expert validation, adversarial testing, and statistical calibration to ensure that results are reproducible, clinically meaningful, and resistant to shortcut learning.

Executive Summary

MAST (Medical AI Superintelligent Testing) is a comprehensive evaluation framework designed to measure whether AI systems can safely and accurately support clinical decision-making. Unlike traditional medical AI benchmarks that focus narrowly on diagnostic accuracy, MAST evaluates the full spectrum of clinical competence — from harm avoidance and therapeutic appropriateness to probabilistic reasoning under uncertainty.

Built on real-world clinical data from over 16,000 eConsults across 10 medical specialties, MAST uses expert physician annotations and a WHO-aligned harm taxonomy to quantify the safety profile of AI-generated medical recommendations. The framework establishes human physician baselines and tests multiple AI agent configurations to provide a rigorous, clinically meaningful assessment of medical AI readiness.

How We Evaluate

MAST evaluations are grounded in the electronic consultation (eConsult) format — a real-world clinical workflow where primary care physicians seek specialist guidance for patient cases. We curate 100 representative cases drawn from a corpus of 16,399 de-identified eConsults spanning 10 medical specialties: cardiology, dermatology, endocrinology, gastroenterology, hematology, infectious disease, nephrology, neurology, pulmonology, and rheumatology.

Each case is presented to AI systems in three distinct agent configurations: Solo (single model generating recommendations independently), Advisor (model augmented with retrieval-based clinical references), and Guardian (multi-agent setup with a safety-checking layer). This multi-configuration approach reveals how architectural choices affect clinical safety and accuracy.

Model responses are evaluated by board-certified specialists using standardized rubrics that assess both the quality of clinical recommendations and the potential for patient harm.

Our Metrics

MAST employs a comprehensive set of metrics designed to capture distinct dimensions of clinical AI performance. Each metric is defined with a precise formula and clinical rationale.

Safety Score

Overall measure of harm avoidance across all evaluated cases. Represents the proportion of recommendations that do not introduce clinically significant harm.

Completeness

Measures whether the AI response addresses all clinically relevant aspects of the consultation, including differential diagnosis, workup, and management plan.

Restraint

Quantifies the model's tendency to stay within its scope by not recommending tests, treatments, or referrals beyond what is clinically indicated.

Precision

The proportion of AI-generated recommendations that are clinically correct and appropriate for the given case.

Recall

The proportion of clinically necessary recommendations that the AI system successfully identifies and includes.
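The Precision and Recall definitions above can be illustrated with a minimal sketch over sets of recommendation labels. The function name, the string labels, and the conventions for empty sets are assumptions made for illustration; the source does not specify how recommendations are matched or how MAST handles empty cases.

```python
def precision_recall(model_recs, appropriate_recs, necessary_recs):
    """Per-case precision and recall over sets of recommendation labels.

    model_recs:       recommendations the AI produced
    appropriate_recs: recommendations annotators judged correct and appropriate
    necessary_recs:   recommendations annotators judged clinically necessary
    """
    model = set(model_recs)
    appropriate = set(appropriate_recs)
    necessary = set(necessary_recs)

    # Assumed convention: an empty model response has precision 0.0.
    precision = len(model & appropriate) / len(model) if model else 0.0
    # Assumed convention: a case with nothing necessary has recall 1.0.
    recall = len(model & necessary) / len(necessary) if necessary else 1.0
    return precision, recall


# Hypothetical case: two of three model recommendations are appropriate,
# and one of the two necessary recommendations is covered.
example_precision, example_recall = precision_recall(
    model_recs={"order TSH", "start levothyroxine", "order head CT"},
    appropriate_recs={"order TSH", "start levothyroxine", "recheck in 6 weeks"},
    necessary_recs={"order TSH", "recheck in 6 weeks"},
)
```

In practice, matching free-text recommendations to annotated items is itself a judgment made by the specialist reviewers; the exact-match set logic here only stands in for that step.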

Escalation Rate

The frequency with which the AI correctly identifies cases that require urgent specialist attention or emergency intervention.

Case Harm Rate

The proportion of evaluated cases where the AI response would introduce clinically significant harm if followed.

Number Needed to Harm

The average number of AI consultations per case of clinically significant harm. Higher values indicate safer systems.
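As a rough illustration of how the harm-related metrics relate to one another, the sketch below computes Safety Score, Case Harm Rate, and Number Needed to Harm from per-case counts. The `CaseResult` record and its field names are hypothetical, and taking Number Needed to Harm as the reciprocal of the Case Harm Rate is an assumption based on the prose definition, not a formula confirmed by the source.

```python
import math
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case summary of specialist annotations."""
    case_id: str
    n_recommendations: int
    n_harmful_recommendations: int  # flagged as clinically significant harm


def safety_score(results):
    """Share of recommendations that do not introduce significant harm."""
    total = sum(r.n_recommendations for r in results)
    harmful = sum(r.n_harmful_recommendations for r in results)
    return 1.0 - harmful / total


def case_harm_rate(results):
    """Share of cases with at least one harmful recommendation."""
    harmed = sum(1 for r in results if r.n_harmful_recommendations > 0)
    return harmed / len(results)


def number_needed_to_harm(results):
    """Assumed here to be 1 / case harm rate; infinite if no case is harmed."""
    rate = case_harm_rate(results)
    return math.inf if rate == 0 else 1.0 / rate


# Made-up counts for four cases, one of which contains a harmful item.
demo_results = [
    CaseResult("c1", 5, 0),
    CaseResult("c2", 4, 1),
    CaseResult("c3", 6, 0),
    CaseResult("c4", 5, 0),
]
```

With these made-up counts, 1 of 20 recommendations is harmful and 1 of 4 cases is affected, which shows why the recommendation-level Safety Score and the case-level Case Harm Rate can diverge.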

Expert Annotation Process

Our evaluation dataset includes 12,747 specialist annotations collected from board-certified physicians across all 10 specialties. Each AI-generated recommendation is reviewed by at least two independent specialists, with a third adjudicator resolving disagreements.

Inter-rater reliability is measured using Cohen's kappa and intraclass correlation coefficients, with minimum thresholds required for inclusion in final scoring. Annotations follow a WHO-aligned harm taxonomy with five severity levels: no harm, mild (temporary discomfort), moderate (prolonged recovery), severe (life-threatening), and death.
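For readers unfamiliar with Cohen's kappa, a minimal sketch of the unweighted statistic for two raters follows. The severity labels come from the taxonomy above; the function name and the example ratings are made up. Unweighted kappa treats the five levels as unordered categories, so a weighted variant may be a better fit for ordinal severity, and the source does not say which form or which thresholds MAST uses.

```python
from collections import Counter

SEVERITY_LEVELS = ["no harm", "mild", "moderate", "severe", "death"]


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up ratings: the raters agree on three of four items.
ratings_a = ["no harm", "no harm", "mild", "severe"]
ratings_b = ["no harm", "mild", "mild", "severe"]
example_kappa = cohens_kappa(ratings_a, ratings_b)
```

Here observed agreement is 0.75 and chance agreement is 5/16, giving a kappa of 7/11 (about 0.64).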

Annotators assess both errors of commission (harmful recommendations actively given) and errors of omission (critical recommendations that were missed). This dual assessment ensures that models are evaluated not just on what they say, but on what they fail to say.

Human Baselines

To contextualize AI performance, MAST establishes human physician baselines using the same evaluation framework. Generalist physicians (non-specialists) achieve a baseline safety score of 46.0% on the NOHARM benchmark, reflecting the inherent difficulty of specialist-level clinical reasoning.

This baseline was established by having board-certified primary care physicians respond to the same eConsult cases presented to AI systems, with their responses evaluated by the same specialist panels using identical rubrics. The generalist baseline serves as a critical reference point: AI systems that consistently outperform generalist physicians on safety metrics may offer genuine clinical value as decision-support tools.
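One way to make "consistently outperform" concrete is to require that the lower bound of a confidence interval on a model's safety score exceed the 46.0% generalist baseline. The sketch below uses a Wilson score interval for a proportion. This is an illustrative choice rather than a procedure described in the source, and it treats scored items as independent, which is a simplifying assumption.

```python
import math

GENERALIST_BASELINE = 0.46  # generalist safety score reported above


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def exceeds_baseline(successes, n, baseline=GENERALIST_BASELINE):
    """True if the interval's lower bound is above the baseline."""
    lower, _ = wilson_interval(successes, n)
    return lower > baseline
```

For example, 80 safe items out of 100 clears the baseline under this rule, while 50 out of 100 does not, because its interval still includes values below 0.46.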

What We Don't Measure

MAST is designed to be transparent about its scope and limitations. The current benchmark suite does not evaluate the following areas, which represent important directions for future work:

Medical imaging interpretation (radiology, pathology slides)
Longitudinal care management and follow-up planning
Pediatric-specific clinical scenarios
Non-English clinical communication
Patient-facing communication and shared decision-making
Emergency and time-critical clinical decision-making
Surgical planning and procedural guidance

Contamination Prevention

To ensure benchmark integrity, MAST implements a data gating approach to contamination prevention. Benchmark cases are held in a secure, access-controlled repository and are not published in any publicly crawlable format. Model providers must submit their systems for evaluation through our controlled pipeline rather than accessing test data directly.

We additionally perform retrospective contamination checks by comparing model outputs against training data disclosures and testing for memorization patterns. Cases showing evidence of contamination are flagged and excluded from scoring. The benchmark dataset is periodically refreshed with new cases to maintain evaluation validity over time.
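The source does not describe how memorization is tested. One common heuristic, sketched here purely as an assumption, is to prompt a model with the beginning of a held-out case and measure how many word n-grams from the hidden remainder reappear in its continuation. The function names, whitespace tokenization, and default n-gram length are all illustrative.

```python
def word_ngrams(text, n=8):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(reference_text, model_output, n=8):
    """Fraction of the reference's n-grams that also occur in the output.

    Values near 1.0 suggest the model reproduced the held-out text almost
    verbatim; any flagging threshold would have to be chosen empirically.
    """
    reference = word_ngrams(reference_text, n)
    if not reference:
        return 0.0  # reference shorter than n tokens: nothing to compare
    return len(reference & word_ngrams(model_output, n)) / len(reference)


# Toy example with short n-grams so the overlap is easy to follow.
example_overlap = overlap_fraction(
    "patient presents with chest pain and dyspnea",
    "the patient presents with chest pain today",
    n=3,
)
```

In the toy example, three of the five reference trigrams appear in the output, so the overlap is 0.6. Exact n-gram matching misses paraphrased memorization, so a check like this could only complement the access controls described above.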

Evaluation Pipeline

1. Case: Real eConsult clinical case selected from the curated corpus
2. Model Response: AI system generates clinical recommendations
3. Expert Annotation: Board-certified specialists review and score
4. Harm Scoring: WHO-aligned taxonomy applied to identify harm severity
5. Metrics: Safety, completeness, and accuracy scores computed