ARISE

MAST

Methodology

MAST is a multi-benchmark evaluation suite grounded in established medical education and assessment frameworks: the Script Concordance Test for probabilistic reasoning, NEJM Clinicopathological Conferences for diagnostic workups, WHO-aligned harm taxonomies for safety, the eConsult workflow for specialist-level triage, FHIR-based task completion for agentic clinical work, and physician-authored conversational rubrics for patient-facing dialogue. Each benchmark is scored against an expert reference standard, and the weighted benchmarks are combined into a single composite score, so results are reproducible, clinically meaningful, and resistant to shortcut learning.

Executive Summary

MAST is a comprehensive evaluation framework designed to measure whether AI systems can safely and accurately support clinical decision-making. Unlike benchmarks that focus narrowly on a single task such as multiple-choice diagnosis, MAST evaluates the full spectrum of clinical competence across five dimensions: harm avoidance (NOHARM v1 and First Do NOHARM v2), probabilistic clinical reasoning under uncertainty (Script Concordance Test), diagnostic reasoning and information gathering (CPC-Bench), multimodal image interpretation (ReXrank chest radiology and Multimodal Derm), and agentic task completion inside clinical systems (MedAgentBench v2, executed against a FHIR server).

Every benchmark is built on an expert-authored reference standard: specialist-graded rubrics, NEJM CPC expert diagnoses, Script Concordance expert panel distributions, gold FHIR task completions, dermatologist-labeled images with Fitzpatrick annotations, and HealthBench's 48,562 rubric criteria authored by 262 physicians across 60 countries. Per-benchmark scores are combined into a single MAST composite via a weighted harmonic mean, so a model cannot offset a safety failure with strength on another axis. Human physician baselines are measured on the same evaluations using the same scoring panels, providing a clinically meaningful reference for AI readiness.

How We Evaluate

Each MAST benchmark runs as a two-stage pipeline. A standardized inference stage queries the model under a fixed prompt and sampling configuration. A scoring stage then compares responses to the benchmark's reference standard, producing a uniform row format (category, metric, trials, mean, confidence interval) that feeds the composite. Scoring is deterministic where ground truth exists (multiple-choice questions, Script Concordance Likert distributions, FHIR task success, image labels, radiology reference reports) and panel-based where answers are free text (multi-LLM judge panels with cluster-bootstrap confidence intervals for First Do NOHARM).
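The cluster-bootstrap confidence intervals mentioned above can be illustrated with a minimal sketch. It assumes that a cluster is one case together with its perturbed variants, that item scores lie in [0, 1], and that a percentile interval is used. The function name, the resample count, and the seed are hypothetical and not taken from the MAST pipeline.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(scores_by_cluster, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean item score, resampling whole clusters.

    scores_by_cluster maps a cluster id (assumed: one case plus its
    perturbations) to the list of item scores in that cluster.
    """
    rng = random.Random(seed)
    clusters = list(scores_by_cluster.values())
    estimates = []
    for _ in range(n_resamples):
        # Draw clusters with replacement so within-case correlation is preserved.
        sample = rng.choices(clusters, k=len(clusters))
        estimates.append(mean([s for cluster in sample for s in cluster]))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    point = mean([s for cluster in clusters for s in cluster])
    return point, lo, hi
```

Resampling whole clusters rather than individual items keeps correlated items together, which generally gives wider and more honest intervals than an item-level bootstrap.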

Across the suite, MAST evaluates each model on tens of thousands of expert-authored items spanning ten medical specialties. The current suite covers 42 NOHARM v1 specialist cases (with 4,249 graded option rows), 100 First Do NOHARM v2 cases played under a base version plus 10 adversarial perturbations each for 1,100 total items, 750 Script Concordance items with expert panel distributions, 100 NEJM Clinicopathological Conferences scored for differential diagnosis and management plus 2,002 board-style QA items and 1,173 image-based VQA items, 300 MedAgentBench v2 patient tasks across 10 FHIR task categories, three chest X-ray cohorts (IU X-ray, CheXpert Plus, MIMIC-CXR) for ReXrank, and four dermatology datasets (DDI, PAD-UFES-20, Fitzpatrick 17k, SCIN) for Multimodal Derm, with Fitzpatrick skin-tone equity gaps reported alongside accuracy.
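As an illustration of reporting a skin-tone equity gap alongside accuracy, the sketch below computes balanced accuracy (the mean of per-class recall) overall and per subgroup. The definition of the gap as the best subgroup score minus the worst is an assumption made for this example, as are the function names and the toy labels; the source does not state the exact formula.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def equity_gap(y_true, y_pred, groups):
    """Assumed definition: best minus worst subgroup balanced accuracy."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    scores = {g: balanced_accuracy(ts, ps) for g, (ts, ps) in by_group.items()}
    return max(scores.values()) - min(scores.values()), scores


# Toy data: labels and Fitzpatrick groupings are illustrative only.
y_true = ["malignant", "benign", "benign", "malignant", "benign", "malignant"]
y_pred = ["malignant", "benign", "benign", "benign", "benign", "malignant"]
groups = ["I-II", "I-II", "I-II", "V-VI", "V-VI", "V-VI"]
gap, per_group = equity_gap(y_true, y_pred, groups)
```

In this toy data the overall balanced accuracy looks reasonable while one subgroup scores noticeably lower, which is the situation a separately reported gap is meant to expose.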

Per-benchmark primary metrics are combined into a single MAST composite. Each parent benchmark score (CPC-Bench, Multimodal) is the arithmetic mean of its children's primary metrics; the MAST composite is the weighted harmonic mean across the weighted benchmarks (First Do NOHARM 0.10, Script Concordance 0.10, CPC-Bench 0.10, Multimodal 0.20, MedAgentBench 0.10). The harmonic mean is deliberate: a model with a serious weakness on any single axis, especially harm avoidance, cannot mask it with strength elsewhere.
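A short sketch shows why the harmonic mean behaves this way. The weights are the ones listed above; the benchmark scores are invented for illustration, scores are assumed to lie on a 0 to 1 scale, and dividing by the sum of the weights (which is 0.6 here rather than 1) is an assumption about how the composite is normalized. Returning 0 for a non-positive score is likewise a convention chosen for this example.

```python
WEIGHTS = {
    "First Do NOHARM": 0.10,
    "Script Concordance": 0.10,
    "CPC-Bench": 0.10,
    "Multimodal": 0.20,
    "MedAgentBench": 0.10,
}


def weighted_harmonic_mean(scores, weights=WEIGHTS):
    """Weighted harmonic mean, normalized by the total weight."""
    if any(scores[name] <= 0 for name in weights):
        return 0.0  # assumed convention: a zero on any axis zeroes the composite
    total = sum(weights.values())
    return total / sum(w / scores[name] for name, w in weights.items())


def weighted_arithmetic_mean(scores, weights=WEIGHTS):
    """Ordinary weighted average, shown only for comparison."""
    total = sum(weights.values())
    return sum(w * scores[name] for name, w in weights.items()) / total


# Hypothetical model: strong everywhere except harm avoidance.
scores = {name: 0.8 for name in WEIGHTS}
scores["First Do NOHARM"] = 0.2

composite = weighted_harmonic_mean(scores)
average = weighted_arithmetic_mean(scores)
```

With these made-up numbers the harmonic composite is about 0.53 while the arithmetic average is 0.70, so the single weak axis pulls the composite down much further than a plain average would.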

Our Metrics

MAST reports a set of metrics, each designed to capture a distinct dimension of clinical AI performance. Each metric is defined with a precise formula and clinical rationale.

Safety Score

Overall measure of harm avoidance across all evaluated cases. Represents the proportion of recommendations that do not introduce clinically significant harm.

Completeness

Measures whether the AI response addresses all clinically relevant aspects of the consultation, including differential diagnosis, workup, and management plan.

Restraint

Quantifies the model's tendency to stay within its scope: not recommending tests, treatments, or referrals beyond what is clinically indicated.

Precision

The proportion of AI-generated recommendations that are clinically correct and appropriate for the given case.

Recall

The proportion of clinically necessary recommendations that the AI system successfully identifies and includes.
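Read as set operations, Precision and Recall as defined above might be computed as in the following sketch. It assumes each case has a set of appropriate recommendations and a (usually smaller) set of necessary ones; the function name, the recommendation identifiers, and the conventions for empty sets are hypothetical.

```python
def precision_recall(recommended, appropriate, necessary):
    """Set-based precision and recall for one case.

    precision: share of the model's recommendations that are appropriate.
    recall:    share of the necessary recommendations the model included.
    """
    recommended = set(recommended)
    appropriate = set(appropriate)
    necessary = set(necessary)
    # Assumed conventions: no recommendations gives precision 0,
    # and nothing necessary gives recall 1.
    precision = len(recommended & appropriate) / len(recommended) if recommended else 0.0
    recall = len(recommended & necessary) / len(necessary) if necessary else 1.0
    return precision, recall


# Illustrative case: identifiers are made up for the example.
precision, recall = precision_recall(
    recommended={"ecg", "troponin", "ct_head"},
    appropriate={"ecg", "troponin", "aspirin", "chest_xray"},
    necessary={"ecg", "troponin", "aspirin"},
)
```

Here the model makes one inappropriate recommendation and misses one necessary one, so both values come out at two thirds.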

Escalation Rate

Frequency with which the AI appropriately identifies cases requiring urgent specialist attention or emergency intervention.

Case Harm Rate

The proportion of evaluated cases where the AI response would introduce clinically significant harm if followed.

Number Needed to Harm

The number of AI consultations needed before one case of clinically significant harm occurs. Higher values indicate safer systems.
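Taking the two definitions above at face value, Number Needed to Harm is the reciprocal of Case Harm Rate. The sketch below follows that reading; the function names are hypothetical, and returning infinity when no harmful case is observed is a convention chosen for the example.

```python
import math


def case_harm_rate(harmful_cases, total_cases):
    """Proportion of evaluated cases with clinically significant harm."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return harmful_cases / total_cases


def number_needed_to_harm(harmful_cases, total_cases):
    """Reciprocal of the case harm rate; infinite if no harm was observed."""
    rate = case_harm_rate(harmful_cases, total_cases)
    return math.inf if rate == 0 else 1 / rate
```

For example, 4 harmful cases out of 100 give a harm rate of 0.04 and a Number Needed to Harm of 25.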

Expert Annotation Process

Every MAST benchmark is anchored to an expert reference standard authored by board-certified clinicians. First Do NOHARM and NOHARM v1 use specialist-graded option-level rubrics with a WHO-aligned five-level harm taxonomy (no harm, mild, moderate, severe, death) covering both errors of commission (harmful recommendations given) and errors of omission (critical recommendations missed). CPC-Bench uses the expert differential and management plans published with each NEJM Clinicopathological Conference. Script Concordance items use aggregated Likert distributions from expert physician panels. HealthBench uses 48,562 rubric criteria authored by 262 physicians across 60 countries. MedAgentBench v2 uses gold FHIR task completions verified against a live FHIR server. Multimodal Derm uses dermatologist-labeled images with Fitzpatrick skin-tone annotations, and ReXrank uses reference radiology reports from three independent chest X-ray cohorts.

Scoring mechanisms are matched to each benchmark's reference standard. Deterministic rule-based scoring is used where a definitive answer exists: Script Concordance (alignment of model Likert ratings with expert panel distributions), NOHARM v1 (match of Appropriate/Inappropriate grades to specialist rubrics), MedAgentBench (pass/fail on FHIR task execution, Wilson confidence intervals), CPC-Bench QA/VQA (multiple-choice accuracy), and Multimodal Derm (balanced accuracy plus an explicit Fitzpatrick equity gap). Free-text benchmarks are scored by multi-LLM judge panels with cluster-bootstrap confidence intervals (First Do NOHARM), and reference-report metrics are used for radiology (ReXrank: RadCliQ, GREEN, RaTEScore, RadGraph, BLEU, BERTScore).
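For pass/fail outcomes such as FHIR task execution, a Wilson confidence interval can be computed directly from the success count. The sketch below uses the standard Wilson score formula; the function name and the default z value of 1.96 (a 95% interval) are choices made for this example, not details taken from the MAST code.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative numbers: 240 passes out of 300 tasks.
low, high = wilson_interval(240, 300)
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the pass rate is close to 0 or 1.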

For the harm-avoidance benchmarks specifically, inter-rater reliability is quantified with Cohen's kappa for categorical grades and intraclass correlation coefficients for severity ratings, with minimum thresholds required for inclusion in final scoring. Annotators assess both errors of commission and errors of omission, so models are evaluated not only on what they say but on what they fail to say.
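Cohen's kappa for two raters' categorical grades can be computed as in the sketch below. The Appropriate/Inappropriate labels follow the grades named earlier; the function name and the example ratings are illustrative, and the inclusion thresholds themselves are not specified in the source.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative grades from two annotators on four options.
a = ["Appropriate", "Appropriate", "Inappropriate", "Inappropriate"]
b = ["Appropriate", "Inappropriate", "Inappropriate", "Inappropriate"]
kappa = cohens_kappa(a, b)
```

The raters agree on three of four options (0.75 observed agreement), but half of that agreement is expected by chance, so kappa is 0.5.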

Human Baselines

To put AI performance in context, MAST establishes human physician baselines on the same evaluations using the same scoring panels. On First Do NOHARM v2, board-certified generalist physicians respond to the identical eConsult cases presented to AI systems, and their free-text management plans are judged by the same multi-LLM panel under the same rubrics; the resulting MASTERI human arm sets a concrete generalist-physician floor on harm, completeness, restraint, and triage. On the Script Concordance Test, the benchmark is intrinsically human-anchored: items carry aggregated expert panel distributions, and each model response is scored by how closely it tracks the expert consensus. On CPC-Bench, the NEJM expert differential and management plan serve as the reference for diagnostic and therapeutic reasoning. On HealthBench, the 262-physician rubric authoring effort establishes the ceiling against which conversational responses are scored.
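One common way to score a Script Concordance item against a panel distribution is the aggregate method: a response earns credit equal to the number of panelists who chose it divided by the number who chose the modal answer. Whether MAST uses exactly this rule is an assumption; the sketch only illustrates the idea of tracking expert consensus, and the function name and panel counts are made up.

```python
def sct_item_credit(model_rating, panel_counts):
    """Aggregate-method credit for one Script Concordance item.

    panel_counts maps each Likert rating (assumed here: -2..+2) to the
    number of expert panelists who selected it.
    """
    modal_count = max(panel_counts.values())
    if modal_count == 0:
        raise ValueError("panel distribution is empty")
    return panel_counts.get(model_rating, 0) / modal_count


# Illustrative panel of 10 experts.
panel = {-2: 0, -1: 1, 0: 3, 1: 6, 2: 0}
credits = {rating: sct_item_credit(rating, panel) for rating in panel}
```

The modal answer (+1) earns full credit, a minority answer such as 0 earns partial credit (0.5), and an answer no panelist chose earns none.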

These human baselines serve as critical reference points. AI systems that consistently outperform generalist physicians on harm avoidance while matching expert panels on reasoning, imaging, and agentic tasks may offer genuine clinical value as decision-support tools. Systems that outperform on reasoning alone but underperform on harm avoidance are exactly the pattern MAST is designed to detect.

What We Don't Measure

MAST is intentionally scoped to clinician-facing decision support, and there are whole classes of clinical competence it does not attempt to measure. MAST does not evaluate hands-on procedural or surgical skill, physical examination technique, or any ability that requires a physician to be in the room with a patient. It does not score bedside manner, voice or speech-based patient interactions, nonverbal communication, or empathy in live dialogue. It does not measure longitudinal care over months or years, long-term treatment adherence, or outcomes derived from prospective patient follow-up. It does not evaluate regulatory, compliance, billing, coding, or reimbursement decisions. And it does not evaluate general-purpose reasoning, coding, or productivity tasks outside the clinical workflow.

MAST also scores each model under a fixed prompt and sampling configuration. It does not attempt to find the best possible prompt, chain-of-thought scaffold, retrieval pipeline, or agent harness for any given model. Performance with custom prompting, retrieval augmentation, or specialist fine-tuning may differ from the scores reported here. Real-world operational factors such as latency at scale, cost per query, and integration with electronic health record systems beyond the FHIR server used by MedAgentBench are also out of scope for the composite.

Contamination Prevention

To protect benchmark integrity, MAST prevents contamination by gating access to its data. Benchmark cases are held in a secure, access-controlled repository and are not published in any publicly crawlable format. Model providers must submit their systems for evaluation through our controlled pipeline rather than accessing test data directly.

We additionally perform retrospective contamination checks by comparing model outputs against training data disclosures and testing for memorization patterns. Cases showing evidence of contamination are flagged and excluded from scoring. The benchmark dataset is periodically refreshed with new cases to maintain evaluation validity over time.

Evaluation Pipeline

  1. Item selection

    Cases drawn from expert-curated sources: NEJM CPCs, de-identified eConsults, physician rubrics, FHIR scripts, dermatologist-labeled images, and reference chest radiograph cohorts.

  2. Standardized prompting

    Each benchmark has a fixed prompt template and sampling configuration versioned in the repo, so every model is asked the same question in the same way.

  3. Inference

    Model responses collected with retries and concurrent execution. Only essential fields are persisted; raw case text stays private to prevent redistribution.

  4. Scoring

    Deterministic rule-based scoring where a definitive answer exists; free-text responses are scored by multi-LLM judge panels with cluster-bootstrap CIs.

  5. Per-benchmark metrics

    Scores written in a uniform row format (category, metric, trials, mean, confidence interval) so every benchmark plugs into the composite the same way.

  6. Aggregation

    A compile step merges all per-benchmark CSVs, rolls up parent benchmarks as arithmetic means, and computes the MAST composite as a weighted harmonic mean.

  7. Publication

    Composite and per-benchmark results published on the public leaderboard with confidence intervals, per-specialty breakdowns, and links to raw artifacts.
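The aggregation step in the pipeline above can be sketched as follows. The CSV column names (`benchmark`, `is_primary`, `ci_low`, `ci_high`), the child-to-parent mapping, and the function name are assumptions made for illustration; only the row fields (category, metric, trials, mean, confidence interval), the parent roll-up rule, and the weights come from the text.

```python
import csv
import io
from statistics import mean

WEIGHTS = {
    "First Do NOHARM": 0.10,
    "Script Concordance": 0.10,
    "CPC-Bench": 0.10,
    "Multimodal": 0.20,
    "MedAgentBench": 0.10,
}
# Hypothetical mapping from child benchmarks to their parent.
PARENTS = {"ReXrank": "Multimodal", "Multimodal Derm": "Multimodal"}


def compile_composite(csv_texts, weights=WEIGHTS, parents=PARENTS):
    """Merge per-benchmark CSVs, roll up parents, return (composite, scores)."""
    children = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            if row["is_primary"] == "1":  # assumed flag for the primary metric
                parent = parents.get(row["benchmark"], row["benchmark"])
                children.setdefault(parent, []).append(float(row["mean"]))
    # Parent score = arithmetic mean of its children's primary metrics.
    scores = {name: mean(values) for name, values in children.items()}
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no primary metric for: {sorted(missing)}")
    if any(scores[name] <= 0 for name in weights):
        return 0.0, scores
    total = sum(weights.values())
    composite = total / sum(w / scores[name] for name, w in weights.items())
    return composite, scores
```

Benchmarks that appear in the CSVs but not in the weight table, such as NOHARM v1 or HealthBench, would still show up in the returned per-benchmark scores without affecting the composite.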

Methodology and Weighting

The MAST composite is a weighted harmonic mean of five primary benchmark scores. The current weights are: harm avoidance (First Do NOHARM v2) 0.10, probabilistic clinical reasoning (Script Concordance Test) 0.10, diagnostic reasoning and information gathering (CPC-Bench) 0.10, multimodal image interpretation (ReXrank radiology plus Multimodal Derm) 0.20, and agentic clinical task completion (MedAgentBench v2) 0.10. Weights reflect the relative clinical risk and breadth of each dimension and are reviewed by the MAST Steering Committee on a scheduled cycle.

Two choices distinguish MAST from benchmarks that publish a single arithmetic average. First, the harmonic mean penalizes weak performance on any single axis: a model cannot mask a safety failure with strength on another axis. Second, parent benchmarks (CPC-Bench, Multimodal) are themselves rolled up as the arithmetic mean of their children's primary metrics before entering the composite, so a model is not rewarded for dominating one imaging dataset while failing others. Additional benchmarks such as NOHARM v1 and HealthBench are reported alongside the composite but are not currently weighted into it; new benchmarks are considered on a rolling cycle and added only after validation against expert consensus.
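Written out, the two-level roll-up described above takes roughly the following form. The symbols are introduced here for illustration: s_c is the primary metric of child benchmark c, C_p is the set of children of parent p, and w_i and s_i are the weight and score of weighted benchmark i. Normalizing by the sum of the weights is an assumption, made because the listed weights sum to 0.6 rather than 1.

```latex
s_p = \frac{1}{|C_p|} \sum_{c \in C_p} s_c ,
\qquad
\mathrm{MAST} = \frac{\sum_{i} w_i}{\sum_{i} \dfrac{w_i}{s_i}}
```

Because each score appears in a denominator, the composite tends toward zero as any single s_i tends toward zero, regardless of how high the other scores are.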