ARISE

MAST

Methodology

MAST is a multi-benchmark suite measuring whether AI systems can safely and effectively support clinical decisions. We combine robust, realistic clinical benchmarks across key dimensions and maintain both open-source and private evaluation sets, providing a central resource for independent academic evaluation of clinical AI.

Executive Summary

MAST is a systematic evaluation framework to measure the performance of AI systems in healthcare. We curate and combine robust benchmarks across several critical clinical dimensions. The benchmarks selected for each dimension are generally based on real patient data or highly validated clinical frameworks, and include both multimodal and agentic tasks. We actively develop human physician baselines on these evaluations whenever possible, to provide a meaningful reference for AI readiness. See our Benchmarks page for details on each evaluation, and the technical leaderboard to explore the full dataset.

Evaluation Pipeline

  1. Benchmark selection

    We curate validated benchmarks that together cover the clinical dimensions essential to safe, effective decision support.

  2. Inference

    Each model is run across all benchmarks with retries and concurrent execution; only essential fields are stored, and raw case text stays private.

    Standardized prompting: head-to-head comparisons use standardized, preferably minimal prompts so every model is asked the same way; sensitivity analyses additionally apply prompt engineering to measure how much prompting shifts results.

  3. Scoring & judging

    Responses are scored against each benchmark's reference standard: deterministic rule-based scoring where a definitive answer exists, and multi-LLM judge panels with cluster-bootstrap confidence intervals for free-text responses.

  4. Distillation

    We extract cross-benchmark themes that no single benchmark captures alone, such as diagnostic reasoning measured across multiple benchmarks or performance within a specific specialty, to surface meta-level patterns in clinical competence.

  5. Aggregation & composite

    Benchmark and theme scores roll up into clinical dimensions and into the MAST General and Clinical Composites. See How We Evaluate for the full method.

  6. Publication

    Composite and per-benchmark results are published on the leaderboard.
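
The cluster-bootstrap confidence intervals named in the scoring step can be sketched roughly as follows. This is an illustration under stated assumptions, not MAST's implementation: it assumes each case is one cluster holding a score from every judge, that the statistic is the mean judge score, and that a percentile interval is reported. The function name, the 2,000 resamples, and the toy scores are all hypothetical.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(scores_by_case, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean judge score, resampling whole cases.

    scores_by_case maps a case ID to the list of scores its judges gave.
    Resampling cases (not individual scores) keeps the correlated judge
    scores for one case together.
    """
    rng = random.Random(seed)
    case_ids = list(scores_by_case)
    estimates = []
    for _ in range(n_resamples):
        # Draw as many cases as the original sample, with replacement.
        drawn = rng.choices(case_ids, k=len(case_ids))
        pooled = [s for cid in drawn for s in scores_by_case[cid]]
        estimates.append(mean(pooled))
    estimates.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Toy data: three hypothetical judges scoring four hypothetical cases on 0-1.
toy_scores = {
    "case-a": [1.0, 1.0, 0.5],
    "case-b": [0.0, 0.5, 0.0],
    "case-c": [1.0, 0.5, 1.0],
    "case-d": [0.5, 0.5, 1.0],
}
point = mean(s for scores in toy_scores.values() for s in scores)
low, high = cluster_bootstrap_ci(toy_scores)
print(f"mean={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Resampling at the case level, rather than treating each judge score as independent, generally gives wider and more honest intervals when judges tend to agree on the same case.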

How We Evaluate

MAST composite scores are currently equal-weight harmonic means of the clinical dimensions, such that poor performance in any single dimension disproportionately lowers the overall score and cannot be masked by strong performance elsewhere.
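
As a worked illustration of that penalty, consider three made-up dimension scores of 0.9, 0.9, and 0.3. The symbols $s_i$ (dimension score), $n$ (number of dimensions), and $H$ (composite) are notation introduced here, not taken from the MAST manuscripts.

```latex
\[
H = \frac{n}{\sum_{i=1}^{n} 1/s_i},
\qquad
H(0.9,\,0.9,\,0.3)
  = \frac{3}{\tfrac{10}{9} + \tfrac{10}{9} + \tfrac{10}{3}}
  = \frac{3}{50/9}
  = 0.54
\]
```

The arithmetic mean of the same three scores is 0.70, so the single weak dimension pulls the harmonic mean down by a further 0.16.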

The MAST General Composite integrates Diagnostic, Management, Safety, Radiology, and Medical Images; the MAST Clinical Composite integrates Diagnostic, Management, and Safety, without imaging dimensions. Each dimension likewise integrates its constituent benchmarks, detailed below. Where the same benchmark result feeds more than one dimension, it is counted only once, through the dimension where it weighs most, so overlapping membership never double-counts a result.
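
A minimal sketch of the two composites, assuming dimension scores are already available on a 0-1 scale. The scores below are invented for illustration and are not leaderboard results; the `composite` helper is hypothetical, and neither the roll-up from benchmarks to dimensions nor the count-once rule for shared benchmarks is modeled here.

```python
from statistics import harmonic_mean

# Dimension membership as listed in the text above.
GENERAL_DIMENSIONS = ["Diagnostic", "Management", "Safety", "Radiology", "Medical Images"]
CLINICAL_DIMENSIONS = ["Diagnostic", "Management", "Safety"]


def composite(dimension_scores, dimensions):
    """Equal-weight harmonic mean over the named dimensions."""
    missing = [d for d in dimensions if d not in dimension_scores]
    if missing:
        raise KeyError(f"missing dimension scores: {missing}")
    return harmonic_mean([dimension_scores[d] for d in dimensions])


# Hypothetical dimension scores, not real results.
example = {
    "Diagnostic": 0.80,
    "Management": 0.75,
    "Safety": 0.40,
    "Radiology": 0.70,
    "Medical Images": 0.65,
}
general = composite(example, GENERAL_DIMENSIONS)
clinical = composite(example, CLINICAL_DIMENSIONS)
print(f"General: {general:.3f}, Clinical: {clinical:.3f}")
```

In this toy case the low Safety score holds both composites below the arithmetic mean of their dimensions, which is the behavior the harmonic mean is chosen for.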

Agentic task completion is not currently part of the MAST composites. It can be viewed on the Technical Leaderboard as an emerging capability that may be incorporated into future composites.

Any composite or weighting scheme necessarily oversimplifies clinical performance, so we present all underlying scores transparently on the MAST Technical Leaderboard. We are also conducting a study with both consumers and clinicians to derive the weights empirically; stay tuned if you would like to participate.

Clinical Dimensions

MAST scores six clinical-competency dimensions, each drawn from one or more benchmarks. The MAST General Composite combines all but Agentic Task Completion; the MAST Clinical Composite uses only the three non-imaging dimensions: Diagnostic Reasoning, Management Reasoning, and Safety.

Diagnostic Reasoning

Reaching the correct diagnosis and differential from a clinical presentation.

Benchmarks: Script Concordance Test (diagnostic), CPC-Bench (DDx, QA)

Management Reasoning

Recommending appropriate workup, treatment, and follow-up once the picture is established.

Benchmarks: Script Concordance Test (management), CPC-Bench (management), First Do NOHARM v2

Safety

Avoiding recommendations that would cause clinically significant harm, including harmful actions and critical omissions.

Benchmarks: First Do NOHARM v2

Radiology

Interpreting medical imaging and generating accurate radiology findings.

Benchmarks: ReXrank

Medical Images

Correctly interpreting clinical images such as skin lesions.

Benchmarks: DDI, MIDAS

Agentic Task Completion

Completing multi-step clinical tasks inside real systems (e.g., reading and writing data via FHIR).

Benchmarks: MedAgentBench v2, PhysicianBench (radar-only; not in MAST composites)

Detailed scoring for each benchmark is described in its manuscript.
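
The dimension-to-benchmark membership listed above can be written as a small lookup table, which makes it easy to see which benchmarks feed more than one dimension. This is only an illustrative data structure: the variable names are hypothetical, and task subsets such as "(diagnostic)" or "(DDx, QA)" are collapsed to the benchmark name.

```python
from collections import defaultdict

# Dimension -> benchmarks, transcribed from the list above.
# Task subsets in parentheses are dropped for simplicity.
DIMENSION_BENCHMARKS = {
    "Diagnostic Reasoning": ["Script Concordance Test", "CPC-Bench"],
    "Management Reasoning": ["Script Concordance Test", "CPC-Bench", "First Do NOHARM v2"],
    "Safety": ["First Do NOHARM v2"],
    "Radiology": ["ReXrank"],
    "Medical Images": ["DDI", "MIDAS"],
    "Agentic Task Completion": ["MedAgentBench v2", "PhysicianBench"],
}

# Invert the table: benchmark -> dimensions it contributes to.
benchmark_dimensions = defaultdict(list)
for dimension, benchmarks in DIMENSION_BENCHMARKS.items():
    for benchmark in benchmarks:
        benchmark_dimensions[benchmark].append(dimension)

# Benchmarks that inform more than one dimension.
shared = {b: dims for b, dims in benchmark_dimensions.items() if len(dims) > 1}
for benchmark, dims in shared.items():
    print(f"{benchmark}: {', '.join(dims)}")
```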

Human Baselines

To put AI performance in context, MAST develops human physician baselines through our MASTERI platform. These baselines are under active development and, in some cases, established on a per-study basis; some benchmarks, such as agentic tasks, may not have a meaningful human comparison.

Beyond ranking models, these baselines let us identify when systems already exceed average human physician performance and measure human-AI synergy, for example through our PACT research program. They also surface the pattern MAST is built to detect: models that reason well but underperform physicians on harm avoidance.

What We Don't Measure

MAST is currently scoped to the clinical tasks of physicians and does not yet measure regulatory, billing, scribing, robotics, or other workflow tasks.

For the leaderboard, each model is scored under a fixed, standardized prompt rather than one tuned per model, so performance with custom prompting, retrieval augmentation, or fine-tuning may differ from the scores shown here. Operational factors such as latency at scale, cost per query, and EHR integration beyond the FHIR environment used for agentic tasks are also out of scope for the composite. Individual benchmark manuscripts may include additional sensitivity analyses along these dimensions, including prompt engineering, context engineering, and tool use; see our manuscripts for details.

Contamination Prevention

To protect benchmark integrity, MAST balances openness with contamination prevention. For most benchmarks, at least 20% of the evaluation set is open-sourced, while a private held-out set is maintained where possible. Held-out cases are kept in a secure, access-controlled environment, and model providers submit through a controlled pipeline rather than accessing the data directly.
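
One way such an open/private split could be made reproducible is sketched below. This is purely an assumption for illustration; the text does not describe how MAST actually selects its open cases. The function name, the salt, and the case IDs are hypothetical. Cases are ranked by a salted hash of their ID and the first 20% (rounded up) are released, so at least the stated fraction is always open.

```python
import hashlib
import math


def split_open_private(case_ids, open_fraction=0.20, salt="hypothetical-salt"):
    """Deterministically mark at least open_fraction of cases as open."""

    def key(case_id):
        # A salted hash gives a stable, ID-order-independent ranking.
        return hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).hexdigest()

    ranked = sorted(case_ids, key=key)
    n_open = math.ceil(open_fraction * len(ranked))  # round up: "at least"
    return ranked[:n_open], ranked[n_open:]


# Hypothetical case IDs.
case_ids = [f"case-{i:03d}" for i in range(23)]
open_set, private_set = split_open_private(case_ids)
print(len(open_set), len(private_set))
```

Because the assignment depends only on the case ID and the salt, rerunning the split never moves a held-out case into the open set.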

Custom Models and Clinical Products

A growing number of clinical AI tools and products are being launched by industry. We partner with these organizations to evaluate their models through API access and include them on the leaderboard. If you would like your model evaluated, see Submit a Model.

Highly notable, widely used tools may also be evaluated through manual testing when their organizations do not provide API access; these entries are flagged accordingly in our results.

Governance

The dimension structure and membership are maintained by the MAST Steering Committee on a scheduled cycle. Dimensions are intentionally cross-benchmark, so a single benchmark can inform more than one dimension: First Do NOHARM v2 contributes to both Safety and Management, and both CPC-Bench and the Script Concordance Test span diagnostic and management tasks. New benchmarks are considered on a rolling cycle and added only after validation against expert consensus.