MAST

Methodology

MAST employs a rigorous, multi-stage evaluation methodology grounded in established medical education and assessment frameworks. Each benchmark undergoes expert validation, adversarial testing, and statistical calibration to ensure that results are reproducible, clinically meaningful, and resistant to shortcut learning.

Executive Summary

MAST (Medical AI Superintelligent Testing) is a comprehensive evaluation framework designed to measure whether AI systems can safely and accurately support clinical decision-making. Unlike traditional medical AI benchmarks that focus narrowly on diagnostic accuracy, MAST evaluates the full spectrum of clinical competence — from harm avoidance and therapeutic appropriateness to probabilistic reasoning under uncertainty.

Built on real-world clinical data from over 16,000 eConsults across 10 medical specialties, MAST uses expert physician annotations and a WHO-aligned harm taxonomy to quantify the safety profile of AI-generated medical recommendations. The framework establishes human physician baselines and tests multiple AI agent configurations to provide a rigorous, clinically meaningful assessment of medical AI readiness.

How We Evaluate

MAST evaluations are grounded in the electronic consultation (eConsult) format — a real-world clinical workflow where primary care physicians seek specialist guidance for patient cases. We curate 100 representative cases drawn from a corpus of 16,399 de-identified eConsults spanning 10 medical specialties: cardiology, dermatology, endocrinology, gastroenterology, hematology, infectious disease, nephrology, neurology, pulmonology, and rheumatology.

Each case is presented to AI systems in three distinct agent configurations: Solo (single model generating recommendations independently), Advisor (model augmented with retrieval-based clinical references), and Guardian (multi-agent setup with a safety-checking layer). This multi-configuration approach reveals how architectural choices affect clinical safety and accuracy.
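To make the three configurations concrete, here is a minimal sketch of how the dispatch might look. It is illustrative only: the class names (SoloAgent, AdvisorAgent, GuardianAgent) and the model, retrieval, and safety-check interfaces are hypothetical stand-ins, not MAST's actual implementation.

```python
# Illustrative sketch of the three MAST agent configurations.
# All names and interfaces below are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SoloAgent:
    """Single model generating recommendations independently."""
    model: Callable[[str], str]  # case text -> recommendation text

    def respond(self, case: str) -> str:
        return self.model(case)


@dataclass
class AdvisorAgent:
    """Model augmented with retrieval-based clinical references."""
    model: Callable[[str], str]
    retrieve: Callable[[str], List[str]]  # case text -> reference passages

    def respond(self, case: str) -> str:
        references = "\n".join(self.retrieve(case))
        return self.model(f"{case}\n\nRelevant references:\n{references}")


@dataclass
class GuardianAgent:
    """Multi-agent setup with a safety-checking layer."""
    model: Callable[[str], str]
    safety_check: Callable[[str, str], str]  # (case, draft) -> revised answer

    def respond(self, case: str) -> str:
        draft = self.model(case)
        return self.safety_check(case, draft)
```

The design point the sketch captures is that all three configurations expose the same respond(case) interface, so the rest of the evaluation pipeline can treat them interchangeably.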

Model responses are evaluated by board-certified specialists using standardized rubrics that assess both the quality of clinical recommendations and the potential for patient harm.

Our Metrics

MAST employs a comprehensive set of metrics designed to capture distinct dimensions of clinical AI performance. Each metric is defined with a precise formula and clinical rationale.

Safety Score

Overall measure of harm avoidance across all evaluated cases. Represents the proportion of recommendations that do not introduce clinically significant harm.

Completeness

Measures whether the AI response addresses all clinically relevant aspects of the consultation, including differential diagnosis, workup, and management plan.

Restraint

Quantifies the model's tendency to avoid overstepping its scope — not recommending unnecessary tests, treatments, or referrals beyond what is clinically indicated.

Precision

The proportion of AI-generated recommendations that are clinically correct and appropriate for the given case.

Recall

The proportion of clinically necessary recommendations that the AI system successfully identifies and includes.

Escalation Rate

Frequency at which the AI appropriately identifies cases requiring urgent specialist attention or emergency intervention.

Case Harm Rate

The proportion of evaluated cases where the AI response would introduce clinically significant harm if followed.

Number Needed to Harm

The expected number of AI consultations before one case of clinically significant harm occurs, computed as the reciprocal of the case harm rate. Higher values indicate safer systems.
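
The metric definitions above reduce to simple ratios over annotated cases. The sketch below shows one plausible computation, assuming a hypothetical CaseAnnotation record; MAST's actual schemas and formulas are published with each benchmark release.

```python
# Hedged sketch of the headline MAST metrics as ratios over annotations.
# The CaseAnnotation fields are assumptions made for illustration.
from dataclasses import dataclass
from typing import List


@dataclass
class CaseAnnotation:
    harmful: bool          # would following the response cause significant harm?
    correct_recs: int      # AI recommendations judged clinically appropriate
    total_recs: int        # all recommendations the AI made
    necessary_found: int   # clinically necessary recommendations the AI included
    necessary_total: int   # all clinically necessary recommendations


def safety_score(cases: List[CaseAnnotation]) -> float:
    """Proportion of cases whose recommendations introduce no significant harm."""
    return sum(not c.harmful for c in cases) / len(cases)


def precision(cases: List[CaseAnnotation]) -> float:
    """Fraction of AI-generated recommendations that are correct and appropriate."""
    return sum(c.correct_recs for c in cases) / sum(c.total_recs for c in cases)


def recall(cases: List[CaseAnnotation]) -> float:
    """Fraction of clinically necessary recommendations the AI included."""
    return sum(c.necessary_found for c in cases) / sum(c.necessary_total for c in cases)


def number_needed_to_harm(cases: List[CaseAnnotation]) -> float:
    """Reciprocal of the case harm rate: consultations per one harmful case."""
    harm_rate = sum(c.harmful for c in cases) / len(cases)
    return float("inf") if harm_rate == 0 else 1.0 / harm_rate
```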

Expert Annotation Process

Our evaluation dataset includes 12,747 specialist annotations collected from board-certified physicians across all 10 specialties. Each AI-generated recommendation is reviewed by at least two independent specialists, with a third adjudicator resolving disagreements.

Inter-rater reliability is measured using Cohen's kappa and intraclass correlation coefficients, with minimum thresholds required for inclusion in final scoring. Annotations follow a WHO-aligned harm taxonomy with five severity levels: no harm, mild (temporary discomfort), moderate (prolonged recovery), severe (life-threatening), and death.
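
As a concrete illustration of the reliability gate, the snippet below computes Cohen's kappa for two annotators over the five-level harm taxonomy using scikit-learn. The 0.6 cutoff is a common "substantial agreement" convention shown here as an assumption; the thresholds MAST actually enforces are not restated in this section.

```python
# Inter-rater agreement check on the five-level harm taxonomy.
# The 0.6 cutoff is a conventional threshold, used here as an assumption.
from sklearn.metrics import cohen_kappa_score

SEVERITY = ["no harm", "mild", "moderate", "severe", "death"]

rater_a = ["no harm", "mild", "mild", "severe", "no harm", "moderate"]
rater_b = ["no harm", "mild", "moderate", "severe", "no harm", "moderate"]

kappa = cohen_kappa_score(rater_a, rater_b, labels=SEVERITY)
print(f"Cohen's kappa: {kappa:.2f}")

if kappa < 0.6:  # hypothetical inclusion threshold
    print("Agreement below threshold: route disagreements to the adjudicator.")
```

Because the taxonomy is ordinal, a weighted kappa (weights="linear" in the same function) would penalize adjacent-level disagreements less than distant ones; whether MAST uses the weighted variant is not specified here.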

Annotators assess both errors of commission (harmful recommendations actively given) and errors of omission (critical recommendations that were missed). This dual assessment ensures that models are evaluated not just on what they say, but on what they fail to say.

Human Baselines

To contextualize AI performance, MAST establishes human physician baselines using the same evaluation framework. Generalist physicians (non-specialists) achieve a baseline safety score of 46.0% on the NOHARM benchmark, reflecting the inherent difficulty of specialist-level clinical reasoning.

This baseline was established by having board-certified primary care physicians respond to the same eConsult cases presented to AI systems, with their responses evaluated by the same specialist panels using identical rubrics. The generalist baseline serves as a critical reference point: AI systems that consistently outperform generalist physicians on safety metrics may offer genuine clinical value as decision-support tools.

What We Don't Measure

MAST is designed to be transparent about its scope and limitations. The current benchmark suite does not evaluate the following areas, which represent important directions for future work:

  • Medical imaging interpretation (radiology, pathology slides)
  • Longitudinal care management and follow-up planning
  • Pediatric-specific clinical scenarios
  • Non-English clinical communication
  • Patient-facing communication and shared decision-making
  • Emergency and time-critical clinical decision-making
  • Surgical planning and procedural guidance

Contamination Prevention

To ensure benchmark integrity, MAST implements a data gating approach to contamination prevention. Benchmark cases are held in a secure, access-controlled repository and are not published in any publicly crawlable format. Model providers must submit their systems for evaluation through our controlled pipeline rather than accessing test data directly.

We additionally perform retrospective contamination checks by comparing model outputs against training data disclosures and testing for memorization patterns. Cases showing evidence of contamination are flagged and excluded from scoring. The benchmark dataset is periodically refreshed with new cases to maintain evaluation validity over time.
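
One simple form such a memorization test can take is checking whether a model reproduces long verbatim spans of held-out case text. The sketch below uses a 13-token window, a common convention in contamination studies; both the window size and any flagging cutoff are assumptions, not documented MAST parameters.

```python
# Illustrative memorization probe: measure verbatim n-gram overlap between a
# held-out case and a model's output. Window size and cutoff are assumptions.
def ngram_set(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(case_text: str, model_output: str, n: int = 13) -> float:
    """Fraction of the case's n-grams reproduced verbatim in the output."""
    case_grams = ngram_set(case_text, n)
    if not case_grams:
        return 0.0
    return len(case_grams & ngram_set(model_output, n)) / len(case_grams)


# Usage: a high overlap on text the model was never shown suggests the case
# leaked into training data; such cases would be flagged and excluded.
```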

Evaluation Pipeline

  1. Case: A real eConsult clinical case is selected from the curated corpus.
  2. Model Response: The AI system generates clinical recommendations.
  3. Expert Annotation: Board-certified specialists review and score the response.
  4. Harm Scoring: The WHO-aligned taxonomy is applied to grade harm severity.
  5. Metrics: Safety, completeness, and accuracy scores are computed.
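
Read end to end, the five stages compose into a single evaluation function. The sketch below is a hypothetical skeleton, with every interface a placeholder; the real pipeline runs inside MAST's controlled, access-restricted environment.

```python
# Hypothetical skeleton of the five-stage pipeline; all interfaces are
# placeholders, not MAST's actual infrastructure.
from typing import Callable, Dict, List

def evaluate_case(
    case: str,                                   # 1. Case
    agent: Callable[[str], str],                 # 2. Model Response
    annotate: Callable[[str, str], List[dict]],  # 3. Expert Annotation
    grade_harm: Callable[[List[dict]], int],     # 4. Harm Scoring
) -> Dict[str, float]:
    response = agent(case)
    reviews = annotate(case, response)
    severity = grade_harm(reviews)  # 0 (no harm) .. 4 (death)
    # Treating "moderate or worse" as clinically significant is an
    # assumption made for illustration only.
    return {"severity": float(severity), "harmful": float(severity >= 2)}  # 5. Metrics
```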

Governance Overview

The MAST benchmark is governed by an independent academic structure designed to ensure rigorous, unbiased evaluation of medical AI systems. Our governance model separates strategic oversight from day-to-day operations, with clear accountability at every level.

All governance decisions — including benchmark selection, scoring methodology, and publication policy — are made through a transparent, consensus-driven process. No single institution or individual holds unilateral authority over MAST outcomes.

Steering Committee

The MAST Steering Committee provides strategic direction and ensures the benchmark maintains the highest standards of scientific rigor and clinical relevance. Committee members are drawn from leading academic medical centers and represent diverse clinical specialties.

The committee meets quarterly to review benchmark performance, approve new evaluation domains, and address any methodological concerns. All committee deliberations are documented and key decisions are published in our transparency reports.

Committee membership is rotated on a three-year cycle to ensure fresh perspectives and prevent institutional capture. Members must disclose all potential conflicts of interest upon appointment and annually thereafter.

Team Members

The MAST operational team is composed of clinician-scientists, AI researchers, biostatisticians, and medical educators from institutions across the ARISE Network. Each team member brings specialized expertise essential to maintaining evaluation quality.

Our annotation workforce consists of board-certified physicians across 10 medical specialties who undergo standardized training on our scoring rubrics before participating in evaluations. All annotators are credentialed and their qualifications are verified independently.

Technical infrastructure is maintained by a dedicated engineering team responsible for the evaluation pipeline, data security, and leaderboard operations.

Methodology and Weighting

MAST composite scores are calculated using a weighted aggregation across constituent benchmarks. Weights are determined by the Steering Committee based on clinical importance, task diversity, and methodological maturity of each benchmark component.

The current weighting framework prioritizes patient safety metrics above diagnostic accuracy, reflecting the principle that harm avoidance is the foundational requirement for clinical AI deployment. Specific weight allocations are published alongside each benchmark release.
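
A weighted aggregation of this kind is straightforward to state precisely. The sketch below uses invented weights that merely echo the safety-first principle; the actual allocations are those published with each release.

```python
# Sketch of the weighted composite. The weights here are invented for
# illustration (safety weighted above accuracy); real allocations are
# published alongside each benchmark release.
WEIGHTS = {"safety": 0.50, "completeness": 0.20, "precision": 0.15, "recall": 0.15}


def composite_score(scores: dict) -> float:
    """Weighted average of constituent benchmark scores on a 0-1 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


print(composite_score({"safety": 0.82, "completeness": 0.74,
                       "precision": 0.69, "recall": 0.71}))  # -> 0.768
```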

Weighting decisions are reviewed annually and may be adjusted as new clinical evidence emerges or as the benchmark suite expands. Any changes to the weighting methodology are announced at least 60 days before taking effect, with a public comment period for stakeholder feedback.

Disclosures and Conflicts

All individuals involved in MAST governance, evaluation, and operations are required to disclose any financial relationships, advisory roles, or equity holdings with AI companies whose products are or may be evaluated by the benchmark.

Disclosed conflicts are reviewed by an independent ethics officer. Individuals with material conflicts are recused from scoring, methodology decisions, or publication review related to the conflicted entity. Recusals are documented and reported in our annual transparency disclosures.

MAST does not accept direct funding from AI companies for benchmark development or evaluation operations. Institutional funding sources are disclosed publicly and reviewed annually to ensure no indirect conflicts compromise evaluation integrity.

Funding Sources

Transparency in funding is essential to maintaining public trust in our evaluations. The table below discloses all funding sources that support MAST development and operations.

Source                  Type            Period        Purpose
Stanford Medicine       Institutional   2024–Present  Core research infrastructure and personnel
Harvard Medical School  Institutional   2024–Present  Clinical validation and annotation support
NIH/NIDDK               Federal Grant   2024–2026     Benchmark development and data curation

Conflict of Interest Policy

ARISE maintains strict conflict of interest policies to protect the integrity of MAST evaluations. The following rules apply to all team members, advisors, and collaborators involved in the benchmark process:

  • AI companies cannot fund or sponsor specific benchmark evaluations or influence evaluation scheduling.
  • Team members with financial affiliations to any evaluated AI company must recuse themselves from scoring that company's submissions.
  • All advisory board members must disclose potential conflicts of interest, which are reviewed annually and published on our website.
  • Evaluation rubrics and scoring criteria are locked before any model submission is evaluated and cannot be modified retroactively.
  • External audit of our evaluation process is conducted annually by an independent academic review committee.

Data Use Policy

During evaluation, model providers submit their systems through our controlled API pipeline. MAST does not share benchmark cases with model providers before or after evaluation. All evaluation data is processed in a secure environment, and model outputs are stored only for the duration needed to complete scoring.

De-identified clinical cases used in the benchmark are sourced from existing institutional research datasets with appropriate IRB approvals. No patient-identifiable information is included in any benchmark case. Model providers' API keys and system configurations are handled under standard data protection protocols and are not retained after evaluation completion.

Contact

For questions about our transparency practices, funding disclosures, or conflict of interest policies, please reach out to our transparency team.

transparency@arise-ai.org
