ARISE

MAST

Medical AI Superintelligent Testing

The MAST project seeks to curate the most robust and realistic clinical benchmarks to measure the performance of medical AI.

Leaderboard

NOHARM composite scores across evaluated models

#  | Model                | Developer    | Composite
1  | AMBOSS LiSA 1.0      | AMBOSS       | 66.7
2  | Gemini 2.5 Flash     | Google       | 66.1
3  | Gemini 2.5 Pro       | Google       | 65.4
4  | DeepSeek V3.1 (open) | DeepSeek     | 65.1
5  | Kimi K2 (open)       | Moonshot AI  | 64.7
6  | DeepSeek R1 (open)   | DeepSeek     | 64.4
7  | Grok 4               | xAI          | 64.0
8  | Grok 4 Fast          | xAI          | 63.4
9  | Glass Health 4.0     | Glass Health | 62.7
10 | GPT-5                | OpenAI       | 62.6
11 | Gemini 2.0 Flash     | Google       | 61.8
12 | Claude Sonnet 4.5    | Anthropic    | 61.7
13 | GPT-4o               | OpenAI       | 61.6
14 | Claude 3.7 Sonnet    | Anthropic    | 61.4
15 | GPT-4.1              | OpenAI       | 60.9

Scores are percentages (0–100). Composite is the unweighted mean of all metrics. Based on the NOHARM benchmark evaluation.
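As a minimal sketch, the unweighted-mean composite can be computed as below. The metric names are hypothetical placeholders, not the actual NOHARM rubric:

```python
def composite_score(metrics: dict[str, float]) -> float:
    """Unweighted mean of per-metric percentages (0-100), one decimal."""
    return round(sum(metrics.values()) / len(metrics), 1)

# Hypothetical per-metric scores for one model (illustrative only).
scores = {
    "safety": 70.2,
    "accuracy": 63.0,
    "completeness": 66.9,
}
print(composite_score(scores))  # 66.7
```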

Benchmarks

View All

Currently we support First, Do NOHARM, Script Concordance Test (SCT-Bench), CPC-Bench, and MedAgentBench, with more benchmarks on our roadmap. Widely used industry benchmarks (OpenAI HealthBench) are included for reference but are not weighted in the composite score. See our policies and submission instructions.

First, Do NOHARM

NOHARMv1.0.0
Evaluated on: 2026-01-15

The foundational benchmark of the MAST suite, NOHARM establishes a new framework for assessing clinical safety and accuracy in AI-generated medical recommendations.

Script Concordance Testing (SCT)

Coming Soon
SCTv1.0.0
Evaluated on: 2026-01-10

Challenging cases that measure probabilistic clinical reasoning under uncertainty. The benchmark uses the Script Concordance Testing (SCT) methodology, a format that assesses how clinicians adjust their diagnostic or therapeutic judgments in response to new, uncertain information.

HealthBench

A broad agentic evaluation suite designed to test the capabilities of large language models within the context of medical records. It features 300 clinically derived, patient-specific tasks categorized across 10 areas.

CPC-Bench

Coming Soon
CPCv1.0.0
Evaluated on: 2026-01-05

A comprehensive benchmark that evaluates medical AI across a century of complex clinical cases from the New England Journal of Medicine, including both multimodal (2,505) and text-based (3,364) tasks.

Independent Academic Research

MAST is operated by the ARISE AI Research Network, an independent academic research collaboration. No AI company evaluated in our benchmarks has funding influence, editorial control, or methodological input over our evaluation processes.

Our evaluation schedule, scoring rubrics, and publication timeline are determined by the MAST Steering Committee. Model providers are notified of results only after scoring is finalized, and they have no opportunity to influence or preview findings before publication.

Stanford Medicine
Harvard


People

The researchers and clinicians driving MAST forward.

Acknowledgements

Ethan Goh, Adam Rodman, Jonathan H Chen, Jason Hom, Eric Horvitz, Neera Ahuja, Andrew Parsons, Andrew Olson

Developers & Contributors

Analyze, audit, and contribute to MAST. Explore the methodology, run evaluations, and help improve medical AI safety benchmarks.

Submission guidelines

We welcome benchmark submissions via GitHub. All submissions must include a peer-reviewed or pre-print manuscript, a publicly accessible dataset, and reproducible evaluation code. Results should be generated using the official MAST evaluation harness.

Review process

Submissions are reviewed by the MAST governance committee for clinical relevance, methodological rigor, and reproducibility. Accepted benchmarks are integrated into the composite score on a quarterly release cycle. See our policies and instructions on GitHub before opening a pull request.

Join us in shaping the future of healthcare with AI

Mailing List Signup