MAST
The MAST project seeks to curate the most robust and realistic clinical benchmarks to measure the performance of medical AI.
NOHARM composite scores across evaluated models
| # | Model | Provider | Composite |
|---|---|---|---|
| 1 | AMBOSS LiSA 1.0 | AMBOSS | 66.7 |
| 2 | Gemini 2.5 Flash | Google | 66.1 |
| 3 | Gemini 2.5 Pro | Google | 65.4 |
| 4 | DeepSeek V3.1 (open) | DeepSeek | 65.1 |
| 5 | Kimi K2 (open) | Moonshot AI | 64.7 |
| 6 | DeepSeek R1 (open) | DeepSeek | 64.4 |
| 7 | Grok 4 | xAI | 64.0 |
| 8 | Grok 4 Fast | xAI | 63.4 |
| 9 | Glass Health 4.0 | Glass Health | 62.7 |
| 10 | GPT-5 | OpenAI | 62.6 |
| 11 | Gemini 2.0 Flash | Google | 61.8 |
| 12 | Claude Sonnet 4.5 | Anthropic | 61.7 |
| 13 | GPT-4o | OpenAI | 61.6 |
| 14 | Claude 3.7 Sonnet | Anthropic | 61.4 |
| 15 | GPT-4.1 | OpenAI | 60.9 |

(open) marks openly released (open-weight) models.
Scores are percentages (0–100). Composite is the unweighted mean of all metrics. Based on the NOHARM benchmark evaluation.
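As a quick illustration of how a composite defined this way is computed, the sketch below takes the unweighted mean of a model's per-benchmark scores. The benchmark names match the supported suite described on this page, but the score values are invented for demonstration only:

```python
def composite_score(scores: dict[str, float]) -> float:
    """Composite = unweighted (arithmetic) mean of all benchmark scores (0-100)."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-benchmark scores for one model (values are made up).
example = {
    "First Do NOHARM": 66.0,
    "SCT-Bench": 62.0,
    "CPC-Bench": 64.0,
    "MedAgentBench": 60.0,
    # Reference-only benchmarks (e.g. HealthBench) are excluded from the composite.
}

print(round(composite_score(example), 1))  # 63.0
```

Because the mean is unweighted, each accepted benchmark contributes equally to the composite, regardless of its size or difficulty.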
Currently we support First Do NOHARM, Script Concordance Test (SCT-Bench), CPC-Bench, and MedAgentBench, with more benchmarks on our roadmap. Widely used industry benchmarks (e.g., OpenAI HealthBench) are included for reference but are not weighted in the composite score. See our policies and submission instructions.
MAST is operated by the ARISE AI Research Network, an independent academic research collaboration. No AI company evaluated in our benchmarks has funding influence, editorial control, or methodological input into our evaluation processes.
Our evaluation schedule, scoring rubrics, and publication timeline are determined by the MAST Steering Committee. Model providers are notified of results only after scoring is finalized, and they have no opportunity to influence or preview findings before publication.


The researchers and clinicians driving MAST forward.
Ethan Goh, Adam Rodman, Jonathan H Chen, Jason Hom, Eric Horvitz, Neera Ahuja, Andrew Parsons, Andrew Olson
Analyze, audit, and contribute to MAST. Explore the methodology, run evaluations, and help improve medical AI safety benchmarks.
We welcome benchmark submissions via GitHub. All submissions must include a peer-reviewed or preprint manuscript, a publicly accessible dataset, and reproducible evaluation code. Results should be generated using the official MAST evaluation harness.
Submissions are reviewed by the MAST governance committee for clinical relevance, methodological rigor, and reproducibility. Accepted benchmarks are integrated into the composite score on a quarterly release cycle. See our policies and instructions on GitHub before opening a pull request.