MAST
Welcome to the MAST project. We're a multi-institutional collaboration independently evaluating medical AI. We curate robust, realistic benchmarks in areas including clinical reasoning, safety, and medical images, so you can make informed decisions about the health AI you use.
We measure performance for general medical AI usage as well as clinician-focused usage.
Last updated June 15, 2026.
A clinician-focused ranking including medical-specialized models, scored on diagnostic and management reasoning and safety.
The latest large and small model from each provider, ranked on diagnostic and management reasoning, safety, radiology, and medical images.
| Rank | Model | Composite Score |
|---|---|---|
| 1 | GPT-5.5 | 62.1% |
| 2 | Claude Opus 4.7 | 59.2% |
| 3 | Gemini 3.1 Pro | 58.0% |
| 4 | Gemini 3.5 Flash | 58.0% |
| 5 | Kimi K2.5 | 55.6% |
| 6 | Grok 4 Fast | 54.5% |
| 7 | Grok 4 | 54.2% |
| 8 | GPT-5.4 mini | 54.1% |
How two models compare across all benchmark dimensions.
| Dimension | Gap |
|---|---|
| Diagnostic Reasoning | — |
| Management Reasoning | — |
| Safety | — |
| Multimodal Images | — |
| Multimodal Radiology | — |
| Agentic | — |
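The comparison table reports a gap for each dimension, but this page does not say how that gap is defined. As a minimal sketch, assuming the gap is simply the difference in percentage points between two models' scores, the same arithmetic can be applied to the composite scores from the ranking on this page (per-dimension scores are not listed here). The `score_gap` helper and the `COMPOSITE` dictionary are hypothetical names used only for illustration, not part of the MAST codebase.

```python
from __future__ import annotations

# Composite scores (percent) copied from the leaderboard on this page.
COMPOSITE = {
    "GPT-5.5": 62.1,
    "Claude Opus 4.7": 59.2,
    "Gemini 3.1 Pro": 58.0,
    "Gemini 3.5 Flash": 58.0,
    "Kimi K2.5": 55.6,
    "Grok 4 Fast": 54.5,
    "Grok 4": 54.2,
    "GPT-5.4 mini": 54.1,
}


def score_gap(model_a: str, model_b: str,
              scores: dict[str, float] = COMPOSITE) -> float:
    """Return model_a's score minus model_b's score, in percentage points.

    Assumption: "gap" means a plain difference. MAST may define it differently.
    The result is rounded to one decimal to match the precision of the table.
    """
    return round(scores[model_a] - scores[model_b], 1)


if __name__ == "__main__":
    gap = score_gap("GPT-5.5", "Claude Opus 4.7")
    print(f"{gap:+.1f} pp")  # prints "+2.9 pp"
```

A positive result means the first model scores higher. Models that share a displayed score, such as the two Gemini entries, yield a gap of 0.0 at this precision even though their ranks differ.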
The interactive technical leaderboard hosts the full dataset that powers MAST, with granular benchmark results for researchers, developers, organizations, and labs.
View the public codebase, explore datasets, and run your own models.