We independently benchmark AI models on clinical safety, medical imaging, and reasoning, so you can make informed decisions about the health AI you use.
Ranked by MAST composite score across five clinical benchmarks. Updated April 2026.
| Rank | Model | Overall Score |
|---|---|---|
| 1 | AMBOSS LiSA 1.0 | 66.7% |
| 2 | Gemini 2.5 Flash | 66.1% |
| 3 | Gemini 2.5 Pro | 65.4% |
| 4 | Grok 4 | 64.0% |
| 5 | Grok 4 Fast | 63.4% |
| 6 | Glass Health 4.0 | 62.7% |
| 7 | GPT-5 | 62.6% |
| 8 | Gemini 2.0 Flash | 61.8% |
See how two models perform across all five benchmarks.
| Benchmark | Gap |
|---|---|
| First Do NOHARM v2 | — |
| Script Concordance Test | — |
| MedAgentBench v2 | — |
| ReXrank Mini | — |
| DermBench | — |
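
For readers who want to reproduce a comparison offline, the per-benchmark gap can be sketched as a simple difference in percentage points. This is a minimal illustration, not the MAST implementation: the function name `benchmark_gaps`, the score dictionaries, and every number in them are hypothetical placeholders, and the assumption that a gap is "model A minus model B" is ours. A benchmark with no score for either model is left as `None`, mirroring the empty cells above.

```python
from typing import Dict, Optional

# Benchmark names as listed in the comparison table.
BENCHMARKS = [
    "First Do NOHARM v2",
    "Script Concordance Test",
    "MedAgentBench v2",
    "ReXrank Mini",
    "DermBench",
]


def benchmark_gaps(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
) -> Dict[str, Optional[float]]:
    """Return the gap (A minus B, in percentage points) for each benchmark.

    A benchmark missing from either model's scores maps to None.
    """
    gaps: Dict[str, Optional[float]] = {}
    for name in BENCHMARKS:
        if name in scores_a and name in scores_b:
            gaps[name] = round(scores_a[name] - scores_b[name], 1)
        else:
            gaps[name] = None
    return gaps


# Hypothetical placeholder scores, NOT real MAST results.
model_a = {"First Do NOHARM v2": 70.0, "DermBench": 55.5}
model_b = {"First Do NOHARM v2": 65.5, "DermBench": 60.0, "ReXrank Mini": 40.0}

gaps = benchmark_gaps(model_a, model_b)
for name, gap in gaps.items():
    print(f"{name}: {'n/a' if gap is None else f'{gap:+.1f} pts'}")
```

A positive gap means the first model scored higher on that benchmark. Because the published standard deviations are not used here, a small gap from this sketch should not be read as a meaningful difference.
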
The technical leaderboard hosts the full evaluation dataset that powers MAST: granular benchmark results, standard deviations, and the model-vs-model comparisons used by researchers, developers, and organizations evaluating medical AI.
Analyze, audit, and contribute to MAST. Explore the methodology, run evaluations, and help improve medical AI safety benchmarks.
We welcome benchmark submissions via GitHub. All submissions must include a peer-reviewed or pre-print manuscript, a publicly accessible dataset, and reproducible evaluation code. Results should be generated using the official MAST evaluation harness.
Submissions are reviewed by the MAST governance committee for clinical relevance, methodological rigor, and reproducibility. Accepted benchmarks are integrated into the composite score on a quarterly release cycle. See our policies and instructions on GitHub before opening a pull request.