ARISE

Medical AI Superintelligence Test (MAST) Leaderboard

We independently benchmark AI models on clinical safety, medical imaging, and reasoning, so you can make informed decisions about the health AI you use.

Which AI can you trust for medical questions?

Ranked by MAST composite score across five clinical benchmarks. Updated April 2026.

Rank  Model             Developer     Score
1     AMBOSS LiSA 1.0   AMBOSS        66.7%
2     Gemini 2.5 Flash  Google        66.1%
3     Gemini 2.5 Pro    Google        65.4%
4     Grok 4            xAI           64.0%
5     Grok 4 Fast       xAI           63.4%
6     Glass Health 4.0  Glass Health  62.7%
7     GPT-5             OpenAI        62.6%
8     Gemini 2.0 Flash  Google        61.8%
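The sketch below shows how the rankings above follow from the composite scores, and how to read the percentage-point gap between any two models. It uses only the scores listed on this page. The `Entry` type and the `ranked` and `gap` helpers are hypothetical names for illustration. The sketch does not reproduce how MAST computes the composite score itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    developer: str
    score: float  # MAST composite score, in percent


# Composite scores as listed on the leaderboard (April 2026).
LEADERBOARD = [
    Entry("AMBOSS LiSA 1.0", "AMBOSS", 66.7),
    Entry("Gemini 2.5 Flash", "Google", 66.1),
    Entry("Gemini 2.5 Pro", "Google", 65.4),
    Entry("Grok 4", "xAI", 64.0),
    Entry("Grok 4 Fast", "xAI", 63.4),
    Entry("Glass Health 4.0", "Glass Health", 62.7),
    Entry("GPT-5", "OpenAI", 62.6),
    Entry("Gemini 2.0 Flash", "Google", 61.8),
]


def ranked(entries):
    """Return (rank, entry) pairs, highest composite score first (hypothetical helper)."""
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    return list(enumerate(ordered, start=1))


def gap(entries, model_a, model_b):
    """Percentage-point difference in composite score (a minus b), rounded to one decimal."""
    by_name = {e.model: e.score for e in entries}
    return round(by_name[model_a] - by_name[model_b], 1)


if __name__ == "__main__":
    for rank, e in ranked(LEADERBOARD):
        print(f"{rank}. {e.model} ({e.developer}): {e.score:.1f}%")
    print(gap(LEADERBOARD, "AMBOSS LiSA 1.0", "GPT-5"))
```

The result is rounded to one decimal to match the precision of the published scores and to hide floating-point noise: for example, 66.7 minus 62.6 is reported as a 4.1-point gap.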

Compare models

See how two models perform across all five benchmarks.

The five benchmarks in the comparison are:
First Do NOHARM v2
Script Concordance Test
MedAgentBench v2
ReXrank Mini
DermBench

Looking for more technical information?

The technical leaderboard hosts the full evaluation dataset that powers MAST — granular benchmark results, standard deviations, and the model-vs-model comparisons used by researchers, developers, and organizations evaluating medical AI.

Developers & Contributors

Analyze, audit, and contribute to MAST. Explore the methodology, run evaluations, and help improve medical AI safety benchmarks.

Submission guidelines

We welcome benchmark submissions via GitHub. All submissions must include a peer-reviewed or preprint manuscript, a publicly accessible dataset, and reproducible evaluation code. Results should be generated using the official MAST evaluation harness.

Review process

Submissions are reviewed by the MAST governance committee for clinical relevance, methodological rigor, and reproducibility. Accepted benchmarks are integrated into the composite score on a quarterly release cycle. See our policies and instructions on GitHub before opening a pull request.