Medical AI Superintelligence Test (MAST) Leaderboard
The MAST project seeks to curate a centralized resource of robust and realistic clinical benchmarks to measure the performance of medical AI.
See our methodology and submission instructions.
Dec 2025
Feb 2026
Mar 2026
Apr 2026
CPC-Bench, Multimodal Derm
May 2026
~Jul 2026
In Development
NOHARM-Mind
~H2 2026 – 2027
In Development
PACT: 12 high-risk clinical reasoning benchmarks
Benchmark Demos
Preview: MAST is currently in preview. Exact scores on this benchmark may change as we undergo final validation and tuning.
First Do NOHARM v2 overall metric across 39 models
Benchmark Comparison
Overall scores across benchmarks
R² = 0.30
Model Profiles
Compare models across benchmarks
Not all models could be run on every benchmark; axes with no result are shown at the center.
Performance Over Time
No scored data for the current selection.
Model Leaderboard
Top 10 shown
| # | Model | Reasoning↓ | Safety | Agentic | Images | Multimodal | CPC | Diagnostic | Management | Radiology |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | 76.0%±2.0 | 73.8%±3.9 | 51.4%±4.1 | 42.9%±2.8 | 49.2%±1.8 | 86.2%±3.8 | 75.1%±3.2 | 76.3%±2.4 | 57.6%±0.0 |
| 2 | GPT-5.2 | 73.2%±2.1 | 72.6%±3.9 | 27.2%±10.4 | 46.5%±2.7 | 49.3%±1.5 | 82.8%±3.9 | 72.6%±3.4 | 72.4%±2.5 | 52.4%±0.0 |
| 3 | GPT-5 | 73.2%±2.3 | 69.0%±3.8 | 32.4%±10.9 | 44.2%±2.5 | 47.6%±1.5 | 80.3%±4.7 | 73.1%±4.1 | 73.9%±2.6 | 51.6%±0.0 |
| 4 | GPT-5.4 | 72.2%±2.2 | 73.0%±3.9 | 38.8%±5.1 | 46.5%±2.5 | 47.1%±1.3 | 82.5%±4.0 | 71.9%±3.4 | 70.8%±2.8 | 47.8%±0.0 |
| 5 | Claude Opus 4.7 | 71.5%±2.5 | 71.1%±3.8 | 40.6%±5.0 | 42.7%±2.3 | 48.0%±1.5 | 76.7%±4.9 | 75.4%±3.6 | 69.2%±3.5 | 54.7%±0.0 |
| 6 | Claude Opus 4.6 | 68.5%±2.6 | 64.0%±4.4 | 44.4%±5.3 | 40.4%±2.7 | 42.7%±1.5 | 76.0%±5.0 | 75.4%±3.7 | 66.1%±3.5 | 45.2%±0.0 |
| 7 | GPT-5 mini | 67.9%±2.2 | 65.0%±3.6 | 19.0%±10.7 | 43.2%±2.4 | 46.1%±1.4 | 77.9%±4.7 | 68.2%±3.8 | 67.0%±2.7 | 49.4%±0.0 |
| 8 | Claude Sonnet 4.6 | 67.3%±2.5 | 63.9%±4.2 | 35.0%±5.5 | 39.0%±2.8 | 39.7%±1.5 | 76.2%±4.8 | 72.7%±3.5 | 65.1%±3.4 | 40.4%±0.0 |
| 9 | Kimi K2.6 (OSS) | 67.1%±2.6 | 66.4%±4.3 | -- | -- | -- | 73.9%±5.5 | 73.1%±4.0 | 63.8%±3.8 | -- |
| 10 | GPT-4.1 | 65.5%±2.7 | 58.5%±4.7 | -- | 40.8%±2.5 | 46.7%±1.6 | 71.8%±5.2 | 71.4%±4.3 | 64.5%±3.3 | 54.6%±0.0 |
Not all models could be run on every benchmark; cells marked -- (NA) indicate no result, not a zero score.
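For readers who copy the table into their own analysis, the following is a minimal sketch of keeping missing cells distinct from zero scores. The helper names `parse_score` and `mean_of_available` are hypothetical and not part of MAST. The sketch assumes each cell is either `--` or a string of the form `score%±margin`, as in the table above.

```python
from typing import List, Optional, Tuple

# Cell values treated as "no result" (assumption: "--" as used in the table above).
MISSING = {"--", "", "NA"}

def parse_score(cell: str) -> Optional[Tuple[float, float]]:
    """Parse 'score%±margin' into (score, margin); return None for a missing cell."""
    cell = cell.strip()
    if cell in MISSING:
        return None
    value, _, margin = cell.partition("±")
    score = float(value.rstrip("%"))
    return score, (float(margin) if margin else 0.0)

def mean_of_available(cells: List[str]) -> Optional[float]:
    """Average only the benchmarks a model was actually run on."""
    scores = [p[0] for p in map(parse_score, cells) if p is not None]
    return sum(scores) / len(scores) if scores else None

# Kimi K2.6: Reasoning, Safety, and Agentic cells from the table above.
kimi = ["67.1%±2.6", "66.4%±4.3", "--"]
print(mean_of_available(kimi))
```

Averaging over the two available scores gives about 66.75, whereas counting the missing cell as zero would pull the figure down to about 44.5. That gap is why missing cells should be skipped rather than zero-filled.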