Medical AI Superintelligence Test (MAST) Leaderboard

The MAST project seeks to curate a centralized resource of robust and realistic clinical benchmarks to measure the performance of medical AI.

Dec 2025

Feb 2026

Mar 2026

Apr 2026

CPC-Bench, Multimodal Derm

May 2026

~Jul 2026

In Development

NOHARM-Mind

~H2 2026 – 2027

In Development

PACT: 12 high-risk clinical reasoning benchmarks

Preview: MAST is currently in preview. Exact scores on this benchmark may change as we undergo final validation and tuning.

First Do NOHARM v2 overall metric across 45 models

Overall scores across benchmarks

[Scatter plot with selectable X and Y axes; R² = 0.25 for the displayed selection]

Compare models across benchmarks

Not all models could be run on every benchmark; axes with no result are shown at the center.

Top 10 shown, sorted by Reasoning score (descending)

#	Model	Reasoning	Safety	Agentic	Images	Multimodal	CPC	Diagnostic	Management	Radiology
1	GPT-5.5	75.4%±2.3	73.4%±4.6	56.7%±4.6	42.9%±2.8	49.2%±1.8	86.2%±3.8	75.1%±3.2	76.2%±2.6	57.6%±0.0
2	GPT-5	72.1%±2.6	67.2%±5.1	32.4%±10.9	44.2%±2.5	47.6%±1.5	80.3%±4.7	73.1%±4.1	73.2%±2.8	51.6%±0.0
3	GPT-5.2	71.8%±2.3	70.6%±4.7	27.2%±10.3	46.5%±2.7	49.3%±1.5	82.8%±3.9	72.6%±3.4	71.7%±2.9	52.4%±0.0
4	GPT-5.4	71.3%±2.2	72.4%±4.4	38.8%±5.2	46.5%±2.5	47.1%±1.3	82.5%±4.0	71.9%±3.4	70.6%±2.8	47.8%±0.0
5	Claude Opus 4.7	70.1%±2.7	67.2%±5.0	40.6%±5.0	42.7%±2.3	48.0%±1.5	76.7%±4.9	75.4%±3.6	67.9%±3.5	54.7%±0.0
6	Kimi K2.6 (OSS)	66.3%±2.9	63.9%±5.0	--	--	--	73.9%±5.5	73.1%±4.0	63.1%±3.8	--
7	Claude Opus 4.6	66.1%±3.0	57.9%±5.4	44.4%±5.4	40.4%±2.7	42.7%±1.5	76.0%±5.0	75.4%±3.7	63.8%±3.7	45.2%±0.0
8	GPT-5 mini	66.0%±2.4	62.1%±4.9	19.0%±10.7	43.2%±2.4	46.1%±1.4	77.9%±4.7	68.2%±3.8	65.9%±2.9	49.4%±0.0
9	Claude Sonnet 4.6	64.8%±2.9	57.7%±5.4	35.0%±5.7	39.0%±2.8	39.7%±1.5	76.2%±4.8	72.7%±3.5	62.9%±3.5	40.4%±0.0
10	Kimi K2.5 (OSS)	63.4%±2.9	60.7%±5.0	11.1%±8.3	41.7%±2.8	47.0%±1.8	68.7%±5.5	69.3%±4.3	60.8%±4.0	53.8%±0.0

Not all models could be run on every benchmark; cells marked -- indicate no result, not a zero score.
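
For readers who copy the table into their own analysis, the note about missing results matters when aggregating: a cell with no result (shown as -- in the table) should be skipped, not counted as 0. The sketch below is illustrative only. The helper names `parse_cell` and `mean_over_available` are hypothetical, and the simple mean is an assumption for demonstration, not the metric MAST itself reports.

```python
from statistics import mean
from typing import List, Optional


def parse_cell(cell: str) -> Optional[float]:
    """Return the point estimate as a percentage, or None for a missing cell."""
    cell = cell.strip()
    if cell in ("", "--", "NA"):
        return None  # no result: must not be treated as a zero score
    # Cells look like "66.3%±2.9"; keep only the number before the '%' sign.
    return float(cell.split("%")[0])


def mean_over_available(cells: List[str]) -> Optional[float]:
    """Average only the benchmarks that actually have a result (hypothetical aggregate)."""
    scores = [s for s in (parse_cell(c) for c in cells) if s is not None]
    return mean(scores) if scores else None


# Score cells for the Kimi K2.6 row, copied from the table above.
row = ["66.3%±2.9", "63.9%±5.0", "--", "--", "--",
       "73.9%±5.5", "73.1%±4.0", "63.1%±3.8", "--"]

available = mean_over_available(row)

# The incorrect alternative: treating every missing cell as 0.
zero_filled = mean([parse_cell(c) or 0.0 for c in row])
```

Here `available` averages the five benchmarks with results, while `zero_filled` is pulled far lower by the four missing cells, which would misrepresent a model that was simply not run on those benchmarks.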