Medical AI Superintelligence Test

Welcome to the MAST project. We're a multi-institutional collaboration independently evaluating medical AI. We curate robust, realistic benchmarks in areas including clinical reasoning, safety, and medical images, so you can make informed decisions about the health AI you use.

Which AI can you trust for medical questions?

We measure performance for general medical AI usage as well as clinician-focused usage. Last updated June 30, 2026.

A clinician-focused ranking including medical-specialized models, scored on diagnostic and management reasoning and safety.

The latest large and small model from each provider, ranked on diagnostic and management reasoning, safety, radiology, and medical images.

Rank	Model	Organization	Composite Score
1	GPT-5.5	OpenAI	62.1%
2	Claude Opus 4.7	Anthropic	59.2%
3	Gemini 3.1 Pro	Google	58.0%
4	Gemini 3.5 Flash	Google	58.0%
5	Kimi K2.5	Moonshot AI	55.6%
6	Grok 4 Fast	xAI	54.5%
7	Grok 4	xAI	54.2%
8	GPT-5.4 mini	OpenAI	54.1%

Explore the full dataset on the technical leaderboard ↗

Compare models

How two models compare across all benchmark dimensions.

Dimension	Gap
Diagnostic Reasoning	—
Management Reasoning	—
Safety	—
Multimodal Images	—
Multimodal Radiology	—
Agentic	—
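
Per-dimension scores are not listed on this page, so the following is only a rough sketch of a pairwise comparison using the composite scores from the ranking table. It assumes a "gap" is the plain difference in percentage points, which may not match how MAST computes it; the names `COMPOSITE` and `composite_gap` are hypothetical and not part of the MAST codebase.

```python
# Composite scores (percent), copied from the ranking table on this page.
COMPOSITE = {
    "GPT-5.5": 62.1,
    "Claude Opus 4.7": 59.2,
    "Gemini 3.1 Pro": 58.0,
    "Gemini 3.5 Flash": 58.0,
    "Kimi K2.5": 55.6,
    "Grok 4 Fast": 54.5,
    "Grok 4": 54.2,
    "GPT-5.4 mini": 54.1,
}


def composite_gap(model_a: str, model_b: str) -> float:
    """Return model_a's composite score minus model_b's, in percentage points.

    Assumption: the gap is a simple difference; MAST may define it differently.
    Raises KeyError if either model is not in the table.
    """
    # Round to one decimal to match the precision of the published scores.
    return round(COMPOSITE[model_a] - COMPOSITE[model_b], 1)


if __name__ == "__main__":
    print(composite_gap("GPT-5.5", "Claude Opus 4.7"))
```

A positive result means the first model scored higher; a per-dimension comparison would apply the same subtraction to each dimension's score once those are pulled from the technical leaderboard.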

Looking for more technical information?

The interactive technical leaderboard hosts the full dataset that powers MAST, with granular benchmark results for researchers, developers, organizations, and labs.

Technical leaderboard

Explore the full dataset of scores across benchmarks.

Visit technical leaderboard ↗

Researchers & Developers

View the public codebase, explore datasets, and run your own models.

GitHub

Clone the repo and run your own evals. View our submission guidelines.

View on GitHub ↗