MAST
MAST scores are built from multiple clinical benchmarks. Each one tests a different dimension of how AI models handle real medical scenarios.

Tests whether AI-generated medical recommendations are safe for patients. The benchmark is built from real clinical consultations across many medical specialties, and board-certified specialists review each AI response for potential harm. This is the core safety benchmark of the MAST suite.

Evaluates AI diagnostic reasoning on complex clinical cases drawn from a century of New England Journal of Medicine Clinicopathological Conferences. The benchmark spans multiple tasks including differential diagnosis, next-test selection, literature search, and image interpretation.

Measures probabilistic clinical reasoning under uncertainty through challenging cases. The benchmark uses Script Concordance Testing, a format that assesses how clinicians adjust their diagnostic or therapeutic judgments in response to new, uncertain information.

Evaluates how accurately AI models interpret chest X-ray images and generate clinically appropriate findings, testing both radiology report generation and visual question answering. ReXrank Mini is a curated subset of the full ReXrank benchmark, built on a multi-site chest radiograph dataset from the Harvard Rajpurkar Lab.