ARISE

MAST

Benchmarks

MAST scores are built from multiple clinical benchmarks. Each one tests a different dimension of how AI models handle real medical scenarios.

First, Do NOHARM v2

NOHARM v2.0

Tests whether AI-generated medical recommendations are safe for patients. The benchmark is built from real clinical consultations across 10 medical specialties, and each AI response is reviewed by board-certified specialists for potential harm. This is the foundational safety benchmark of the MAST suite.

SCT-Bench

SCT v1.0

Measures probabilistic clinical reasoning under uncertainty using challenging cases. The benchmark follows the Script Concordance Testing (SCT) methodology, a format that assesses how clinicians adjust their diagnostic or therapeutic judgements in response to new, uncertain information.
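
For readers unfamiliar with the format, the sketch below shows the aggregate scoring approach commonly described in the Script Concordance Testing literature: each answer earns partial credit in proportion to how many reference panelists chose it, relative to the most popular (modal) answer. This page does not state how SCT-Bench itself is scored, so treat this as an illustration of the general method only. The function names, the -2 to +2 Likert scale, and the panel data are all hypothetical.

```python
from collections import Counter

def item_credits(panel_responses):
    """Map each Likert option to partial credit: votes / modal votes."""
    counts = Counter(panel_responses)
    modal_votes = max(counts.values())
    return {option: n / modal_votes for option, n in counts.items()}

def sct_score(model_answers, panel_by_item):
    """Mean credit across items, expressed as a percentage."""
    total = 0.0
    for answer, panel in zip(model_answers, panel_by_item):
        # Options that no panelist chose earn zero credit.
        total += item_credits(panel).get(answer, 0.0)
    return 100.0 * total / len(panel_by_item)

# Hypothetical panel votes on a -2..+2 scale, one list per item.
panel_by_item = [
    [-1, -1, -1, 0, 0, -2],  # modal answer is -1
    [2, 2, 1, 1, 1, 0],      # modal answer is 1
]
model_answers = [-1, 2]      # hypothetical model responses

print(sct_score(model_answers, panel_by_item))
```

The design point is that an SCT item has no single correct answer: a response that matches a minority of the panel still earns some credit, which is how the format rewards reasoning under genuine uncertainty.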

MedAgentBench v2

MedAgentBench v2.0

Evaluates how well AI agents complete multi-step clinical tasks in a simulated hospital electronic health record. Tasks span ten categories requiring models to search patient data, interpret conditional logic, and carry out actions such as ordering medications or scheduling labs.

ReXrank Mini

ReXrank v1.0

Evaluates how accurately AI models interpret chest X-ray images and generate clinically appropriate findings. The benchmark tests both radiology report generation and visual question answering, drawing on a multi-site chest radiograph dataset curated by the Harvard Rajpurkar Lab.

CPC-Bench

Coming Soon
CPC v1.0

Evaluates AI diagnostic reasoning on complex clinical cases drawn from a century of New England Journal of Medicine Clinicopathological Conferences. The benchmark spans ten physician-validated tasks including differential diagnosis, next-test selection, literature search, and image interpretation.

Multimodal Derm

Coming Soon
Derm v1.0

Evaluates visual and textual reasoning across skin condition assessment tasks in a multimodal dermatology benchmark.

Independent Academic Research

MAST is developed by ARISE, an independent academic research network. No AI company evaluated in our benchmarks has funding, editorial, or methodological influence over our work.

Editorial process

MAST is operated by the ARISE AI Research Network, an independent academic research collaboration. No AI company evaluated in our benchmarks has funding influence, editorial control, or methodological input over our evaluation processes.

Evaluation schedule

Our evaluation schedule, scoring rubrics, and publication timeline are determined by the MAST Steering Committee. Model providers are notified of results only after scoring is finalized, and they have no opportunity to influence or preview findings before publication.

Stanford Medicine, Harvard Medical School

Developers & Contributors

Analyze, audit, and contribute to MAST. Explore the methodology, run evaluations, and help improve medical AI safety benchmarks.

Submission guidelines

We welcome benchmark submissions via GitHub. All submissions must include a peer-reviewed or pre-print manuscript, a publicly accessible dataset, and reproducible evaluation code. Results should be generated using the official MAST evaluation harness.

Review process

Submissions are reviewed by the MAST governance committee for clinical relevance, methodological rigor, and reproducibility. Accepted benchmarks are integrated into the composite score on a quarterly release cycle. See our policies and instructions on GitHub before opening a pull request.