ARISE

MAST

Benchmarks

MAST scores are built from multiple clinical benchmarks. Each one tests a different dimension of how AI models handle real medical scenarios.

First Do NOHARM v2

NOHARM v2.0
Safety, Management Reasoning

Tests whether AI-generated medical recommendations are safe for patients. The benchmark is built from real clinical consultations across many medical specialties, and each AI response is reviewed by board-certified specialists for potential harm. This is the core safety benchmark of the MAST suite.

CPC-Bench

CPC v1.0
Diagnostic Reasoning, Management Reasoning

Evaluates AI diagnostic reasoning on complex clinical cases drawn from a century of New England Journal of Medicine Clinicopathological Conferences. The benchmark spans multiple tasks, including differential diagnosis, next-test selection, literature search, and image interpretation.

PhysicianBench

PhysBench v1.0
Agentic

Tests AI agents on 100 long-horizon physician tasks adapted from real consultations across 21 specialties. Agents work in an EHR environment with real patient records, accessed over standard FHIR APIs, and are scored against 670 execution-verified checkpoints.
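
To make the FHIR access pattern concrete, here is a minimal sketch of how an agent might build a standard FHIR search request. The base URL, patient identifier, and code value are hypothetical placeholders, not taken from PhysicianBench; only the `[base]/[type]?name=value` search shape comes from the FHIR REST convention.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the benchmark's real endpoint is not stated on this page.
FHIR_BASE = "http://localhost:8080/fhir"


def fhir_search_url(resource_type, **params):
    """Build a FHIR search URL of the form [base]/[type]?name=value&..."""
    # Sort parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{FHIR_BASE}/{resource_type}?{query}"


# Illustrative query: up to 5 observations for a made-up patient and lab code.
url = fhir_search_url(
    "Observation",
    patient="example-123",  # hypothetical patient id
    code="2345-7",          # illustrative LOINC-style code
    _count=5,               # standard FHIR paging parameter
)
print(url)
```

The sketch only constructs the URL and sends nothing over the network. An actual agent would issue the request and parse the returned Bundle resource.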

Script Concordance Test

SCT v1.0
Diagnostic Reasoning, Management Reasoning

Measures probabilistic clinical reasoning under uncertainty on challenging cases. The benchmark uses Script Concordance Testing, a format that assesses how clinicians adjust their diagnostic or therapeutic judgments in response to new, uncertain information.
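
As an illustration of how script concordance items are commonly scored, the sketch below uses the classical aggregate method: each response earns credit proportional to the number of expert panelists who chose it, normalized by the count of the most popular response. This is an assumption about the general SCT format, not a description of how MAST computes its score; the function names, the -2 to +2 scale, and the panel data are all hypothetical.

```python
from collections import Counter


def sct_item_credits(panel_responses):
    """Map each response to partial credit: its panel count / modal count."""
    counts = Counter(panel_responses)
    modal = max(counts.values())
    return {resp: n / modal for resp, n in counts.items()}


def sct_score(panel_by_item, answers):
    """Average credit over all items; unanswered or unchosen responses get 0."""
    total = 0.0
    for item_id, panel in panel_by_item.items():
        credits = sct_item_credits(panel)
        total += credits.get(answers.get(item_id), 0.0)
    return total / len(panel_by_item)


# Hypothetical panel ratings on a -2..+2 scale (five experts per item).
panel = {
    "q1": [1, 1, 1, 0, 2],
    "q2": [-1, -1, -2, -2, 0],
}
model_answers = {"q1": 0, "q2": -2}
print(sct_score(panel, model_answers))
```

Under this scheme, a model is rewarded for agreeing with any portion of the expert panel, which reflects the idea that uncertain cases can have more than one defensible judgment.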

MedAgentBench v2

MedAgentBench v2.0
Agentic

Tests AI agents on multi-step tasks in a simulated hospital electronic health record. Tasks require agents to search patient data, follow conditional logic, and take actions such as ordering medications or scheduling labs.

ReXrank Mini

ReXrank v1.0
Multimodal Radiology

Evaluates how accurately AI models interpret chest X-ray images and generate clinically appropriate findings, testing both radiology report generation and visual question answering. ReXrank Mini is a curated subset of the full ReXrank benchmark, built on a multi-site chest radiograph dataset from the Harvard Rajpurkar Lab.

Multimodal Images

Images v1.0
Multimodal Images

Tests multimodal AI models on skin-lesion assessment, combining the Diverse Dermatology Images (DDI) and MRA-MIDAS datasets of clinical and dermoscopic images with histopathologic confirmation.

Researchers & Developers

View the public codebase, explore datasets, and run your own models.