Benchmark Demos

Hands-on explorations of how MAST benchmarks evaluate AI in medicine.

First Do NOHARM v2

Explore how models handle clinical safety scenarios and harmful request detection.

SCT-Bench

See how models perform on script concordance tests for clinical reasoning.

MedAgentBench v2

Watch AI agents navigate multi-step clinical workflows in a simulated EHR.

PhysicianBench

Step through a real agent trajectory: an LLM completing a clinical consult in an EHR via FHIR tool calls.

ReXrank Mini

See how vision-language models generate radiology reports from chest X-rays across public datasets.

CPC-Bench

Explore expert diagnostic reasoning across a century of NEJM clinicopathologic conference cases, plus Dr. CaBot, the AI discussant.

