Benchmark Demos
Hands-on explorations of how MAST benchmarks evaluate AI in medicine.

First Do NOHARM v2
Explore how models handle clinical safety scenarios and harmful request detection.
SCT-Bench
See how models perform on script concordance tests for clinical reasoning.
MedAgentBench v2
Watch AI agents navigate multi-step clinical workflows in a simulated EHR.
PhysicianBench
Step through a real agent trajectory: an LLM completing a clinical consult in an EHR via FHIR tool calls.
ReXrank Mini
See how vision-language models generate radiology reports from chest X-rays across public datasets.
CPC-Bench
Expert diagnostic reasoning across a century of NEJM clinicopathologic conference cases, plus Dr. CaBot, the AI discussant.