Evals

MAST

MAST evaluations are designed to go beyond standard accuracy metrics. Each benchmark in the suite measures a distinct dimension of clinical competence, from diagnostic safety and therapeutic appropriateness to probabilistic reasoning under uncertainty. Evaluation protocols are developed in collaboration with board-certified physicians and validated against expert consensus to ensure clinical relevance.

Safety & Harm Avoidance

Measures whether AI-generated recommendations could lead to patient harm. Evaluates contraindication detection, dosage safety, and adherence to clinical guidelines across specialties.

Diagnostic Accuracy

Assesses the ability to arrive at correct diagnoses given clinical presentations, lab results, and imaging findings. Scored against expert panel consensus using standardized rubrics.

Clinical Reasoning

Evaluates probabilistic reasoning and the capacity to update clinical judgments when presented with new, ambiguous, or conflicting information — mirroring real-world decision-making.
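
As an illustration of what updating a clinical judgment on new evidence can look like, the sketch below applies Bayes' rule in odds form: a pre-test probability is converted to odds, multiplied by a test's likelihood ratio, and converted back. This is a textbook formulation, not a description of how MAST scores responses; the function name `post_test_probability` and the numbers used are hypothetical.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Update a disease probability given a test result's likelihood ratio.

    Hypothetical helper for illustration only; not part of any MAST tooling.
    """
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")

    pre_odds = pre_test_prob / (1.0 - pre_test_prob)   # probability -> odds
    post_odds = pre_odds * likelihood_ratio            # Bayes' rule in odds form
    return post_odds / (1.0 + post_odds)               # odds -> probability


# Illustrative numbers (assumed, not from MAST): a 20% pre-test probability
# and a positive finding with a likelihood ratio of 6.
after_positive = post_test_probability(0.20, 6.0)

# A later conflicting finding with a likelihood ratio of 0.1 pulls the
# estimate back down, which is the kind of revision this dimension targets.
after_conflict = post_test_probability(after_positive, 0.1)
```

Chaining the calls treats the two findings as conditionally independent, which is a simplifying assumption that real clinical evidence often violates.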

Multimodal Comprehension

Tests interpretation of clinical images, pathology slides, and radiology findings alongside textual patient data to evaluate end-to-end clinical task performance.

Suggest an Eval

Have a reproducible safety gap or workflow check? Propose it here.

Healthcare AI Company?

Want your product independently evaluated by verified clinicians? We offer private evaluation engagements and public leaderboard inclusion.

All evaluations are blinded, clinician-governed, and methodologically independent.
