MAST
The Medical AI Superintelligence Test (MAST) is an independent effort run by the ARISE AI Research Network to curate robust, realistic clinical benchmarks for measuring the performance of medical AI. MAST exists to ensure that AI entering healthcare is rigorously tested, independently validated, and held to the highest clinical standards before it reaches patients.

Our mission is to establish an open, evidence-based evaluation framework that holds medical AI to the highest clinical standards, ensuring that deployed systems help rather than harm patients. We believe rigorous, independent benchmarking is the foundation of safe AI adoption in healthcare.
We evaluate AI systems the way medicine evaluates treatments: with blinded assessments, expert panels, standardized rubrics, and transparent methodology. Every benchmark in the MAST suite is designed by board-certified physicians, validated against clinical consensus, and resistant to data contamination or shortcut learning.
MAST is developed by a multidisciplinary team of clinicians, AI researchers, biostatisticians, and medical educators from ARISE, an independent academic collaborative spanning Stanford Medicine, Harvard Medical School, and partner institutions.
The MAST Steering Committee provides strategic direction, approves new evaluation domains, and sets the weighting methodology for the composite score. Committee members are drawn from leading academic medical centers and represent diverse clinical specialties. Membership rotates on a three-year cycle, and members must disclose all potential conflicts of interest upon appointment and annually thereafter.
The annotation workforce consists of board-certified physicians across 10 medical specialties who undergo standardized training on scoring rubrics before participating in evaluations. Technical infrastructure is maintained by a dedicated engineering team responsible for the evaluation pipeline, data security, and leaderboard operations.
MAST does not accept direct funding from AI companies for benchmark development or evaluation. Institutional funding sources are disclosed publicly and reviewed annually.

| Source | Type | Period | Purpose |
|---|---|---|---|
| Stanford Medicine | Institutional | 2024–Present | Core research infrastructure and personnel |
| Harvard Medical School | Institutional | 2024–Present | Clinical validation and annotation support |
| NIH/NIDDK | Federal Grant | 2024–2026 | Benchmark development and data curation |

ARISE maintains strict conflict of interest policies to protect the integrity of MAST evaluations. These policies apply to all team members, advisors, and collaborators involved in the benchmark process.
During evaluation, model providers submit their systems through our controlled API pipeline. MAST does not share benchmark cases with model providers before or after evaluation. All evaluation data is processed in a secure environment, and model outputs are stored only for the duration needed to complete scoring.
De-identified clinical cases used in the benchmark are sourced from existing institutional research datasets with appropriate IRB approvals. No patient-identifiable information is included in any benchmark case. Model providers' API keys and system configurations are handled under standard data protection protocols and are not retained after evaluation completion.