About

The Medical AI Superintelligence Test (MAST) is an independent effort run by the ARISE AI Research Network that curates robust, realistic clinical benchmarks for measuring the performance of medical AI. MAST exists to ensure that AI entering healthcare is rigorously tested, independently validated, and held to the highest clinical standards before it reaches patients.

Our Mission

To establish an open, evidence-based evaluation framework that holds medical AI to the highest clinical standards, ensuring that deployed systems help rather than harm patients. We believe rigorous, independent benchmarking is the foundation of safe AI adoption in healthcare.

Team

MAST is developed by a multidisciplinary team of clinicians, AI researchers, biostatisticians, and medical educators from ARISE, an independent academic collaborative spanning Stanford Medicine, Harvard Medical School, and partner institutions.

The MAST Steering Committee provides strategic direction, approves new evaluation domains, and sets the weighting methodology for the composite score.

The annotation workforce consists of board-certified physicians across many medical specialties who undergo standardized training on scoring rubrics before participating in evaluations. Technical infrastructure is maintained by a dedicated engineering team responsible for the evaluation pipeline, data security, and leaderboard operations.


Independence and Funding

MAST is developed by ARISE, an independent academic research network. MAST may accept external funding. To prevent conflicts of interest, such funds support only the general development of benchmarks, human baselines, and model testing, and are never directed toward any particular model, evaluation, or outcome.

Our evaluation schedule, scoring rubrics, and publication timeline are determined by the MAST Steering Committee. Model providers are notified of results only after scoring is finalized, and they have no opportunity to influence or preview findings before publication.

For most benchmarks, MAST open-sources at least 20% of the evaluation set and, where possible, maintains a private held-out set to prevent overfitting.
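As an illustration only, a public/held-out split of this kind could be sketched as follows. This is not MAST's actual procedure: the function name `split_evaluation_set`, the `salt` value, and the case ID format are all hypothetical, and the only figure taken from the text is the 20% minimum public share. The sketch orders cases by a salted hash so the split is reproducible and independent of input order.

```python
import hashlib


def split_evaluation_set(case_ids, public_percent=20, salt="example-salt"):
    """Return (public, private) lists of case IDs.

    Hypothetical sketch: cases are ordered by a salted SHA-256 digest, so the
    split is reproducible and does not depend on the order of the input.
    The public share is rounded up so it never falls below public_percent.
    """
    unique_ids = sorted(set(case_ids))
    ranked = sorted(
        unique_ids,
        key=lambda cid: hashlib.sha256(f"{salt}:{cid}".encode("utf-8")).hexdigest(),
    )
    # Integer ceiling division avoids floating-point rounding surprises.
    n_public = -(-len(ranked) * public_percent // 100)
    return ranked[:n_public], ranked[n_public:]


if __name__ == "__main__":
    # Made-up case identifiers, for demonstration only.
    cases = [f"case-{i:03d}" for i in range(1, 15)]
    public, private = split_evaluation_set(cases)
    print(f"public: {len(public)}, held out: {len(private)}")
```

Rounding the public share up rather than down keeps the "at least 20%" guarantee even when the number of cases is not a multiple of five. Keeping the salt private would prevent outsiders from predicting which cases end up in the held-out set.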

MAST team members disclose the following funding and conflicts:

Stanford Bio-X Interdisciplinary Initiatives Program Seed Grant (Round 12, 2024)
National Institutes of Health / National Institute of Allergy and Infectious Diseases (1R01AI17812101)
NIH National Center for Advancing Translational Sciences Clinical and Translational Science Award (UM1TR004921)
NIH Center for Undiagnosed Diseases at Stanford (U01 NS134358)
Stanford RAISE Health Seed Grant (2024)
Josiah Macy Jr. Foundation (AI in Medical Education)
Google DeepMind (consulting / advising)
Meta AI (consulting / advising)
Sermo (consulting / advising)

All judging costs are paid from MAST general funds, so no evaluated company pays for the evaluation that scores it. We have accepted token credits from the following companies to run benchmark inference:

AMBOSS
Anthropic
Doximity
Google DeepMind

Conflict of Interest Policy

ARISE maintains conflict of interest policies to protect the integrity of MAST evaluations. The following principles apply to team members and collaborators involved in the benchmark process:

External funders and AI providers cannot fund, sponsor, or direct specific evaluations, scoring, or scheduling. Funding supports general benchmark development only.
Team members with financial affiliations to an evaluated AI company recuse themselves from scoring that company's submissions.
Team members disclose financial relationships or affiliations with AI companies that could affect their objectivity.
Evaluation rubrics and scoring criteria are locked before any model is evaluated and are not modified retroactively.

Data Use Policy

During evaluation, model providers submit their systems through our controlled API pipeline. Apart from the portion of each evaluation set that is released publicly, MAST does not share benchmark cases with model providers before or after evaluation. All evaluation data is processed in a secure environment, and model outputs are stored only for as long as needed to complete scoring.

Clinical cases used in the benchmark are de-identified and contain no patient-identifiable information. Model providers' API keys and system configurations are handled under standard data protection protocols and are not retained after evaluation is complete.

Contact

For questions about MAST, our methodology, or our transparency practices, email contact@arise-ai.org.