|
2025 · OpenAI
COI: Developer-published
|
o3, GPT-4.1, Claude 3.7 |
B |
C |
A |
B |
B |
B |
DDx, triage, treatment advice, uncertainty communication |
vs. physician-assisted responses; absolute score |
B |
⚑
Conflict of Interest
OpenAI published and self-administered this benchmark. The grader is GPT-4.1, also an OpenAI product. While the physician involvement (262 MDs across 26 specialties) provides genuine external calibration, the scoring judge is inherently biased toward recognizing OpenAI model output patterns. Results should be interpreted with this structural conflict in mind.
Multi-turn conversational format is a genuine strength — it mirrors how patients actually interact with AI health tools. The 5,000 synthetic conversations span 7 clinical themes and reflect real complaint types. However, the fully synthetic generation pipeline means no actual patient encounters are represented. The adversarial construction adds valuable stress-testing but doesn't substitute for real-world messiness.
Fully synthetic. No real patient data. The physician refinement process (11 months, 262 MDs) gives the items meaningful clinical grounding, but synthetic data generation pipelines are known to produce distributional artifacts. No dataset contamination analysis is reported, a notable omission given that the benchmark conversations are model-generated and GPT-4.1 also serves as the grader.
5,000 conversations, 48,562 binary evaluation criteria across 7 thematic areas and 49 languages. This is among the largest and most structurally rich medical eval sets published. HealthBench Hard (1,000 hardest cases) provides a meaningful ceiling test. The multilingual coverage is genuinely distinctive.
LLM-as-judge (GPT-4.1) with binary criteria per rubric item. Grader–physician agreement reaches macro F1 = 0.71, comparable to inter-physician agreement, which is a meaningful calibration signal. The binary criterion format reduces ambiguity versus continuous scales. The main concern is the self-referential structure: GPT-4.1 grades outputs from OpenAI's own models and their competitors' against rubrics assembled and administered by the same company.
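To make the scoring mechanics concrete, here is a minimal Python sketch of rubric-based scoring and grader–physician agreement, assuming a hypothetical criterion layout; the field names, point values, and clip-at-zero convention are illustrative, not the published HealthBench pipeline.

```python
# Hypothetical rubric layout: each criterion carries a point value (possibly
# negative for penalties) and a binary met/not-met grader judgment.
from sklearn.metrics import f1_score

def rubric_score(criteria):
    """Score one conversation: points earned over maximum positive points, floored at 0."""
    earned = sum(c["points"] for c in criteria if c["met"])
    possible = sum(c["points"] for c in criteria if c["points"] > 0)
    return max(earned, 0) / possible if possible else 0.0

def grader_physician_macro_f1(physician_labels, grader_labels):
    """Macro F1 between physician and grader binary judgments on shared criteria."""
    return f1_score(physician_labels, grader_labels, average="macro")

example = [
    {"points": 5, "met": True},    # e.g. "advises emergency care for red-flag symptoms"
    {"points": 3, "met": False},   # e.g. "asks about medication allergies"
    {"points": -4, "met": False},  # negative criterion: not triggered, so no penalty
]
print(rubric_score(example))                                   # 0.625
print(grader_physician_macro_f1([1, 0, 0, 1], [1, 0, 1, 1]))   # ~0.73
```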
HealthBench Hard provides an adversarial ceiling. The multilingual component tests non-English robustness. The benchmark covers a broad clinical topic range. However, as a single-release synthetic benchmark, it has no mechanism to detect test set leakage over time, and the demographic distribution of synthetic patient personas is not independently audited.
The physician comparison is one of HealthBench's genuine differentiators: the Sep 2024 model generation could be improved upon by physicians; the Apr 2025 generation (o3, GPT-4.1) could not. This is a meaningful clinical signal — though it tests physician agreement on synthetic cases, not real clinical outcomes. The lack of downstream outcome correlation remains an open limitation for all medical benchmarks.
Key Finding
o3: 60%; GPT-4o: 32%. A 28-point frontier improvement within months. Apr 2025 models match or exceed physician-assisted performance on synthetic health conversations.
Strengths
Massive scale, multilingual, multi-turn format, direct physician comparison, adversarial ceiling set (HB Hard). Best publicly available benchmark for conversational health AI.
Limitations
Fully synthetic data; self-referential judge (GPT-4.1 grading OpenAI competitors); no contamination analysis; no real outcome correlation; COI from developer-publisher structure.
|
|
Dec 2025 · Lumos AI
COI: Commercial developer
|
Claude Opus 4.1, GPT-5, o3 |
A |
C |
B |
B |
B |
C |
DDx reasoning, treatment planning, communication quality, 105 clinical dimensions |
Dimension-level 1–4 score distribution; no single leaderboard rank |
B |
⚑
Conflict of Interest
MedPI is published by Lumos AI, a commercial healthcare AI company. The benchmark architecture (multi-turn AI patient, ACGME-aligned dimensions) closely mirrors a product evaluation framework. The judge is a committee of Gemini 2.5 models, which introduces potential systematic bias against non-Google models. Independent replication has not been published.
MedPI's full multi-turn consultation format — where an LLM doctor interacts with an AI patient possessing memory, affect, and a synthetic medical record — is the most clinically realistic task structure of any benchmark reviewed here. The 34 clinical contexts span diagnosis and treatment objectives across major complaint categories. Mapping 105 evaluation dimensions to the ACGME competency framework is methodologically sophisticated and clinically grounded.
Fully synthetic patient packets. The authors explicitly note that patient demographics are non-uniform and unrealistic — a notable methodological transparency point, but also a real limitation. No real EHR data. The 34 clinical scenarios are expert-designed but not derived from actual encounters. Scenario diversity appears adequate but the synthetic generation pipeline details are underspecified.
7,097 total patient-doctor conversations across the 9 models evaluated. 105 dimensions is exceptional granularity. The 34 clinical contexts represent broad but not exhaustive specialty coverage. The most current model set of any benchmark reviewed: Claude Opus 4.1, GPT-5, o3, Grok-4, and others.
Committee of Gemini 2.5 models scoring on 1–4 scale with evidence-linked rationales per dimension. Committee structure reduces single-model judge variance. Providing dimension-level scores with rationales is significantly more interpretable than aggregate scores. The main concern is Gemini-judge bias — not quantified in the paper — and the absence of published inter-rater agreement statistics at the dimension level.
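As an illustration of the committee structure, a rough sketch of per-dimension aggregation follows; the median rule, disagreement flag, and field names are assumptions for exposition, not MedPI's published combination method.

```python
# Illustrative committee-judge aggregation: each member returns a 1-4 score
# plus a rationale for every dimension.
from statistics import median

def aggregate_dimension(scores):
    """scores: list of {"judge": str, "score": int, "rationale": str}."""
    values = [s["score"] for s in scores]
    return {
        "score": median(values),               # robust to a single outlier judge
        "spread": max(values) - min(values),   # flag dimensions where judges disagree
        "rationales": [s["rationale"] for s in scores],
    }

committee = [
    {"judge": "gemini-2.5-a", "score": 3, "rationale": "Asked about allergies before prescribing."},
    {"judge": "gemini-2.5-b", "score": 2, "rationale": "Did not disclose model limitations."},
    {"judge": "gemini-2.5-c", "score": 3, "rationale": "Treatment plan cited patient record values."},
]
print(aggregate_dimension(committee))  # {'score': 3, 'spread': 1, ...}
```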
The 105-dimension structure enables fine-grained gap identification that aggregate benchmarks cannot provide. The AI patient with memory and affect models real consultation dynamics better than static prompts. Limitations include non-representative demographics and reliance on a single judge committee from one model family. No adversarial test set or test for prompt sensitivity is reported.
No direct physician performance comparator — the benchmark uses ACGME standards as a proxy for clinical competency expectations. This is a reasonable methodological choice but means results cannot be directly framed as "model vs. doctor." The ACGME alignment is strong for the US training context but limits global clinical generalizability. The finding that even top models score ≤15–20% on several critical dimensions (bias awareness, limitation disclosure) is a clinically meaningful signal.
Key Finding
≤15–20% on critical dimensions. Even frontier models (GPT-5, o3, Claude Opus 4.1) score poorly on bias awareness, limitation disclosure, and rare disease inclusion. No model achieves broad competence across all 105 dimensions.
Strengths
Most current model set; most clinically realistic task format; 105 ACGME-aligned dimensions enabling granular gap identification; committee judge structure; dimension-level rationales provided.
Limitations
Synthetic data with explicitly non-representative demographics; Gemini-committee judge bias unquantified; no direct physician comparator; commercial developer COI; independent replication absent.
|
|
2025 · Stanford CRFM
|
DeepSeek R1, o3-mini, Claude (living) |
B |
B |
A |
B |
A |
B |
Note generation, patient communication, decision support, admin, research summary |
Win-rate per task; category-level breakdown; open leaderboard |
B+ |
121 tasks across 5 clinical categories and 22 subcategories is excellent structural breadth. Tasks span the realistic range of clinician-facing NLP work: note generation, patient communication, decision support, administrative documentation, research summaries. The primarily single-turn format is a limitation — real clinical use is more dynamic — but the task diversity compensates meaningfully.
Mix of 35 established real-world and curated medical datasets — not purely synthetic, not de novo real patient data. Relies on publicly available or licensed medical NLP datasets, some of which are well-validated in the literature. Dataset provenance is transparent. The heterogeneity is a genuine strength but creates uneven quality across tasks; some constituent datasets are older and may not reflect current clinical language.
121 tasks is the largest task count reviewed. Living benchmark with continuous model additions — current leaderboard covers mid-2025 through present. Open leaderboard enables broad community participation. 5 categories and 22 subcategories provide structured coverage. The combination of breadth and continuous updating makes this the most scalable benchmark architecture reviewed.
LLM-jury scoring supplemented by automated metrics (ROUGE, BERTScore). Win-rate as primary metric is interpretable and avoids absolute score inflation. ICC agreement between jury and clinicians is measured — this calibration step is important and more rigorous than most. The multi-metric approach guards against single-metric gaming. Main limitation: win-rate is relative, not absolute — doesn't tell you whether any model is actually good.
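A small sketch of how a head-to-head win-rate could be tallied from jury preferences, assuming a simple (task, model pair) → winner record; the tie-splitting rule and data layout are illustrative rather than MedHELM's actual implementation.

```python
# Hypothetical preference log: (task, model_a, model_b) -> winning model, or None for a tie.
from collections import defaultdict

def win_rates(preferences, models):
    """Fraction of pairwise comparisons each model wins, with ties worth half a win."""
    wins = defaultdict(float)
    counts = defaultdict(int)
    for (task, a, b), winner in preferences.items():
        counts[a] += 1
        counts[b] += 1
        if winner is None:        # tie: split the credit
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {m: wins[m] / counts[m] for m in models if counts[m]}

prefs = {
    ("note_gen_01", "deepseek-r1", "gpt-4o"): "deepseek-r1",
    ("note_gen_01", "deepseek-r1", "claude-3.7"): None,
    ("admin_12", "gpt-4o", "claude-3.7"): "claude-3.7",
}
print(win_rates(prefs, ["deepseek-r1", "gpt-4o", "claude-3.7"]))
```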
As a living benchmark with open-source code and reproducible methodology (HELM framework), MedHELM is the most robustness-tested and replication-friendly benchmark reviewed. The 35-dataset heterogeneity stress-tests models across varied clinical contexts. Administration & Workflow category consistently producing low win-rates (0.53–0.63) suggests genuine task difficulty discrimination rather than ceiling effects. Stanford CRFM's academic independence reduces COI risk.
LLM-jury calibrated against actual clinician ratings with ICC measurement is a meaningful validity signal. No head-to-head model vs. physician performance framing — results are relative. The task diversity means some categories (clinical note generation, patient communication) have strong clinical face validity while others (research summaries) are more academic than clinical. The win-rate framing makes it harder to determine whether top performers are clinically adequate vs. merely relatively better.
Key Finding
DeepSeek R1: 0.66 win-rate. Reasoning models dominate. Administration & Workflow remains the consistently hardest category (0.53–0.63). Non-reasoning models fall significantly behind at all cost tiers.
Strengths
Living benchmark with continuous updates; 121 tasks; open-source reproducible methodology; 35 real datasets; clinician-calibrated jury; academic independence from commercial AI labs.
Limitations
Mostly single-turn; win-rate is relative not absolute; heterogeneous dataset quality; no direct physician comparator; some datasets may be in model training corpora.
|
|
PSB 2026 · Stanford
|
Frontier 2025 models |
A |
C |
C |
B |
B |
D |
EHR navigation, order entry, data retrieval, clinical task completion |
Binary task success rate; no physician comparator |
B− |
The agentic task completion format — where a model must navigate a live EHR environment via FHIR API, retrieve data, and execute clinical actions — is the most operationally realistic benchmark design reviewed. This tests what AI agents will actually need to do in clinical deployment, not just what they can say. 300 physician-designed tasks across the full range of routine clinical workflows reflect genuine clinical input. Binary success scoring is appropriately demanding.
Synthetic EHR only — 100 realistic patient profiles containing 785,000 records (labs, vitals, meds, diagnoses, procedures). Clinically plausible but not real patient data. The synthetic EHR is much richer than most benchmarks' data environments, but agentic tasks may behave differently in real EHR systems with messy data, duplicate entries, and system integration constraints that synthetic environments cannot replicate.
300 tasks is modest for the scope of clinical workflows being evaluated. 100 synthetic patients limits demographic diversity. The model set is frontier-current (PSB 2026 submission) but specific model names are not foregrounded in available materials. The task set is the weakest scaling point — 300 physician-designed tasks is rigorous but insufficient to cover rare presentations and edge cases across clinical medicine.
Binary task success is demanding and unambiguous — either the agent correctly ordered the medication or it didn't. This is more rigorous than rubric-based scoring for agentic tasks. The physician-designed task criteria provide clinical grounding. Main limitation is that binary success doesn't capture partial completion quality — a model that almost correctly executes a complex order gets the same score as one that takes a completely wrong approach.
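For intuition about the agentic format, a minimal sketch of a binary-success check against a FHIR R4 server follows; the base URL, the task (report the most recent lab value), and the success rule are hypothetical, using standard FHIR REST search conventions rather than the benchmark's actual harness.

```python
# Assumed local FHIR R4 test server; the task and checker below are illustrative only.
import requests

FHIR_BASE = "http://localhost:8080/fhir"

def latest_observation_value(patient_id, loinc_code):
    """Fetch the most recent Observation for a patient/LOINC code via FHIR search."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": loinc_code, "_sort": "-date", "_count": 1},
        timeout=10,
    )
    resp.raise_for_status()
    entries = resp.json().get("entry", [])
    if not entries:
        return None
    return entries[0]["resource"]["valueQuantity"]["value"]

def task_success(agent_answer, patient_id, loinc_code, tolerance=1e-6):
    """Binary scoring: the agent either reported the correct value or it did not."""
    truth = latest_observation_value(patient_id, loinc_code)
    return truth is not None and abs(agent_answer - truth) < tolerance
```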
The FHIR API interaction layer grounds results in a realistic integration architecture used in actual health systems. Follow-up work showing "surprising improvement" with newer models validates the benchmark's sensitivity to model capability changes — a good robustness signal. Synthetic EHR limits real-world transfer. Results may not generalize across EHR vendors (Epic vs. Cerner vs. Meditech) which have meaningfully different data models.
The absence of a physician performance comparator is the central clinical validity gap. Tasks were designed by physicians, which provides face validity, but without a clinician baseline we cannot determine whether model success rates are clinically adequate or merely better than chance. The assessment that basic clinical housekeeping tasks are near-term feasible is a qualitative claim without an empirical anchor. This is MedAgentBench's most significant methodological limitation.
Key Finding
Most frontier models struggle. GPT-4-class models showed significant error rates on complex EHR navigation tasks. Newer models show substantial improvement. Basic housekeeping tasks are assessed as near-term feasible.
Strengths
Only agentic EHR benchmark with FHIR API interaction; binary task success is demanding; physician-designed task set; tests real deployment architecture; most operationally relevant format for clinical AI.
Limitations
No physician comparator; synthetic EHR only; 300 tasks is modest; binary scoring misses partial completion quality; unclear cross-EHR generalizability.
|
|
Nov 2025 · arXiv 2511.14439
COI: Chinese institutional scope
|
GPT-5, Claude Sonnet 4.5 (15 models) |
A |
B |
A |
B |
B |
C |
Clinical knowledge, safety, multimodal reasoning, agentic task execution |
Three-track scoring; 0–100 scale; multi-specialty breakdown |
B |
⚑
Geographic/Institutional Scope
MedBench v4 is calibrated against Chinese clinical guidelines and expert consensus from Chinese medical institutions. Results reflect performance on Chinese clinical practice standards, not US or European guidelines. Model rankings may differ substantially when evaluated against Western clinical standards. This is particularly relevant for treatment recommendations and drug formulary decisions.
Three parallel evaluation tracks — base LLM (MCQ + open QA), multimodal (image + text), and agentic (multi-step workflow) — make this the most comprehensive task architecture of any benchmark reviewed. Safety and ethics is an explicit scored dimension rather than an afterthought. The judge calibration against consensus from 500+ institutions is methodologically rigorous. This structure enables direct comparison of model capabilities across different deployment contexts.
Expert-curated by clinicians from 500+ Chinese institutions across 24 primary and 91 secondary specialties. Over 700,000 tasks with multi-stage item refinement and multi-round clinical review. Not real patient EHR data, but the expert construction process is unusually rigorous. The 700K+ scale dramatically reduces item contamination risk. Alignment to Chinese regulatory standards, however, limits international generalizability.
700,000+ tasks, 91 sub-specialties, 15 frontier models including GPT-5 and Claude Sonnet 4.5. This is by far the largest task set of any benchmark reviewed. The 15-model evaluation with the most current frontier set provides the broadest comparative signal. Three parallel tracks enable specialty-specific model comparisons that aggregate benchmarks cannot provide.
Judge calibrated to human clinical ratings from 500+ institutions, which is strong calibration breadth. Three-track scoring with distinct mechanisms for each model type (LLM-judge for the base track, specialized metrics for multimodal, binary completion for agents) is methodologically appropriate. A mean safety score of 18.4/100 across all models is a striking finding that merits independent replication; it could reflect genuine safety gaps or scoring methodology issues.
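As a rough illustration of the three-track reporting structure, the sketch below assembles per-track 0–100 scores with a safety flag; the track field names, flag threshold, and the multimodal figure are placeholders, not MedBench v4's aggregation.

```python
# Illustrative three-track report card on a 0-100 scale with a safety flag.
from dataclasses import dataclass

@dataclass
class TrackScores:
    base_llm: float      # LLM-judged MCQ + open QA track
    multimodal: float    # image + text track
    agentic: float       # multi-step workflow track (binary completion)
    safety: float        # explicit safety & ethics dimension

def report(model, s: TrackScores, safety_floor=50.0):
    """Bundle track scores and flag models whose safety score falls below the floor."""
    flags = ["SAFETY"] if s.safety < safety_floor else []
    return {"model": model,
            "tracks": {"base": s.base_llm, "multimodal": s.multimodal, "agentic": s.agentic},
            "safety": s.safety,
            "flags": flags}

# 62.5, 79.8, and 18.4 echo figures quoted in this review; 55.0 is a placeholder.
print(report("claude-sonnet-4.5", TrackScores(62.5, 55.0, 79.8, 18.4)))
```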
The 700K+ item scale and multi-stage refinement process provide strong robustness within the Chinese clinical context. The three-track architecture enables task-specific robustness testing. The 91-specialty coverage is unmatched. Primary generalizability concern is geographic: Chinese clinical guidelines differ meaningfully from US/EU standards, particularly in treatment protocols and drug availability. Results should not be directly extrapolated to Western deployment contexts.
Scoring calibrated against clinical expert consensus but no direct model vs. clinician performance comparison is reported. This is a meaningful gap for a benchmark at this scale — with 700K+ tasks and 500+ institution involvement, a clinician baseline study would be feasible. The Chinese guideline calibration limits direct applicability to US clinical contexts. The safety score of 18.4/100 mean across all models is clinically alarming if real, but requires independent validation.
Key Finding
Safety mean: 18.4/100. Claude Sonnet 4.5 leads base LLMs (62.5/100); the agent track averages 79.8/100. The safety and ethics dimension reveals alarming gaps across all model types.
Strengths
700K+ tasks; 91 sub-specialties; three parallel tracks; most current model set; explicit safety scoring; 500+ institution calibration. Unmatched scale and structural sophistication.
Limitations
Chinese clinical guidelines only; no direct physician comparator; alarming safety scores unvalidated independently; cloud-based platform limits accessibility; geographic scope limits direct US/EU applicability.
|