|
2025 · OpenAI
COI: Developer-published
|
o3, GPT-4.1, Claude 3.7 |
B |
C |
A |
B |
B |
B |
DDx, triage, treatment advice, uncertainty communication |
vs. physician-assisted responses; absolute score |
B |
⚑
Conflict of Interest
OpenAI published and self-administered this benchmark. The grader is GPT-4.1, also an OpenAI product. While the physician involvement (262 MDs across 26 specialties) provides genuine external calibration, the scoring judge is inherently biased toward recognizing OpenAI model output patterns. Results should be interpreted with this structural conflict in mind.
Multi-turn conversational format is a genuine strength — it mirrors how patients actually interact with AI health tools. The 5,000 synthetic conversations span 7 clinical themes and reflect real complaint types. However, the fully synthetic generation pipeline means no actual patient encounters are represented. The adversarial construction adds valuable stress-testing but doesn't substitute for real-world messiness.
Fully synthetic. No real patient data. The physician refinement process (11 months, 262 MDs) gives the items meaningful clinical grounding, but synthetic data generation pipelines are known to produce distributional artifacts. No dataset contamination analysis is reported, a notable omission given that the benchmark conversations are model-generated and GPT-4.1 also serves as the grader.
5,000 conversations, 48,562 binary evaluation criteria across 7 thematic areas and 49 languages. This is among the largest and most structurally rich medical eval sets published. HealthBench Hard (1,000 hardest cases) provides a meaningful ceiling test. The multilingual coverage is genuinely distinctive.
LLM-as-judge (GPT-4.1) with binary criteria per rubric item. Grader–physician agreement reaches macro F1 = 0.71, comparable to inter-physician agreement, which is a meaningful calibration signal. The binary criterion format reduces ambiguity versus continuous scales. The main concern is the self-referential structure: GPT-4.1 grades outputs from OpenAI's own models and their competitors' against rubrics assembled and administered by the same company.
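To make the scoring mechanics concrete, here is a minimal Python sketch of rubric-based scoring and grader–physician agreement, assuming a hypothetical criterion layout; the field names, point values, and clip-at-zero convention are illustrative, not the published HealthBench pipeline.

```python
# Hypothetical rubric layout: each criterion carries a point value (possibly
# negative for penalties) and a binary met/not-met grader judgment.
from sklearn.metrics import f1_score

def rubric_score(criteria):
    """Score one conversation: points earned over maximum positive points, floored at 0."""
    earned = sum(c["points"] for c in criteria if c["met"])
    possible = sum(c["points"] for c in criteria if c["points"] > 0)
    return max(earned, 0) / possible if possible else 0.0

def grader_physician_macro_f1(physician_labels, grader_labels):
    """Macro F1 between physician and grader binary judgments on shared criteria."""
    return f1_score(physician_labels, grader_labels, average="macro")

example = [
    {"points": 5, "met": True},    # e.g. "advises emergency care for red-flag symptoms"
    {"points": 3, "met": False},   # e.g. "asks about medication allergies"
    {"points": -4, "met": False},  # negative criterion: not triggered, so no penalty
]
print(rubric_score(example))                                   # 0.625
print(grader_physician_macro_f1([1, 0, 0, 1], [1, 0, 1, 1]))   # ~0.73
```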
HealthBench Hard provides an adversarial ceiling. The multilingual component tests non-English robustness. The benchmark covers a broad clinical topic range. However, as a single-release synthetic benchmark, it has no mechanism to detect test set leakage over time, and the demographic distribution of synthetic patient personas is not independently audited.
The physician comparison is one of HealthBench's genuine differentiators: the Sep 2024 model generation could be improved upon by physicians; the Apr 2025 generation (o3, GPT-4.1) could not. This is a meaningful clinical signal — though it tests physician agreement on synthetic cases, not real clinical outcomes. The lack of downstream outcome correlation remains an open limitation for all medical benchmarks.
Key Finding
o3: 60%; GPT-4o: 32%. A 28-point frontier improvement within months. Apr 2025 models match or exceed physician-assisted performance on synthetic health conversations.
Strengths
Massive scale, multilingual, multi-turn format, direct physician comparison, adversarial ceiling set (HB Hard). Best publicly available benchmark for conversational health AI.
Limitations
Fully synthetic data; self-referential judge (GPT-4.1 grading OpenAI competitors); no contamination analysis; no real outcome correlation; COI from developer-publisher structure.
|
|
Dec 2025 · Lumos AI
COI: Commercial developer
|
Claude Opus 4.1, GPT-5, o3 |
A |
C |
B |
B |
B |
C |
DDx reasoning, treatment planning, communication quality, 105 clinical dimensions |
Dimension-level 1–4 score distribution; no single leaderboard rank |
B |
⚑
Conflict of Interest
MedPI is published by Lumos AI, a commercial healthcare AI company. The benchmark architecture (multi-turn AI patient, ACGME-aligned dimensions) closely mirrors a product evaluation framework. The judge is a committee of Gemini 2.5 models, which introduces potential systematic bias against non-Google models. Independent replication has not been published.
MedPI's full multi-turn consultation format — where an LLM doctor interacts with an AI patient possessing memory, affect, and a synthetic medical record — is the most clinically realistic task structure of any benchmark reviewed here. The 34 clinical contexts span diagnosis and treatment objectives across major complaint categories. Mapping 105 evaluation dimensions to the ACGME competency framework is methodologically sophisticated and clinically grounded.
Fully synthetic patient packets. The authors explicitly note that patient demographics are non-uniform and unrealistic — a notable methodological transparency point, but also a real limitation. No real EHR data. The 34 clinical scenarios are expert-designed but not derived from actual encounters. Scenario diversity appears adequate but the synthetic generation pipeline details are underspecified.
7,097 total patient-doctor conversations across the 9 models evaluated. 105 dimensions is exceptional granularity. The 34 clinical contexts represent broad but not exhaustive specialty coverage. The most current model set of any benchmark reviewed: Claude Opus 4.1, GPT-5, o3, Grok-4, and others.
Committee of Gemini 2.5 models scoring on 1–4 scale with evidence-linked rationales per dimension. Committee structure reduces single-model judge variance. Providing dimension-level scores with rationales is significantly more interpretable than aggregate scores. The main concern is Gemini-judge bias — not quantified in the paper — and the absence of published inter-rater agreement statistics at the dimension level.
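As an illustration of the committee structure, a rough sketch of per-dimension aggregation follows; the median rule, disagreement flag, and field names are assumptions for exposition, not MedPI's published combination method.

```python
# Illustrative committee-judge aggregation: each member returns a 1-4 score
# plus a rationale for every dimension.
from statistics import median

def aggregate_dimension(scores):
    """scores: list of {"judge": str, "score": int, "rationale": str}."""
    values = [s["score"] for s in scores]
    return {
        "score": median(values),               # robust to a single outlier judge
        "spread": max(values) - min(values),   # flag dimensions where judges disagree
        "rationales": [s["rationale"] for s in scores],
    }

committee = [
    {"judge": "gemini-2.5-a", "score": 3, "rationale": "Asked about allergies before prescribing."},
    {"judge": "gemini-2.5-b", "score": 2, "rationale": "Did not disclose model limitations."},
    {"judge": "gemini-2.5-c", "score": 3, "rationale": "Treatment plan cited patient record values."},
]
print(aggregate_dimension(committee))  # {'score': 3, 'spread': 1, ...}
```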
The 105-dimension structure enables fine-grained gap identification that aggregate benchmarks cannot provide. The AI patient with memory and affect models real consultation dynamics better than static prompts. Limitations include non-representative demographics and reliance on a single judge committee from one model family. No adversarial test set or test for prompt sensitivity is reported.
No direct physician performance comparator — the benchmark uses ACGME standards as a proxy for clinical competency expectations. This is a reasonable methodological choice but means results cannot be directly framed as "model vs. doctor." The ACGME alignment is strong for the US training context but limits global clinical generalizability. The finding that even top models score ≤15–20% on several critical dimensions (bias awareness, limitation disclosure) is a clinically meaningful signal.
Key Finding
≤15–20% on critical dimensions. Even frontier models (GPT-5, o3, Claude Opus 4.1) score poorly on bias awareness, limitation disclosure, and rare disease inclusion. No model achieves broad competence across all 105 dimensions.
Strengths
Most current model set; most clinically realistic task format; 105 ACGME-aligned dimensions enabling granular gap identification; committee judge structure; dimension-level rationales provided.
Limitations
Synthetic data with explicitly non-representative demographics; Gemini-committee judge bias unquantified; no direct physician comparator; commercial developer COI; independent replication absent.
|
|
2025 · Stanford CRFM
|
DeepSeek R1, o3-mini, Claude (living) |
B |
B |
A |
B |
A |
B |
Note generation, patient communication, decision support, admin, research summary |
Win-rate per task; category-level breakdown; open leaderboard |
B+ |
121 tasks across 5 clinical categories and 22 subcategories is excellent structural breadth. Tasks span the realistic range of clinician-facing NLP work: note generation, patient communication, decision support, administrative documentation, research summaries. The primarily single-turn format is a limitation — real clinical use is more dynamic — but the task diversity compensates meaningfully.
Mix of 35 established real-world and curated medical datasets — not purely synthetic, not de novo real patient data. Relies on publicly available or licensed medical NLP datasets, some of which are well-validated in the literature. Dataset provenance is transparent. The heterogeneity is a genuine strength but creates uneven quality across tasks; some constituent datasets are older and may not reflect current clinical language.
121 tasks is the largest task count reviewed. Living benchmark with continuous model additions — current leaderboard covers mid-2025 through present. Open leaderboard enables broad community participation. 5 categories and 22 subcategories provide structured coverage. The combination of breadth and continuous updating makes this the most scalable benchmark architecture reviewed.
LLM-jury scoring supplemented by automated metrics (ROUGE, BERTScore). Win-rate as primary metric is interpretable and avoids absolute score inflation. ICC agreement between jury and clinicians is measured — this calibration step is important and more rigorous than most. The multi-metric approach guards against single-metric gaming. Main limitation: win-rate is relative, not absolute — doesn't tell you whether any model is actually good.
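A small sketch of how a head-to-head win-rate could be tallied from jury preferences, assuming a simple (task, model pair) → winner record; the tie-splitting rule and data layout are illustrative rather than MedHELM's actual implementation.

```python
# Hypothetical preference log: (task, model_a, model_b) -> winning model, or None for a tie.
from collections import defaultdict

def win_rates(preferences, models):
    """Fraction of pairwise comparisons each model wins, with ties worth half a win."""
    wins = defaultdict(float)
    counts = defaultdict(int)
    for (task, a, b), winner in preferences.items():
        counts[a] += 1
        counts[b] += 1
        if winner is None:        # tie: split the credit
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {m: wins[m] / counts[m] for m in models if counts[m]}

prefs = {
    ("note_gen_01", "deepseek-r1", "gpt-4o"): "deepseek-r1",
    ("note_gen_01", "deepseek-r1", "claude-3.7"): None,
    ("admin_12", "gpt-4o", "claude-3.7"): "claude-3.7",
}
print(win_rates(prefs, ["deepseek-r1", "gpt-4o", "claude-3.7"]))
```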
As a living benchmark with open-source code and reproducible methodology (HELM framework), MedHELM is the most robustness-tested and replication-friendly benchmark reviewed. The 35-dataset heterogeneity stress-tests models across varied clinical contexts. Administration & Workflow category consistently producing low win-rates (0.53–0.63) suggests genuine task difficulty discrimination rather than ceiling effects. Stanford CRFM's academic independence reduces COI risk.
LLM-jury calibrated against actual clinician ratings with ICC measurement is a meaningful validity signal. No head-to-head model vs. physician performance framing — results are relative. The task diversity means some categories (clinical note generation, patient communication) have strong clinical face validity while others (research summaries) are more academic than clinical. The win-rate framing makes it harder to determine whether top performers are clinically adequate vs. merely relatively better.
Key Finding
DeepSeek R1: 0.66 win-rate. Reasoning models dominate. Administration & Workflow remains the consistently hardest category (0.53–0.63). Non-reasoning models fall significantly behind at all cost tiers.
Strengths
Living benchmark with continuous updates; 121 tasks; open-source reproducible methodology; 35 real datasets; clinician-calibrated jury; academic independence from commercial AI labs.
Limitations
Mostly single-turn; win-rate is relative not absolute; heterogeneous dataset quality; no direct physician comparator; some datasets may be in model training corpora.
|
|
PSB 2026 · Stanford
|
Frontier 2025 models |
A |
C |
C |
B |
B |
D |
EHR navigation, order entry, data retrieval, clinical task completion |
Binary task success rate; no physician comparator |
B− |
The agentic task completion format — where a model must navigate a live EHR environment via FHIR API, retrieve data, and execute clinical actions — is the most operationally realistic benchmark design reviewed. This tests what AI agents will actually need to do in clinical deployment, not just what they can say. 300 physician-designed tasks across the full range of routine clinical workflows reflect genuine clinical input. Binary success scoring is appropriately demanding.
Synthetic EHR only — 100 realistic patient profiles containing 785,000 records (labs, vitals, meds, diagnoses, procedures). Clinically plausible but not real patient data. The synthetic EHR is much richer than most benchmarks' data environments, but agentic tasks may behave differently in real EHR systems with messy data, duplicate entries, and system integration constraints that synthetic environments cannot replicate.
300 tasks is modest for the scope of clinical workflows being evaluated. 100 synthetic patients limits demographic diversity. The model set is frontier-current (PSB 2026 submission) but specific model names are not foregrounded in available materials. The task set is the weakest scaling point — 300 physician-designed tasks is rigorous but insufficient to cover rare presentations and edge cases across clinical medicine.
Binary task success is demanding and unambiguous — either the agent correctly ordered the medication or it didn't. This is more rigorous than rubric-based scoring for agentic tasks. The physician-designed task criteria provide clinical grounding. Main limitation is that binary success doesn't capture partial completion quality — a model that almost correctly executes a complex order gets the same score as one that takes a completely wrong approach.
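For intuition about the agentic format, a minimal sketch of a binary-success check against a FHIR R4 server follows; the base URL, the task (report the most recent lab value), and the success rule are hypothetical, using standard FHIR REST search conventions rather than the benchmark's actual harness.

```python
# Assumed local FHIR R4 test server; the task and checker below are illustrative only.
import requests

FHIR_BASE = "http://localhost:8080/fhir"

def latest_observation_value(patient_id, loinc_code):
    """Fetch the most recent Observation for a patient/LOINC code via FHIR search."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": loinc_code, "_sort": "-date", "_count": 1},
        timeout=10,
    )
    resp.raise_for_status()
    entries = resp.json().get("entry", [])
    if not entries:
        return None
    return entries[0]["resource"]["valueQuantity"]["value"]

def task_success(agent_answer, patient_id, loinc_code, tolerance=1e-6):
    """Binary scoring: the agent either reported the correct value or it did not."""
    truth = latest_observation_value(patient_id, loinc_code)
    return truth is not None and abs(agent_answer - truth) < tolerance
```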
The FHIR API interaction layer grounds results in a realistic integration architecture used in actual health systems. Follow-up work showing "surprising improvement" with newer models validates the benchmark's sensitivity to model capability changes — a good robustness signal. Synthetic EHR limits real-world transfer. Results may not generalize across EHR vendors (Epic vs. Cerner vs. Meditech) which have meaningfully different data models.
The absence of a physician performance comparator is the central clinical validity gap. Tasks were designed by physicians, which provides face validity, but without a clinician baseline we cannot determine whether model success rates are clinically adequate or merely better than chance. The assessment that basic clinical housekeeping tasks are near-term feasible is a qualitative claim without an empirical anchor. This is MedAgentBench's most significant methodological limitation.
Key Finding
Most frontier models struggle. GPT-4-class models showed significant error rates on complex EHR navigation tasks. Newer models show substantial improvement. Basic housekeeping tasks are assessed as near-term feasible.
Strengths
Only agentic EHR benchmark with FHIR API interaction; binary task success is demanding; physician-designed task set; tests real deployment architecture; most operationally relevant format for clinical AI.
Limitations
No physician comparator; synthetic EHR only; 300 tasks is modest; binary scoring misses partial completion quality; unclear cross-EHR generalizability.
|
|
Nov 2025 · arXiv 2511.14439
COI: Chinese institutional scope
|
GPT-5, Claude Sonnet 4.5 (15 models) |
A |
B |
A |
B |
B |
C |
Clinical knowledge, safety, multimodal reasoning, agentic task execution |
Three-track scoring; 0–100 scale; multi-specialty breakdown |
B |
⚑
Geographic/Institutional Scope
MedBench v4 is calibrated against Chinese clinical guidelines and expert consensus from Chinese medical institutions. Results reflect performance on Chinese clinical practice standards, not US or European guidelines. Model rankings may differ substantially when evaluated against Western clinical standards. This is particularly relevant for treatment recommendations and drug formulary decisions.
Three parallel evaluation tracks — base LLM (MCQ + open QA), multimodal (image + text), and agentic (multi-step workflow) — make this the most comprehensive task architecture of any benchmark reviewed. Safety and ethics is an explicit scored dimension rather than an afterthought. The judge calibration against consensus from 500+ institutions is methodologically rigorous. This structure enables direct comparison of model capabilities across different deployment contexts.
Expert-curated by clinicians from 500+ Chinese institutions across 24 primary and 91 secondary specialties. Over 700,000 tasks with multi-stage item refinement and multi-round clinical review. Not real patient EHR data, but the expert construction process is unusually rigorous. The 700K+ scale dramatically reduces item contamination risk. Alignment to Chinese regulatory standards, however, limits international generalizability.
700,000+ tasks, 91 sub-specialties, 15 frontier models including GPT-5 and Claude Sonnet 4.5. This is by far the largest task set of any benchmark reviewed. The 15-model evaluation with the most current frontier set provides the broadest comparative signal. Three parallel tracks enable specialty-specific model comparisons that aggregate benchmarks cannot provide.
Judge calibrated to human clinical ratings from 500+ institutions, which is strong calibration breadth. Three-track scoring with distinct mechanisms for each model type (LLM-judge for the base track, specialized metrics for multimodal, binary completion for agents) is methodologically appropriate. A mean safety score of 18.4/100 across all models is a striking finding that merits independent replication; it could reflect genuine safety gaps or scoring methodology issues.
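As a rough illustration of the three-track reporting structure, the sketch below assembles per-track 0–100 scores with a safety flag; the track field names, flag threshold, and the multimodal figure are placeholders, not MedBench v4's aggregation.

```python
# Illustrative three-track report card on a 0-100 scale with a safety flag.
from dataclasses import dataclass

@dataclass
class TrackScores:
    base_llm: float      # LLM-judged MCQ + open QA track
    multimodal: float    # image + text track
    agentic: float       # multi-step workflow track (binary completion)
    safety: float        # explicit safety & ethics dimension

def report(model, s: TrackScores, safety_floor=50.0):
    """Bundle track scores and flag models whose safety score falls below the floor."""
    flags = ["SAFETY"] if s.safety < safety_floor else []
    return {"model": model,
            "tracks": {"base": s.base_llm, "multimodal": s.multimodal, "agentic": s.agentic},
            "safety": s.safety,
            "flags": flags}

# 62.5, 79.8, and 18.4 echo figures quoted in this review; 55.0 is a placeholder.
print(report("claude-sonnet-4.5", TrackScores(62.5, 55.0, 79.8, 18.4)))
```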
The 700K+ item scale and multi-stage refinement process provide strong robustness within the Chinese clinical context. The three-track architecture enables task-specific robustness testing. The 91-specialty coverage is unmatched. Primary generalizability concern is geographic: Chinese clinical guidelines differ meaningfully from US/EU standards, particularly in treatment protocols and drug availability. Results should not be directly extrapolated to Western deployment contexts.
Scoring calibrated against clinical expert consensus but no direct model vs. clinician performance comparison is reported. This is a meaningful gap for a benchmark at this scale — with 700K+ tasks and 500+ institution involvement, a clinician baseline study would be feasible. The Chinese guideline calibration limits direct applicability to US clinical contexts. The safety score of 18.4/100 mean across all models is clinically alarming if real, but requires independent validation.
Key Finding
Safety mean: 18.4/100. Claude Sonnet 4.5 leads base LLMs (62.5/100); the agent track averages 79.8/100. The safety and ethics dimension reveals alarming gaps across all model types.
Strengths
700K+ tasks; 91 sub-specialties; three parallel tracks; most current model set; explicit safety scoring; 500+ institution calibration. Unmatched scale and structural sophistication.
Limitations
Chinese clinical guidelines only; no direct physician comparator; alarming safety scores unvalidated independently; cloud-based platform limits accessibility; geographic scope limits direct US/EU applicability.
|