General-Purpose AI Models Outshine Specialized Clinical Tools in Medical Knowledge Study
June 14, 2026
Across MedQA, HealthBench, and a set of real clinical queries, general-purpose LLMs outperformed domain-specific clinical tools, with Gemini, GPT-5.2, and Claude Opus 4.6 leading the way.
The study has NYU Langone IRB approval, and its data and code are publicly accessible, with datasets on HuggingFace and code on GitHub.
A Nature Portfolio study finds that frontier general-purpose large language models surpass specialized clinical AI tools on medical knowledge, on clinician alignment, and in 1,800 blinded physician annotations of real clinical queries, signaling a shift in how medical AI is evaluated.
Readers are encouraged to consult the full Nature paper for detailed methodology, metrics, and context behind the findings.
Noted limitations include potential benchmark overlap, lack of API access for some clinical tools, and no assessment of latency or citation quality. The real clinical queries (RCQ) serve as the primary evidence, with HealthBench as supplementary.
On MedQA, Gemini achieved 97.4% accuracy, followed by GPT-5.2 at 94.2% and Claude at 90.2%; the clinical tools OpenEvidence and UpToDate scored 89.6% and 88.4%, respectively.
On HealthBench, GPT-5.2 scored 88.0%, Gemini 79.3%, and Claude 77.0%; the clinical tools trailed at 62.6% (OpenEvidence) and 61.3% (UpToDate).
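To make the size of the gap concrete, the reported scores can be tabulated and compared by group. The sketch below is only illustrative arithmetic on the figures quoted in this summary: the unweighted group mean and the `group_gap` helper are assumptions introduced here, not a metric from the study, and the "GPT-5.2" label for the HealthBench entry assumes it refers to the same model as in MedQA.

```python
# Scores (percent) as quoted in this summary; nothing here comes from the paper's code.
medqa = {"Gemini": 97.4, "GPT-5.2": 94.2, "Claude": 90.2,
         "OpenEvidence": 89.6, "UpToDate": 88.4}
healthbench = {"GPT-5.2": 88.0, "Gemini": 79.3, "Claude": 77.0,
               "OpenEvidence": 62.6, "UpToDate": 61.3}

CLINICAL_TOOLS = {"OpenEvidence", "UpToDate"}


def group_gap(scores):
    """Unweighted mean of frontier scores minus unweighted mean of clinical-tool scores.

    Hypothetical helper for illustration; the study may summarize differently.
    """
    frontier = [v for name, v in scores.items() if name not in CLINICAL_TOOLS]
    clinical = [v for name, v in scores.items() if name in CLINICAL_TOOLS]
    return round(sum(frontier) / len(frontier) - sum(clinical) / len(clinical), 2)


if __name__ == "__main__":
    for label, scores in (("MedQA", medqa), ("HealthBench", healthbench)):
        print(f"{label}: frontier minus clinical = {group_gap(scores)} points")
```

Under these assumptions the margin is roughly 5 percentage points on MedQA and roughly 19 on HealthBench, so the separation between the two groups is much wider on HealthBench than on MedQA.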
Vivek Subbiah highlighted the report publicly, noting the provocative result that general-purpose LLMs may outperform specialized clinical AI in medical tasks.
The comparative setup pits frontier LLMs (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) against two clinical tools (OpenEvidence and UpToDate Expert AI) and a Google Search AI Overview control.
The findings suggest that scaling, alignment, and cross-domain reasoning could outweigh domain-specific tuning, pointing toward hospital-specific LLMs and careful use of frontier models for less-sensitive tasks.
The research team includes Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Ali Hage, and colleagues, with contributions from Eric Oermann and others.
A tiered performance pattern emerges: frontier LLMs (Gemini, GPT, Claude) rank in the first tier, while the clinical tools and Google AI Overview sit in a second tier, with all nine significant cross-tier differences favoring frontier models.
Summary based on 2 sources
Sources

Oncodaily - Oncology News, Insights, Stories • Jun 14, 2026
Vivek Subbiah: General-Purpose Frontier LLMs Outperform Specialized Clinical AI Tools - OncoDaily