General-Purpose AI Models Outshine Specialized Clinical Tools in Medical Knowledge Study

June 14, 2026
  • Across the MedQA and HealthBench benchmarks and a set of real clinical queries, general-purpose LLMs outperformed domain-specific clinical tools, with Gemini, GPT-5.2, and Claude Opus 4.6 leading the way.

  • The study has NYU Langone IRB approval, and its data and code are publicly accessible, with datasets on HuggingFace and code on GitHub.

  • A Nature Portfolio study finds that frontier general-purpose large language models surpass specialized clinical AI tools on medical knowledge, on clinician alignment, and in 1,800 blinded physician annotations of real clinical queries, signaling a shift in how medical AI is evaluated.

  • Readers are encouraged to consult the full Nature paper for detailed methodology, metrics, and context behind the findings.

  • Noted limitations include potential benchmark overlap, lack of API access for some clinical tools, and no assessment of latency or citation quality; the real clinical queries (RCQ) serve as the primary evidence, with HealthBench as supplementary.

  • In MedQA, Gemini achieved 97.4% accuracy, followed by GPT-5.2 at 94.2% and Claude at 90.2%; clinical tools OpenEvidence and UpToDate scored 89.6% and 88.4%, respectively.

  • On HealthBench, GPT-5.2 scored 88.0%, Gemini 79.3%, and Claude 77.0%; the clinical tools trailed at 62.6% (OpenEvidence) and 61.3% (UpToDate).

  • Vivek Subbiah highlighted the report publicly, noting the provocative result that general-purpose LLMs may outperform specialized clinical AI in medical tasks.

  • The comparative setup pits frontier LLMs (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) against two clinical tools (OpenEvidence and UpToDate Expert AI) and a Google Search AI Overview control.

  • The findings suggest that scaling, alignment, and cross-domain reasoning could outweigh domain-specific tuning, pointing toward hospital-specific LLMs and careful use of frontier models for less-sensitive tasks.

  • The research team includes Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Ali Hage, and colleagues, with contributions from Eric Oermann and others.

  • A tiered performance pattern emerges: frontier LLMs (Gemini, GPT, Claude) rank in the first tier, while clinical tools and Google AI Overview sit in a secondary tier, with all nine significant cross-tier differences favoring frontier models.
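
To make the size of the gap concrete, the reported MedQA and HealthBench figures can be tabulated and compared. The sketch below is only an illustration built from the percentages quoted in this summary: the `REPORTED` table and the `smallest_margin` helper are hypothetical names, not part of the study's published code, and no Google AI Overview scores are included because none are quoted here.

```python
# Scores (percent) copied from the summary above; nothing here is re-measured.
REPORTED = {
    "MedQA": {
        "frontier": {"Gemini": 97.4, "GPT-5.2": 94.2, "Claude": 90.2},
        "clinical": {"OpenEvidence": 89.6, "UpToDate": 88.4},
    },
    "HealthBench": {
        "frontier": {"GPT-5.2": 88.0, "Gemini": 79.3, "Claude": 77.0},
        "clinical": {"OpenEvidence": 62.6, "UpToDate": 61.3},
    },
}


def smallest_margin(benchmark: str) -> float:
    """Gap in percentage points between the weakest frontier model
    and the strongest clinical tool on one benchmark."""
    scores = REPORTED[benchmark]
    weakest_frontier = min(scores["frontier"].values())
    strongest_clinical = max(scores["clinical"].values())
    # Round to one decimal to match the precision of the reported figures.
    return round(weakest_frontier - strongest_clinical, 1)


margins = {name: smallest_margin(name) for name in REPORTED}
print(margins)
```

On the quoted figures, the narrowest gap is 0.6 points on MedQA and 14.4 points on HealthBench. These are raw differences in reported scores and say nothing about statistical significance, which the paper assesses separately.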

Summary based on 2 sources



Sources

Vivek Subbiah: General-Purpose Frontier LLMs Outperform Specialized Clinical AI Tools - OncoDaily
