AI Models Struggle with Scientific Reasoning: New Benchmark Highlights Key Shortcomings
August 25, 2025
AI models excel at basic perception tasks like identifying lab equipment and matching molecules to descriptions, but they struggle significantly with complex scientific reasoning, spatial understanding, and integrating multimodal information.
Current vision large language models (VLLMs) are sensitive to question phrasing and tend to rely on pattern recognition rather than genuine understanding, especially in tasks that require synthesizing data across modalities.
The MaCBench benchmark evaluates AI performance across core scientific workflows—information extraction, experimental execution, and data interpretation—using diverse data types such as images, tables, and spectra in realistic scenarios.
Overcoming these limitations is essential for transforming AI from simple pattern matchers into reliable scientific partners capable of accelerating discovery and innovation.
Specific challenges include interpreting complex experimental data like AFM images and spectra, understanding spatial relationships in molecules, and performing sequential logical tasks, often with performance no better than random guessing.
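A claim such as "no better than random guessing" can be checked with a simple significance test. The sketch below is illustrative only and is not taken from the MaCBench evaluation code. It assumes multiple-choice questions with a fixed number of answer options, and the function names and the 0.05 threshold are hypothetical choices.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing (assumes one correct option)."""
    return 1.0 / num_options


def p_value_above_chance(correct: int, total: int, num_options: int) -> float:
    """One-sided exact binomial test: P(X >= correct) under pure guessing."""
    p = chance_accuracy(num_options)
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


def beats_random_guessing(
    correct: int, total: int, num_options: int, alpha: float = 0.05
) -> bool:
    """True if the score is significantly above chance at level alpha (assumed 0.05)."""
    return p_value_above_chance(correct, total, num_options) < alpha
```

Under these assumptions, a model scoring 27 out of 100 on four-option questions would not count as beating chance, because a score that close to the 25% baseline is easily produced by guessing.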
These weaknesses fall into three areas: spatial reasoning, integration of information across modalities, and sequential reasoning. In each, models often depend on patterns in their training data rather than on scientific understanding.
Findings show that current models' performance degrades when tasks demand flexible information integration or sequential logic, highlighting reliance on pattern matching over genuine reasoning.
MaCBench is a comprehensive benchmark designed to evaluate the capabilities and limitations of vision large language models (VLLMs) in scientific fields such as chemistry and materials science.
Understanding these limitations is vital for guiding future AI development, emphasizing the need for better training data, methods that foster genuine reasoning, and improved cross-modal integration.
Advancing AI tools to support scientific discovery and autonomous laboratory operations requires addressing these challenges through improved training and reasoning approaches.
While VLLMs perform well at simple data extraction, they fail at tasks that require deeper understanding, such as describing molecular relationships, interpreting complex experimental images, or performing multi-step logical inferences.
Model accuracy correlates with how prominent a crystal structure is on the internet, which suggests that models recall familiar patterns rather than reason about the structure itself. Results are also sensitive to prompt wording and inference choices.
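A correlation between accuracy and internet prominence could be measured with a rank correlation. The sketch below is a minimal pure-Python Spearman coefficient and is not the benchmark authors' method. How prominence is quantified is not stated here, so the inputs (for example, a per-structure search-hit count paired with a per-structure accuracy) are assumptions, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(prominence, accuracy):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(prominence), average_ranks(accuracy))
```

A coefficient near 1 would be consistent with the recall explanation, though a correlation on its own does not establish why a model answers correctly.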
Summary based on 2 sources
Sources

Research Matters • Aug 25, 2025
MaCBench - A new benchmark to assess AI-powered scientific assistants | Research Matters