Apple AI Study Reveals Flaws in 'Thinking' Models on Complex Tasks: Are They Truly Reasoning?
August 24, 2025
Apple AI researchers have published a paper critically examining the effectiveness of recent 'thinking' AI models such as ChatGPT 5.0, Claude 4 Sonnet, DeepSeek, and Grok 3, which aim to simulate reasoning and problem-solving.
The study compares these advanced 'thinking' models, known as large reasoning models (LRMs), with earlier non-thinking large language models (LLMs), which generate responses based on probability rather than genuine reasoning.
The researchers tested the models on classic puzzles such as Tower of Hanoi, Conway's Soldiers, river crossing, and block-stacking to assess their reasoning processes.
On simpler tasks, the thinking models often wasted computational resources, exploring many options before arriving at the correct answer.
While thinking models outperform non-thinking models on moderately difficult tasks, their accuracy drops sharply on more complex problems, such as Tower of Hanoi instances with eight or more disks.
A surprising finding: even when handed the known algorithmic solution to Tower of Hanoi, the models still explored incorrect options and failed to follow the correct sequence of steps (a sketch of that algorithm appears below).
Overall, the paper concludes that current reasoning models do not truly 'know' what they are doing; they can produce probable responses but struggle to maintain or verify a consistent line of reasoning, especially on complex problems.
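For context, the classic recursive solution to Tower of Hanoi produces 2^n − 1 moves for n disks, so the required move sequence doubles with every disk added; an eight-disk instance already demands 255 error-free steps. The sketch below illustrates that well-known algorithm; it is not the exact prompt or code the researchers supplied to the models.

```python
def hanoi(n, source, target, spare, moves):
    """Classic recursive Tower of Hanoi: move n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks out of the way
    moves.append((source, target))              # move the largest remaining disk directly
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top of it

moves = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 255, i.e. 2**8 - 1: one misstep anywhere breaks the solution
```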
Summary based on 1 source
Source

A Rich Life • Aug 23, 2025
Why Apple Researchers are Skeptical of Latest AI