Deep-Thinking Tokens Outperform Length-Based Predictors, Boost AI Efficiency and Accuracy

February 22, 2026
  • The main takeaway: token length is a poor predictor of performance. Deep-thinking tokens better indicate true reasoning effort, and a measure built on them, the Deep-Thinking Ratio (DTR), outperforms length-based and confidence-based baselines. This lets Think@n scale efficiently and halt weak candidates early, cutting inference costs significantly.

  • Depth is measured by projecting intermediate hidden states into vocabulary space and using the Jensen-Shannon Divergence between layer distributions, with a depth threshold around 0.85 to classify tokens as deep-thinking.
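One plausible reading of this measure, sketched below: project each layer's hidden state through the unembedding matrix (logit-lens style), compare each layer's vocabulary distribution to the final layer's via Jensen-Shannon divergence (base 2, so values lie in [0, 1]), and call a token deep-thinking when the divergence is still large at some intermediate layer, i.e. the prediction only settles deep in the network. The function names, the use of the maximum divergence, and the comparison against the final layer are illustrative assumptions; only the JSD-in-vocabulary-space idea and the ~0.85 threshold come from the article.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def js_divergence(p, q):
    # Jensen-Shannon divergence in base 2, bounded in [0, 1].
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def token_depth(hidden_states, unembed, threshold=0.85):
    """Score one token's 'thinking depth' (illustrative, not the paper's exact rule).

    hidden_states: (num_layers, d_model) intermediate states for the token.
    unembed: (d_model, vocab) projection into vocabulary space.
    A token counts as deep-thinking if some intermediate layer's prediction
    still diverges strongly (JSD above `threshold`) from the final prediction,
    i.e. the prediction shifts significantly in deeper layers.
    """
    dists = softmax(hidden_states @ unembed)            # (num_layers, vocab)
    final = dists[-1]
    divergences = [js_divergence(d, final) for d in dists[:-1]]
    depth_score = max(divergences)
    return depth_score, depth_score > threshold
```

A token whose early layers already predict the final answer yields near-zero divergences (shallow); one whose prediction flips only at the last layers scores near 1 and crosses the threshold.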

  • DTR, defined as the percentage of deep-thinking tokens in a sequence, shows a strong positive correlation with accuracy across models, indicating more deep-thinking tokens tend to yield better results.
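Given per-token depth scores, the DTR itself is a one-liner; the sketch below assumes scores are compared against the ~0.85 threshold mentioned above (the helper name and empty-sequence convention are illustrative):

```python
def deep_thinking_ratio(depth_scores, threshold=0.85):
    """DTR: fraction of tokens in a sequence whose depth score exceeds the threshold."""
    if not depth_scores:
        return 0.0
    deep = sum(1 for s in depth_scores if s > threshold)
    return deep / len(depth_scores)
```

For example, a sequence with per-token scores `[0.9, 0.2, 0.95, 0.4]` has two deep-thinking tokens out of four, so its DTR is 0.5.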

  • On the AIME 2025 benchmark, Think@n achieved higher accuracy than Cons@n while reducing inference cost by roughly half, illustrating substantial efficiency gains.

  • A new study from Google and the University of Virginia challenges the notion that a longer chain of thought improves accuracy, proposing DTR as a better measure of meaningful computation.

  • Think@n is an inference-time technique that halts unpromising candidates after about 50 tokens by evaluating their DTR, reducing computation without sacrificing accuracy, and often improving it, compared to Self-Consistency.
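The mechanics can be sketched as follows, under stated assumptions: sample n candidate chains, score each on its first ~50 tokens by DTR, halt the low-DTR ones, finish only the most promising, and majority-vote the surviving answers as in Self-Consistency. The function names, the `keep_top` cutoff, and the ranking-by-DTR rule are illustrative; the article specifies only the ~50-token DTR-based halting.

```python
from collections import Counter

def think_at_n(generate_prefix, continue_full, n=8, keep_top=3):
    """Illustrative Think@n-style early halting (mechanics assumed, not the paper's spec).

    generate_prefix(i) -> (prefix_tokens, dtr): samples candidate i for ~50
        tokens and returns its Deep-Thinking Ratio so far.
    continue_full(prefix_tokens) -> answer: finishes a surviving candidate.
    """
    # Sample a short prefix for every candidate and rank by DTR.
    prefixes = [generate_prefix(i) for i in range(n)]
    prefixes.sort(key=lambda p: p[1], reverse=True)
    # Halt unpromising candidates: only the highest-DTR prefixes are
    # continued, so most per-candidate compute is never spent.
    survivors = prefixes[:keep_top]
    answers = [continue_full(tokens) for tokens, _ in survivors]
    # Majority vote over surviving answers, as in Self-Consistency.
    return Counter(answers).most_common(1)[0][0]
```

Compared with plain Self-Consistency (Cons@n), which decodes all n chains to completion before voting, this spends full-length decoding only on the `keep_top` survivors.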

  • Token count correlates negatively with accuracy, meaning producing more text can actually decrease performance due to overthinking and wasted compute.

  • Deep-thinking tokens are those whose predictions shift significantly in deeper model layers before stabilizing, contrasting with shallow tokens that stabilize early.

Summary based on 1 source
