Deep-Thinking Tokens Outperform Length-Based Predictors, Boost AI Efficiency and Accuracy
February 22, 2026
The main takeaway: token length is a poor predictor of performance. Deep-thinking tokens better indicate true effort, and a measure called the Deep-Thinking Ratio (DTR) outperforms length-based and confidence-based baselines, enabling Think@n to scale efficiently and to halt candidates early, cutting costs significantly.
Depth is measured by projecting intermediate hidden states into vocabulary space and using the Jensen-Shannon Divergence between layer distributions, with a depth threshold around 0.85 to classify tokens as deep-thinking.
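This layer-wise comparison can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: it assumes a token's depth score is the maximum Jensen-Shannon Divergence (in bits) between consecutive layers' next-token distributions, with the article's 0.85 threshold deciding whether a token counts as deep-thinking; the function names and the "max over consecutive layers" aggregation are assumptions for illustration.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon Divergence (base-2, so bounded by 1 bit)
    between two discrete probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def token_depth(layer_dists):
    """Depth score for one token: largest JSD between the next-token
    distributions of consecutive layers (via a logit-lens projection)."""
    return max(jsd(a, b) for a, b in zip(layer_dists, layer_dists[1:]))

def is_deep_thinking(layer_dists, threshold=0.85):
    """A token is 'deep-thinking' if its prediction shifts sharply
    somewhere in the layer stack, i.e. its depth score crosses the threshold."""
    return token_depth(layer_dists) >= threshold
```

A shallow token whose distribution is identical at every layer scores near zero, while a token whose prediction flips late in the stack scores near one bit, landing above the 0.85 cutoff.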
DTR, defined as the percentage of deep-thinking tokens in a sequence, shows a strong positive correlation with accuracy across models, indicating more deep-thinking tokens tend to yield better results.
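Given per-token depth scores, DTR itself is a simple ratio. A minimal sketch, assuming one depth score per generated token (the function name is an assumption):

```python
def deep_thinking_ratio(depth_scores, threshold=0.85):
    """DTR: fraction of tokens in a sequence whose depth score
    meets the deep-thinking threshold."""
    if not depth_scores:
        return 0.0
    return sum(s >= threshold for s in depth_scores) / len(depth_scores)
```

For example, a sequence where three of four tokens cross the threshold has a DTR of 0.75.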
On the AIME 2025 benchmark, Think@n achieved higher accuracy than Cons@n while reducing inference cost by roughly half, illustrating substantial efficiency gains.
A new study from Google and University of Virginia challenges the notion that longer chain-of-thought improves accuracy, proposing DTR as a better measure of meaningful computation.
Think@n is an inference-time technique that evaluates each candidate's DTR after about 50 tokens and halts unpromising ones, reducing computation while matching, and often exceeding, the accuracy of Self-Consistency.
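The selection step can be sketched as follows. This is an illustrative reconstruction, not the paper's code: it assumes each candidate's DTR has been computed over its first ~50 tokens, that only the top-ranked candidates are decoded to completion, and that the final answer is chosen by majority vote among survivors (as in Self-Consistency); `continue_fn` is a hypothetical stand-in for resuming decoding of a candidate.

```python
from collections import Counter

def think_at_n(prefix_dtrs, continue_fn, keep=2):
    """Think@n sketch: rank n partial generations by the DTR of their
    ~50-token prefix, finish only the top `keep` candidates, and
    majority-vote over their final answers.

    prefix_dtrs: DTR of each candidate's prefix, one per candidate.
    continue_fn(i): hypothetical callback that completes candidate i
                    and returns its extracted answer.
    """
    # Halt unpromising candidates: keep only the highest-DTR prefixes.
    ranked = sorted(range(len(prefix_dtrs)),
                    key=lambda i: prefix_dtrs[i], reverse=True)
    answers = [continue_fn(i) for i in ranked[:keep]]
    # Aggregate surviving completions by majority vote.
    return Counter(answers).most_common(1)[0][0]
```

Because low-DTR candidates are dropped after ~50 tokens rather than decoded in full, most of the per-candidate generation cost is avoided, which is the source of the efficiency gain over running all n candidates to completion.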
Token count correlates negatively with accuracy, meaning producing more text can actually decrease performance due to overthinking and wasted compute.
Deep-thinking tokens are those whose predictions shift significantly in deeper model layers before stabilizing, contrasting with shallow tokens that stabilize early.
