Microsoft's Phi-4 Models Outshine Larger AI in Complex Reasoning Tasks
July 9, 2025
Phi-4-reasoning models, including the 14-billion-parameter Phi-4-reasoning and its enhanced version Phi-4-reasoning-plus, are demonstrating remarkable performance on complex reasoning benchmarks like AIME 2025, often surpassing larger models.
Developed by Microsoft Research, these models are trained through supervised fine-tuning on diverse prompts and reasoning demonstrations, with the plus version incorporating reinforcement learning to generate longer, more accurate reasoning traces.
Their success hinges on careful data curation and high-quality synthetic datasets; the combination of supervised fine-tuning and reinforcement learning further strengthens reasoning capabilities.
Reinforcement learning adds a significant performance boost, though at the cost of increased inference-time compute, highlighting reasoning as a transferable meta-skill that benefits from targeted training.
These models exhibit strong generalization to out-of-domain reasoning tasks such as 3SAT, TSP, and calendar planning, despite not being explicitly trained on these challenges.
Phi-4-reasoning models have demonstrated practical reasoning skills through examples like generating a website with HTML and creating a Python program for bouncing balls.
While evaluation shows high variance due to stochastic generation, models like Phi-4-reasoning-plus deliver more consistent and robust performance across tasks.
Parallel test-time scaling further enhances performance, nearly saturating the AIME 2025 benchmark and even surpassing the teacher model in some cases.
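Parallel test-time scaling of this kind typically means sampling several independent completions and aggregating their final answers, for example by majority vote. The sketch below is a minimal illustration of that idea, not Microsoft's implementation; the `generate` callable is a hypothetical stand-in for a model call that returns a final answer string.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among the sampled completions."""
    counts = Counter(answers)
    top_answer, _ = counts.most_common(1)[0]
    return top_answer

def parallel_test_time_scale(generate, prompt, k=8):
    """Sample k independent completions of `prompt` and majority-vote
    their final answers. `generate` is a hypothetical model-call stub."""
    answers = [generate(prompt) for _ in range(k)]
    return majority_vote(answers)
```

Because the k samples are independent, they can be generated concurrently, which is what makes this "parallel" scaling: extra compute at inference time buys accuracy without retraining the model.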
Despite their smaller size, the Phi-4-reasoning models outperform or match larger open- and closed-weight models across a broad range of benchmarks, including math, science, coding, and spatial reasoning.
Source

Microsoft Research • Jul 8, 2025
Phi-Reasoning: Once again redefining what is possible with small and efficient AI - Microsoft Research