The State of LLMs 2025: Progress, Problems, and Predictions
Summary
The LLM landscape in 2025 has shifted from pure architectural scaling toward reasoning-focused training built on Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO). This transition enables models to improve accuracy through explicit reasoning traces and more efficient post-training methods that rely on deterministic feedback rather than expensive human preference labels.
Key Points
- DeepSeek R1 Breakthrough: The release of DeepSeek R1 demonstrated that reasoning-like behavior can be developed via reinforcement learning, significantly reducing the perceived cost of training state-of-the-art models (with estimates as low as roughly $5M, compared with earlier projections of $50M–$500M).
- RLVR Implementation: Reinforcement Learning with Verifiable Rewards (RLVR) enables post-training with deterministic correctness labels, particularly in domains such as mathematics and programming.
- Algorithmic Evolution: The primary focus of LLM development has transitioned through several stages: RLHF/PPO (2022), LoRA/SFT (2023), Mid-training (2024), and RLVR/GRPO (2025).
- GRPO Optimizations: Recent research has introduced mathematical refinements to the GRPO algorithm, including zero-gradient-signal filtering, active sampling, token-level loss, and removal of the KL loss term (e.g., Dr. GRPO); a minimal sketch of GRPO's group-relative advantage computation follows this list.
- Inference-Time Scaling: A growing trend involves increasing computational expenditure during the inference phase to trade higher latency and cost for improved response accuracy.
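To make the GRPO mechanics referenced above concrete, here is a minimal sketch of the group-relative advantage computation. The function name and reward values are illustrative assumptions rather than any specific implementation; the core idea is that each sampled response's reward is standardized against its group's statistics, so no learned critic network is needed.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one group of sampled responses to a prompt.

    Each response is scored (e.g., by a verifiable reward), and its advantage
    is its reward standardized against the group mean and standard deviation;
    no learned critic network is required.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled answers to one math prompt, scored 1.0 if correct else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # mixed group -> nonzero signal
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # uniform group -> all zeros
```

The second call also shows why variants such as DAPO filter out groups in which every sample receives the same reward: the advantages collapse to zero and contribute no gradient signal.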
Technical Details
The technical core of 2025's progress lies in the move away from the bottlenecks of Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), which require expensive, manually written responses or preference labels. RLVR addresses this by utilizing "verifiable" rewards—deterministic outcomes such as whether a piece of code executes correctly or a math problem reaches the correct solution. This allows for scaling compute during the post-training phase by using large amounts of verifiable data.
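As a concrete illustration of what "verifiable" means here, the sketch below scores a model response against a known numeric answer for a math problem and against a unit-test-style check for generated code. The extraction logic and function names are simplifying assumptions; real pipelines use more robust answer parsing and sandboxed code execution.

```python
import re

def math_reward(response: str, reference_answer: str) -> float:
    """Deterministic reward: 1.0 if the last number in the response matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

def code_reward(candidate_src: str, test_src: str) -> float:
    """Deterministic reward: 1.0 if the candidate code passes the supplied test.

    NOTE: exec() on model output is shown only for illustration; real systems
    execute generated code in an isolated sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        exec(test_src, namespace)        # raises AssertionError on failure
        return 1.0
    except Exception:
        return 0.0

print(math_reward("... so the answer is 42", "42"))                              # 1.0
print(code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))  # 1.0
```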
The GRPO algorithm has become a central research focus, with several recent iterations optimizing the training pipeline. Notable modifications include the DAPO approach, which incorporates zero gradient signal filtering, active sampling, and truncated importance sampling. Furthermore, the industry is exploring "explanation-scoring," where a secondary LLM is used to judge the quality of a model's reasoning traces, potentially overcoming the computational overhead issues previously associated with Process Reward Models (PRMs). Looking forward, the development of continual learning techniques aims to mitigate "catastrophic forgetting," allowing models to integrate new knowledge without full retraining.
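One way to picture the zero-gradient-signal filtering mentioned above: if every response in a group receives the same reward, the group-relative advantages are all zero and that group is wasted compute. The sketch below, with assumed function and variable names, fills a training batch only with prompt groups whose rewards vary, similar in spirit to DAPO-style dynamic sampling.

```python
def has_gradient_signal(rewards, tol=1e-8):
    """A group only produces learning signal if its rewards are not all identical."""
    return max(rewards) - min(rewards) > tol

def build_training_batch(sample_group, prompts, batch_size):
    """Fill a batch with prompt groups that carry nonzero advantage signal.

    `sample_group(prompt)` is an assumed callable returning per-response rewards
    for one prompt (e.g., verifiable 0/1 correctness scores).
    """
    batch = []
    for prompt in prompts:
        rewards = sample_group(prompt)
        if has_gradient_signal(rewards):   # skip all-correct or all-wrong groups
            batch.append((prompt, rewards))
        if len(batch) == batch_size:
            break
    return batch
```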
Impact / Why It Matters
The emergence of high-performance, open-weight reasoning models like DeepSeek R1 gives developers access to advanced reasoning capabilities that were previously restricted to proprietary APIs. Additionally, the shift toward inference-time scaling introduces a new architectural trade-off: developers can deliberately accept higher latency and compute costs in exchange for significantly higher accuracy in mission-critical applications.
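A simple form of this inference-time trade-off is self-consistency sampling: generate several independent answers to the same question and return the majority answer, paying several times the compute for a more reliable result. In the sketch below, `generate_answer` is an assumed placeholder for a single stochastic model call, not a specific API.

```python
from collections import Counter

def self_consistent_answer(generate_answer, question: str, n_samples: int = 8) -> str:
    """Trade extra inference compute (n_samples generations) for accuracy.

    `generate_answer(question)` is a placeholder for one stochastic model call
    returning a final answer string; the most common answer across samples wins.
    """
    answers = [generate_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```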