Reinforcement Learning (RL) – a topic that’s constantly present in 2025. Ever since DeepSeek-R1 kicked off the year with a major reasoning breakthrough, researchers and developers have increasingly turned to RL-based training as a way to strengthen reasoning and agentic behavior. We saw hundreds of research papers on RL everywhere: in LLMs, VLMs, the coding domain, agentic systems, robotics, and beyond.

Unlike supervised learning (which depends on labeled datasets) or unsupervised learning (which searches for patterns without guidance), RL is built on interaction. The system learns by doing, by nudging the environment and watching how it reacts. Rewards, penalties, surprises, and mistakes shape the policy more than any static dataset ever could. In other words, it pushes models closer to real-world behavior.

But is reinforcement learning really as strong as it looks right now? This year also surfaced a set of concerns about the strategy.

Today we’re zooming out to see what RL achieved in 2025, where it ran into walls, and what momentum it brings into 2026. Let’s begin the full year-in-review.

Reinforcement Learning is a lot worse than I think the average person thinks. Reinforcement Learning is terrible. It just so happens that everything that we had before is much worse.

Andrej Karpathy

In today’s episode, we will cover:

  • Revisiting some basics

  • Reinforcement Learning from Human Feedback (RLHF) becomes classic

  • From human to AI judgement: Reinforcement Learning from AI Feedback (RLAIF)

  • Reinforcement Learning with Verifiable Rewards (RLVR) – The new promise?

  • Why does Andrej Karpathy think that RL is terrible?

  • The most popular Policy Optimization algorithms in 2025

  • What will we take into 2026? Important trends and advances in RL:

    • RL Scaling

    • Reinforcement Pre-Training (RPT)

    • Multi-objective RL

    • Agentic RL

    • RL in Robotics

  • Not without limitations

  • Conclusion

  • Sources and further reading

Revisiting some basics

Reinforcement learning is a type of machine learning where a model learns through trial-and-error interactions with an environment. Traditionally, RL centered on building explicit agents that take actions and receive rewards. The landscape shifted this year: RL techniques moved well beyond classical agent settings and showed up across model training, optimization, reasoning, and even data curation. As Richard Sutton and David Silver argue in their Era of Experience framework, this shift is structural: the next phase of AI will be driven not by human data, but by agent experience.

At its core, the setup is simple. The agent takes an action, observes the outcome, and receives a reward or penalty based on how good that action was. Over time it tries to learn a policy that picks actions leading to high long-term reward. RL problems are usually framed as a Markov Decision Process (MDP) with the following elements:

  • States: the situations the agent can find itself in

  • Actions: the choices available to the agent

  • Transitions: how actions move the agent from one state to another

  • Rewards: the feedback signal guiding learning

When RL works with LLMs:

  • The state is usually the entire prompt or conversation.

  • The action space is the model’s entire (and very large) vocabulary.
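To make the four MDP ingredients tangible, here is a minimal sketch on a made-up four-state chain. The environment, the reward of 1.0 for reaching the goal, and the discount factor are all illustrative assumptions, and value iteration is just one simple way to recover an optimal policy when the MDP is tiny and fully known.

```python
# Hypothetical toy MDP: states 0..3 in a row, state 3 is the terminal goal.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GOAL = 3
GAMMA = 0.9  # discount factor (assumed value)


def step(state, action):
    """Transition and reward: returns (next_state, reward)."""
    if state == GOAL:
        return state, 0.0
    nxt = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward


def backup(state, action, values):
    """One-step lookahead: immediate reward plus discounted future value."""
    nxt, reward = step(state, action)
    return reward + GAMMA * values[nxt]


def value_iteration(sweeps=50):
    """Repeatedly apply the Bellman optimality backup to every state."""
    values = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            if s != GOAL:
                values[s] = max(backup(s, a, values) for a in ACTIONS)
    return values


def greedy_policy(values):
    """The policy: in each state, pick the action with the best lookahead."""
    return {
        s: max(ACTIONS, key=lambda a: backup(s, a, values))
        for s in STATES
        if s != GOAL
    }


values = value_iteration()
policy = greedy_policy(values)
```

With LLMs nobody enumerates states like this, since the state is a whole conversation and the action space is the vocabulary, which is why policy-gradient methods are used instead of tabular ones.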

RL algorithms aim to find the optimal policy, meaning the best possible strategy for earning the most reward over time. RL lets models learn from types of feedback that are not differentiable, such as human opinions or preference comparisons. It is especially useful for two things:

  • Alignment: The model can learn which responses people prefer, even when there’s no clear “correct” answer.

  • Reasoning: The model can be rewarded for producing correct step-by-step reasoning that leads to accurate solutions.

2025 taught us a lot about both parts.

Reinforcement Learning from Human Feedback (RLHF) becomes classic

Let’s start with alignment. Models can hallucinate, produce biased or unsafe content, and stumble on complex instructions. Alignment matters because it nudges AI toward being more helpful and more consistent, whether it’s a chatbot holding a coherent conversation or a code model finally passing its unit tests.

Reinforcement Learning from Human Feedback (RLHF) became the default alignment strategy for LLMs in 2025. It typically follows three steps:

  • Supervised Fine-Tuning (SFT): The model is trained on examples written by humans. This teaches it what “good” responses look like and provides a strong starting point.

  • Reward Model Training: Humans compare pairs of model responses and pick the better one. A “reward model” learns to predict these preferences.

  • Reinforcement Learning itself: The LLM is then optimized with algorithms like Proximal Policy Optimization (PPO) so it learns to produce responses the reward model scores highly. A KL penalty acts as a safety rail by preventing the policy from drifting too far from the SFT model, which keeps training stable and stops the model from warping its original knowledge.
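Two pieces of this recipe fit in a few lines. The sketch below assumes a Bradley-Terry style pairwise loss for the reward model and a simple per-sample log-ratio as the KL estimate; the function names and the `beta` value are illustrative choices, not taken from any specific RLHF codebase.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Pairwise reward-model loss: -log(sigmoid(chosen - rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one.
    """
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


def kl_shaped_reward(rm_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the SFT model.

    logp_policy and logp_reference are the log-probabilities that the current
    policy and the frozen SFT reference assign to the sampled response.
    Their difference is a single-sample estimate of the KL term (assumption).
    """
    kl_estimate = logp_policy - logp_reference
    return rm_score - beta * kl_estimate


# The reward model agrees with the human label -> lower loss.
agree = preference_loss(score_chosen=2.0, score_rejected=0.0)
disagree = preference_loss(score_chosen=0.0, score_rejected=2.0)

# Same reward-model score, but the second response drifted from the reference.
on_reference = kl_shaped_reward(1.0, logp_policy=-5.0, logp_reference=-5.0)
drifted = kl_shaped_reward(1.0, logp_policy=-4.0, logp_reference=-5.0)
```

The shaped reward is what the policy-optimization step actually maximizes, so a response that pleases the reward model but sits far from the SFT model earns less.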

However, there are some difficulties with human feedback: it is expensive and slow to collect. That's why, in our era of AI-powered optimization, what has also become popular is Reinforcement Learning from AI Feedback (RLAIF).

Image Credit: “RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback” paper

From human to AI judgement: Reinforcement Learning from AI Feedback (RLAIF)

RLAIF follows the same steps as RLHF, but instead of collecting human preference data, it replaces or supplements human evaluations with AI evaluators that judge model outputs. These evaluators may be:

  • powerful general-purpose models

  • specialized classifiers trained to detect specific kinds of problems, like toxicity, bias, factuality, etc.

  • ensembles of smaller, targeted models

The reward model is then trained to match the AI feedback, and different optimization algorithms are used to fine-tune the LLM just like in RLHF.

Benefits of RLAIF:

  • Far more scalable than human labeling

  • Cheaper and faster

  • Often more consistent than human annotators

But there are some limitations too:

  • AI evaluators may not perfectly reflect human values.

  • Evaluator biases can be passed on to the trained model.

  • Feedback loops may amplify errors over time.

One of the interesting examples of RLAIF is Curriculum-RLAIF that helps the reward model generalize better and produce stronger alignment during RL. It trains reward models the same way you’d teach a student: start with simple, clear-cut examples and gradually introduce harder, noisier ones. Easy pairs are generated using guided prompts, hard pairs – using random sampling, and medium-difficulty “bridge” pairs in between. These pairs are sorted from easy to hard and fed to the reward model in that order.

Image Credit: Curriculum-RLAIF original paper

Beyond alignment, RL has also been used to boost the reasoning abilities of models, leading to the emergence of an entire class of Reasoning Models, with early examples like DeepSeek-R1 and Kimi-1.5.

GRPO and RLVR: Surprising Findings in 2025

The beginning of 2025 shocked everyone with the release of DeepSeek-R1 on January 20. It was the first model to deliver advanced reasoning on par with OpenAI’s o1 (at that time) and make it available to the public. For the first time, we could actually watch the reasoning unfold, since the open model produced long, detailed chains of thought that showed how it arrived at its answers. The backbone for this was the Group Relative Policy Optimization (GRPO) algorithm, which sits inside a broader training approach for strengthening a model’s reasoning ability: Reinforcement Learning with Verifiable Rewards (RLVR).

RLVR is a part of the current trend of moving from subjective alignment, like humans labeling outputs, to objective alignment, using rewards that can be proven correct or automatically checked, like math or code.

Here is how RLVR works:

  • A model tries different reasoning paths and generates its final output (answer, code, plan, tool call, reasoning steps).

  • A verifier checks it via unit tests, math solvers, consistency rules, tool execution, logic constraints.

  • Then the verifier returns a binary or numeric reward.

  • The model is optimized (often via GRPO/PPO-style RL) using this reward.
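The verifier step is the easiest one to picture in code. Below is a minimal sketch of a rule-based math verifier that returns a binary reward. The `Answer:` marker convention and both function names are assumptions made for illustration; real pipelines use unit tests, solvers, or tool execution, as listed above.

```python
def extract_final_answer(text, marker="Answer:"):
    """Return the text after the last marker, or None if it is missing."""
    idx = text.rfind(marker)
    if idx == -1:
        return None
    rest = text[idx + len(marker):].strip()
    return rest.splitlines()[0].strip() if rest else None


def verifiable_reward(model_output, expected):
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_final_answer(model_output)
    if answer is None:
        return 0.0  # no parsable answer, no reward
    try:
        # Compare numerically when both sides parse as numbers.
        return 1.0 if abs(float(answer) - float(expected)) < 1e-9 else 0.0
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return 1.0 if answer == str(expected) else 0.0


good = verifiable_reward("12 * 12 = 144, so...\nAnswer: 144", 144)
bad = verifiable_reward("12 * 12 = 124, so...\nAnswer: 124", 144)
missing = verifiable_reward("I think it is about 140.", 144)
```

Note that this check looks only at the final answer and says nothing about whether the reasoning before it was sound.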

Researchers from LeapLab (Tsinghua University) and Shanghai Jiao Tong University decided to test whether RLVR is actually as impressive as many claim.

In their paper called “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?”, they tested RLVR-trained models across different model families, RL methods, and benchmarks for math, coding, and visual reasoning and evaluated them by checking pass@k at very large k values – when the model is allowed many chances to sample a correct solution.
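For readers who have not met pass@k before, here is a sketch of the widely used unbiased estimator: sample n solutions, count the c correct ones, and compute the chance that at least one of k randomly drawn samples is correct. Whether the paper uses exactly this estimator is an assumption, and the per-problem counts below are invented purely to show how a model can win at k=1 and still lose at large k.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples (drawn from n, c correct) passes."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up numbers: correct samples out of n=100 on two problems.
N = 100
base_counts = [2, 2]   # base model: rarely right, but on both problems
rlvr_counts = [90, 0]  # RL-tuned model: very reliable on one, never on the other

base_at_1 = mean_pass_at_k(base_counts, N, k=1)
rlvr_at_1 = mean_pass_at_k(rlvr_counts, N, k=1)
base_at_64 = mean_pass_at_k(base_counts, N, k=64)
rlvr_at_64 = mean_pass_at_k(rlvr_counts, N, k=64)
```

In this toy setup the RL-tuned model is far ahead at k=1, while the base model overtakes it at k=64 because it occasionally solves both problems.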

Their findings challenge the common assumptions:

  • RLVR makes the model more efficient – it gets to the correct reasoning path faster, with fewer samples and more often. For example, it improves pass@1.

  • But it also reduces the model’s reasoning range.

Image Credit: “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” paper

When we allow multiple samples, the base models actually outperform their RLVR-trained versions. This suggests that RLVR models lean on reasoning paths the base model already knew how to produce. They don’t discover fundamentally new strategies. RLVR mostly shifts the sampling distribution so those paths appear more often.

Coverage and perplexity analyses confirm this: the reasoning steps produced by RLVR models already exist somewhere in the sampling distribution of the base model. All of this leads to one conclusion: an RLVR model’s abilities come from, and are limited by, the base model itself.

But later, researchers from Microsoft Research Asia and Peking University published a paper, “Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs”, that disputes these results.

Oh we love a good paper fight.

What did they do? By repeating pass@k experiments for math and code tasks, the researchers show that RLVR solves problems the base model cannot, even with many tries.

They also noted though that pass@k can be misleading on math tasks, because it scores only the correct answers, not correct reasoning. Sometimes the base model gives wrong reasoning but still lands on the right final answer by chance. So they proposed a new metric, CoT-Pass@K, which checks if both the final answer and the reasoning steps (CoT) are correct. Using this stricter metric, they observe a clear improvement after RLVR – reasoning itself gets better.
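The difference between the two metrics comes down to what counts as a success. The sketch below assumes each sample already carries two labels, one for the final answer and one for the reasoning. How the reasoning label is produced is outside this snippet, and the sample data is invented.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k samples drawn from n (c successes) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def answer_pass_at_k(samples, k):
    """Plain pass@k: a sample succeeds if its final answer is correct."""
    c = sum(1 for answer_ok, _ in samples if answer_ok)
    return pass_at_k(len(samples), c, k)


def cot_pass_at_k(samples, k):
    """Stricter variant: the final answer AND the reasoning must be correct."""
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(len(samples), c, k)


# Hypothetical labels for 8 samples: (answer_correct, reasoning_correct).
samples = [
    (True, False),   # lucky guess: right answer, flawed reasoning
    (True, False),
    (True, True),    # genuinely solved
    (False, False),
    (False, False),
    (False, False),
    (False, False),
    (False, False),
]

loose = answer_pass_at_k(samples, k=2)
strict = cot_pass_at_k(samples, k=2)
```

Lucky guesses inflate the plain metric but not the stricter one, which is the gap the authors point to.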

Image Credit: “Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs” paper

Their main insight on why RLVR works is this: pretrained LLMs already know a lot about what good reasoning looks like, but RLVR increases the probability of those correct reasoning paths because they align with the reward for correct, verifiable answers. Even if the reward only checks the final answer, the gradient still pushes the model toward generating valid reasoning steps.

They also found that:

  • RLVR starts encouraging correct reasoning early in training.

  • These improvements generalize to new test questions.

  • CoT quality also improves.

So the verdict from this study is that RLVR does expand the reasoning abilities of models. It’s best understood through one key idea: RLVR works when it pushes the model toward verifiable reasoning paths, not just correct answers.

This paper argues that using an LLM as a reasoning judge is a promising direction and that we need better benchmarks to evaluate how reliable these LLM judges are.

As you see, it all comes down to the evaluation issue again.

For convenience, let’s summarize how to distinguish the core RL approaches:

  • Human-labeled reward → RLHF

  • AI-labeled reward → RLAIF

  • Automatically verifiable reward → RLVR

Since we’ve already touched on the criticism of RL, let’s go to →

Why does Andrej Karpathy think that RL is terrible?

To understand this, we need to revisit Karpathy’s explanation of how RL works.

For example, suppose a model needs to solve a math problem. In RL, it will first try many approaches and variations in parallel. These hundreds of attempts can be quite complex. Once the model produces an answer, that answer is compared with the correct one, and the model receives a reward if it matches. The model then infers that the entire trace of decisions (or possibly several traces) led to the correct outcome. But here is the catch, and it connects to a broader question about when models should stop reasoning altogether:

“Every single token gets up-weighted – do more of this. People will say that your estimator has high variance, but it’s just noisy,” Karpathy says. He notes that RL does two main things that a human would never do:

  1. Each single step that brought the model to the correct solution is not necessarily the right thing to do, but it still gets up-weighted, which adds noise to the learning signal. Karpathy used a perfect metaphor of “sucking the bits of supervision of the final reward signal through a straw.”

  2. A human would never do hundreds of rollouts. Humans revisit the entire process of thinking, analyzing which parts of the solution worked well and which ones were bad or could be done better. “There is nothing in current LLMs that does this,” Karpathy concludes.
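A tiny sketch makes the first complaint visible. It assumes the simplest outcome-only setup, where one scalar from the final check is broadcast to every step of the trace; the trace itself is invented.

```python
def outcome_only_weights(steps, final_reward):
    """Every step in the trace inherits the same scalar from the final check."""
    return {step: final_reward for step in steps}


# Hypothetical reasoning trace that happens to end in the right answer.
trace = [
    "set up the equation",
    "detour: try an irrelevant substitution",
    "solve for x",
    "state the final answer",
]

weights = outcome_only_weights(trace, final_reward=1.0)
```

The useless detour gets exactly the same "do more of this" signal as the steps that actually mattered, which is the noise Karpathy is describing.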

So the question remains open: How far can the classic reward-model + policy-gradient framework go? One direction that addresses interpretability directly is Natural Language Reinforcement Learning (NLRL) — an approach that redefines RL's core concepts like policy, value function, and Bellman equation entirely in natural language.

The most popular Policy Optimization algorithms in 2025

It was definitely the year when dozens of policy optimization algorithms could appear in just one week. Policy optimization refers to RL methods that directly adjust the parameters of a policy to increase expected return (the sum of rewards over time). These methods use gradient-based updates to push the policy toward outputs that yield higher rewards while maintaining stability (e.g., via clipping or trust regions). Here are a few policy optimization approaches that stand out the most:

  • Proximal Policy Optimization (PPO): Don’t forget the classics. PPO remains the most common RL algorithm for RLHF: stable, reliable, and effective in large action spaces. Human preferences train a reward model, and PPO adjusts the policy to favor preferred responses while limiting drastic shifts. Clipping and KL penalties keep the updates safe and stable.

  • Group Relative Policy Optimization (GRPO): GRPO has become a standard RL method for LLM reasoning tasks. Introduced by DeepSeek and used in models like DeepSeekMath, DeepSeek-R1, and DeepSeek-V3.2, it removes the need for a value function by normalizing rewards across multiple responses per prompt. The group-relative advantage gives a low-variance signal that stabilizes large-scale training, reduces noise, and works well for sparse-reward tasks like math. Thanks to efficiency and support for multi-step reasoning, GRPO is now central to RLVR-style models.

Image Credit: “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” paper

DeepSeek-V3.2 applies GRPO with stability fixes – an unbiased KL estimator, off-policy masking, and consistent MoE routing. It unifies reasoning, agentic behavior, and alignment into one RL stage. After distillation from specialists, the model is refined with large-scale RL using rule-based and generative rewards, enabling reliable scaling and combining many skills in a single model.
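To make the group-relative advantage from the GRPO description concrete, here is a minimal sketch: sample several responses to one prompt, score each, and normalize the scores within the group. Using the population standard deviation and a small `eps` for numerical safety are assumptions of this sketch, not details confirmed by the text above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    No learned value function is needed: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same math prompt, scored by a binary verifier.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every answer gets the same reward, the group carries no learning signal.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, while a group with identical rewards contributes nothing to the update.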

  • Group Sequence Policy Optimization (GSPO): This approach from Qwen Team, Alibaba replaces token-level updates with sequence-level optimization. Instead of computing importance ratios for each token as GRPO does, GSPO computes a sequence-likelihood importance ratio, then applies clipping and rewards at the full-sequence level. It also normalizes rewards across multiple responses to the same prompt, producing stable sequence-level advantages.

  • Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO): Fixes instability in long chain-of-thought RL training. It uses Clip-Higher to loosen the upper clipping bound and prevent entropy collapse, keeping the model’s behavior diverse. It also adds Dynamic Sampling, which adjusts the number of rollouts during training to improve stability and efficiency.

  • BAlanced Policy Optimization (BAPO): Fixes instability in off-policy LLM training. PPO and GRPO often collapse because negative samples dominate and fixed clipping blocks useful exploration. BAPO instead uses adaptive clipping bounds that adjust to the ratio of positive vs. negative signals in each batch, allowing beneficial low-probability updates while dampening harmful negatives. This stabilizes training, improves exploration, and delivers strong reasoning results on AIME 2024/2025.

Image Credit: BAPO original paper

  • Agentic Reinforced Policy Optimization (ARPO): Designed for multi-turn LLM agents with tool use. After a tool call, token entropy spikes, signaling uncertainty. ARPO uses this to trigger adaptive rollouts, sampling extra branches to explore tool behaviors. Advantage attribution then assigns credit to main and branched paths, teaching the model which tool-use steps helped.

A collection of policy optimization methods appears at the end of the article.
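Several of the methods in this list differ mainly in how they compute and clip the importance ratio. The sketch below shows three ideas side by side: the PPO-style clipped objective, a length-normalized sequence-level ratio in the spirit of GSPO, and an asymmetric upper bound in the spirit of DAPO's Clip-Higher. The function names and all epsilon values are illustrative assumptions, not settings taken from the papers.

```python
import math


def token_ratios(logp_new, logp_old):
    """One importance ratio per token (token-level view)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """One ratio per response: the length-normalized sequence-likelihood ratio."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate; eps_high > eps_low loosens the upper bound."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Per-token log-probs of one sampled response under the new and old policy.
logp_new = [-1.0, -2.0]
logp_old = [-1.5, -2.5]

per_token = token_ratios(logp_new, logp_old)
per_sequence = sequence_ratio(logp_new, logp_old)

# Symmetric clipping caps the gain from a large ratio at 1 + eps.
symmetric = clipped_objective(1.5, advantage=1.0)
# A looser upper bound lets good low-probability samples move further.
clip_higher = clipped_objective(1.5, advantage=1.0, eps_low=0.2, eps_high=0.28)
```

Token-level methods apply the clipped objective once per token, while a sequence-level method applies it once per response using the single sequence ratio.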

What will we take into 2026? Important trends and advances in RL

RL Scaling

If models improve their performance when scaled, why not apply the same idea to RL? Researchers from Princeton University and Warsaw University of Technology demonstrate that scaling RL networks to much greater depth, up to 1024 layers, can greatly boost the performance of self-supervised, goal-conditioned reinforcement learning. Using Contrastive RL (CRL) with residual blocks, LayerNorm, and Swish activations, they train agents in sparse-reward Brax and MJX environments and find large gains:

  • 2–5× on manipulation tasks,

  • 20× on long-horizon mazes,

  • 50×+ on humanoid control.

Image Credit: “1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities” paper

CRL learns by pulling together embeddings of real (state, action, goal) triples and pushing apart mismatched ones. Deeper models develop richer representations and more capable behaviors once they cross certain “critical depth” thresholds. Greater depth also allows larger batch sizes and better generalization, although it offers limited benefit in offline RL where exploration is missing.
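The "pull together, push apart" idea is a contrastive loss. Below is a minimal InfoNCE-style sketch in plain Python. It assumes the (state, action) pair and the goal have already been embedded as vectors and that similarity is a dot product, neither of which is confirmed by the text above. The deep residual encoders are left out entirely.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(sa_embeddings, goal_embeddings):
    """Contrastive loss over a batch.

    Row i of sa_embeddings and row i of goal_embeddings form a real pair;
    every other goal in the batch serves as a mismatched negative.
    """
    total = 0.0
    for i, sa in enumerate(sa_embeddings):
        logits = [dot(sa, g) for g in goal_embeddings]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_norm - logits[i]  # -log softmax of the true pair
    return total / len(sa_embeddings)


sa = [[1.0, 0.0], [0.0, 1.0]]
matching_goals = [[1.0, 0.0], [0.0, 1.0]]
swapped_goals = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce_loss(sa, matching_goals)
misaligned_loss = info_nce_loss(sa, swapped_goals)
```

The loss is lower when each (state, action) embedding sits closest to its own goal, which is the behavior the encoders are trained toward.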

Reinforcement Pre-Training (RPT)

Microsoft Research proposed a way to scale language models by reframing next-token prediction as a small RL reasoning task. The model is trained with RL and receives a verifiable reward each time it predicts the correct next token from the text corpus. This approach allows RL to scale across massive web-text datasets without human labels or hand-crafted reward functions.
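The reward itself is almost trivially simple, which is why it scales. Here is a sketch assuming exact token match against the corpus. The function name, the toy corpus, and the rollout predictions are invented, and the reasoning each rollout would produce before its prediction is omitted.

```python
def next_token_rewards(corpus_tokens, position, rollout_predictions):
    """Score each rollout's final prediction against the real next token."""
    target = corpus_tokens[position]
    return [1.0 if pred == target else 0.0 for pred in rollout_predictions]


corpus = ["the", "cat", "sat", "on", "the", "mat"]

# Context is corpus[:2]; four rollouts each reason, then guess the token at index 2.
rewards = next_token_rewards(
    corpus,
    position=2,
    rollout_predictions=["sat", "ran", "sat", "slept"],
)
```

Because the ground truth is just the next token in the text, every position in a corpus becomes a verifiable task with no labeling required.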

Image Credit: “Reinforcement Pre-Training” paper

RPT rewards the model for explaining why a token should follow, which encourages deeper next-token reasoning and builds a stronger foundation for later RL stages. A newer approach, PretrainZero, uses reinforcement active pre-training: the model selects which spans to mask and then learns to predict them using two RL-trained policies. This turns raw text into a self-supervised RL signal and bypasses the “verification data wall” created by limited ground-truth answers. By gradually increasing difficulty with harder masks, PretrainZero improves general reasoning and math skills even before post-training RL begins.

Multi-objective RL

The idea behind it is to train LMs to balance multiple goals at once, like being helpful but also concise or both harmless and creative, ensuring that improving one goal doesn’t ruin another. RLHF can’t do this because it optimizes a single reward. One of the interesting approaches here is PAreto Multi-Objective Alignment (PAMA) which reframes alignment as a convex optimization problem using a PPO variant (Noon PPO) that only applies positive-advantage updates. PAMA converges to a Pareto-stationary point, improving multiple objectives at once – for example, boosting sentiment and length rewards by 30–50% and increasing harmlessness by +0.5 to +1.0 over baselines.
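The Pareto idea is worth seeing in isolation. The sketch below is not PAMA's algorithm; it only shows what it means for one candidate to dominate another, and which candidates survive as trade-offs, when each is scored on two objectives. The scores are invented.

```python
def dominates(a, b):
    """a dominates b if it is at least as good everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Hypothetical (helpfulness, conciseness) scores for four candidate policies.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
front = pareto_front(candidates)
```

Here (0.4, 0.4) drops out because (0.5, 0.5) beats it on both objectives, while the other three remain as legitimate trade-offs that a single scalar reward would have to rank arbitrarily.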

Agentic RL

RL shows up in every domain now. Since it began as a framework for agents, agentic RL became the natural direction of progress in 2025.

The power moments are seen in:

  1. Software and coding: RL is particularly effective in this domain due to clear, verifiable reward signals: compiler errors, test outcomes, execution traces, performance metrics, and formal verification checks. These signals enable LLMs to develop multi-step coding abilities, like planning, debugging, and tool orchestration. For example, SWE-RL leverages real software-evolution data – code diffs, issues, and pull requests – to model and reconstruct developer reasoning, while Qwen3-Coder employs execution-based RL and long-horizon Agent RL, supported by over 20,000 parallel environments, to handle everything from planning features to fixing bugs and writing tests. Yes, this is already our reality.

  2. Mathematics

    • Informal math: RL helps models mix natural language with Python, call external tools, and self-correct intermediate steps. This boosts multi-step reasoning, debugging, and symbolic manipulation.

    • Formal math: Theorem proving can be framed as an MDP with verifier-based rewards. Systems like DeepSeek-Prover and Leanabell-Prover typically use MCTS, subgoal generation, and verifier feedback to construct valid proofs.

  3. GUI Agents progressed from zero-shot Vision–Language Models (VLMs) to supervised offline training and now to RL systems capable of multi-step interaction. Static RL approaches, like GUI-R1 or UI-R1, improve grounding using correctness and subgoal rewards, while interactive RL (for example, WebAgent-R1, MobileGUI-RL, ComputerRL) trains via online rollouts in real browsers and apps, using GRPO variants and large-scale async RL.

  4. Vision Agents: RL now boosts vision models by giving them precise rewards (IoU, mAP, keyframe scores) to improve detection, segmentation, video reasoning, and 3D understanding. RL also improves diffusion and autoregressive image generation through reward-guided updates, like Flow-GRPO. So there is a notable shift from just predicting pixels to optimizing directly for quality signals.

RL works not only for individual agents but can also be applied to multi-agent systems (MAS), training agents jointly with the main optimization methods like PPO, GRPO, and others.

RL in Robotics

Interest in robotics surged this year, and robots continue to learn through a mix of reinforcement learning and imitation learning (behavior cloning). RL trains a robot to learn a control policy that maximizes cumulative reward within an MDP, using algorithms such as TRPO (trust-region updates), PPO (clipped policy optimization), and SAC (an off-policy method that maximizes reward plus entropy). Training typically begins in simulation, with domain randomization to shrink the reality gap, while real-world learning improves through off-policy data reuse and human demonstrations.

Robots rely on tight feedback loops to manage noise and complex dynamics, and human feedback often mirrors RLHF. Human-in-the-loop systems like HIL-SERL have reached success rates near 99 percent on challenging manipulation tasks within one to two hours on low-cost robots.

Image Credit: Robot Learning: A Tutorial

That covered where and how RL is applied today. But across all these systems there are still crucial challenges the community is trying to solve.

Not without limitations

  • Models can “reward hack” by finding loopholes that maximize reward without actually improving their underlying capability, a pattern we also saw in production with o3 during METR’s evaluations.

  • Designing a good reward signal is hard. Misspecified or overly simplistic rewards push the agent toward brittle or unintended behaviors, especially for complex objectives.

  • Training is expensive. Large models and agents need huge numbers of interactions to learn stable policies, which makes RL far more computationally demanding than supervised fine-tuning.

  • Feedback quality sets the ceiling. RL depends on human or AI feedback – or verifier accuracy – and ensuring that feedback is reliable, diverse, and fair is a difficult problem.

  • Updates can destabilize pretrained models. Without careful KL control and tuning, RL steps can distort the model’s knowledge or collapse its behavior.

  • Long-horizon credit assignment remains unsolved. When rewards are delayed or sparse, RL struggles to identify which early decisions mattered, which limits reliability on multi-step or reasoning-heavy tasks.

Conclusion

The state of RL in 2025 is one of optimistic progress mixed with a clear view of what still needs work. This year’s advances – LLM alignment via RL, RLVR in reasoning models, autonomous code-generation improvements, and more capable robotic planning – created real momentum. RL helped produce agents that can reflect and reason in ways that felt out of reach only a few years ago. Moving into 2026, the trend line points toward reinforcement learning becoming a more reliable, integrative layer inside AI systems rather than an isolated technique.

At the same time, questions about whether RL is the right long-term strategy for broad AI development remain on the table. The field wants better optimization across every step of a model’s solution path so we avoid amplifying weak reasoning and burning unnecessary compute. Verification is becoming the central pillar: stronger verifiers, more interpretable reasoning traces, and clearer signals that can steer models without destabilizing them. Much of the 2026 work is likely to focus on this intersection of optimization, verification, and stability.

And here’s the bigger perspective. RL has deep roots going back to the 1950s. It could stay central, or it could end up as one piece in a larger family of algorithms. The community is increasingly open to that possibility. The real promise of 2026 isn’t a bet on “better RL,” but an environment where RL mixes with new architectures, new objective functions, and new ways of shaping model behavior. It’s a year for trying ideas that don’t fit neatly into today’s playbook.

In 2026, we get to reopen the frontier and experiment again.

Sources and further reading

Policy Optimization Techniques

Resources from Turing Post
