
AI 101: The State of Reinforcement Learning in 2025

What we have – and what we can carry from 2025 into 2026 – in reinforcement learning

Reinforcement Learning (RL) – a topic that was constantly present in 2025. Ever since DeepSeek-R1 kicked off the year with a major reasoning breakthrough, researchers and developers have increasingly turned to RL-based training as a way to strengthen reasoning and agentic behavior. We saw hundreds of RL papers everywhere: in LLMs, VLMs, the coding domain, agentic systems, robotics, and beyond.

Unlike supervised learning (which depends on labeled datasets) or unsupervised learning (which searches for patterns without guidance), RL is built on interaction. The system learns by doing, by nudging the environment and watching how it reacts. Rewards, penalties, surprises, and mistakes shape the policy more than any static dataset ever could. In other words, it pushes models closer to real-world behavior.

But is reinforcement learning really as strong as it looks right now? This year also surfaced a set of concerns about the strategy.

Today we’re zooming out to see what RL achieved in 2025, where it ran into walls, and what momentum it brings into 2026. Let’s begin the full year-in-review.

Reinforcement Learning is a lot worse than I think the average person thinks. Reinforcement Learning is terrible. It just so happens that everything that we had before is much worse.

Andrej Karpathy

In today’s episode, we will cover:

  • Revisiting some basics

  • Reinforcement Learning from Human Feedback (RLHF) becomes classic

  • From human to AI judgement: Reinforcement Learning from AI Feedback (RLAIF)

  • Reinforcement Learning with Verifiable Rewards (RLVR) – The new promise?

  • Why does Andrej Karpathy think that RL is terrible?

  • The most popular Policy Optimization algorithms in 2025

  • What will we take into 2026? Important trends and advances in RL:

    • RL Scaling

    • Reinforcement Pre-Training (RPT)

    • Multi-objective RL

    • Agentic RL

    • RL in Robotics

  • Not without limitations

  • Conclusion

  • Sources and further reading

Revisiting some basics

Reinforcement learning is a type of machine learning where a model learns through trial-and-error interactions with an environment. Traditionally, RL centered on building explicit agents that take actions and receive rewards. The landscape shifted this year: RL techniques moved well beyond classical agent settings and showed up across model training, optimization, reasoning, and even data curation.

At its core, the setup is simple. The agent takes an action, observes the outcome, and receives a reward or penalty based on how good that action was. Over time it tries to learn a policy that picks actions leading to high long-term reward. RL problems are usually framed as a Markov Decision Process (MDP) with the following elements:

  • States: the situations the agent can find itself in

  • Actions: the choices available to the agent

  • Transitions: how actions move the agent from one state to another

  • Rewards: the feedback signal guiding learning
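To make this loop concrete, here is a minimal, self-contained sketch (not from the article) of tabular Q-learning on a hypothetical five-state corridor. All names and numbers are invented for illustration; the point is simply to show states, actions, transitions, and rewards driving trial-and-error learning.

```python
# Toy illustration: tabular Q-learning on a tiny 5-state corridor MDP.
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.2

# Q-table: the agent's estimate of long-term reward for each (state, action)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Transition + reward: reaching state 4 pays +1, every move costs 0.01."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else -0.01
    return next_state, reward, next_state == N_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy policy: mostly exploit, occasionally explore
        if random.random() < EPS:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Trial-and-error update toward reward + discounted future value
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# The learned policy should move right (+1) in every state
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```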

When RL is applied to LLMs:

  • The state is usually the entire prompt or conversation.

  • The action space is the model’s entire vocabulary: each action is the choice of the next token.

RL algorithms aim to find the optimal policy, meaning the best possible strategy for earning the most reward over time. RL lets models learn from types of feedback that are not differentiable, such as human opinions or preference comparisons. It is especially useful for two things:

  • Alignment: The model can learn which responses people prefer, even when there’s no clear “correct” answer.

  • Reasoning: The model can be rewarded for producing correct step-by-step reasoning that leads to accurate solutions.

2025 taught us a lot about both parts.

Reinforcement Learning from Human Feedback (RLHF) becomes classic

Let’s start with alignment. Models can hallucinate, produce biased or unsafe content, and stumble on complex instructions. Alignment matters because it nudges AI toward being more helpful and more consistent, whether it’s a chatbot holding a coherent conversation or a code model finally passing its unit tests.

Reinforcement Learning from Human Feedback (RLHF) is by now the classic alignment strategy for LLMs, and in 2025 it remained the default. It typically follows three steps:

  • Supervised Fine-Tuning (SFT): The model is trained on examples written by humans. This teaches it what “good” responses look like and provides a strong starting point.

  • Reward Model Training: Humans compare pairs of model responses and pick the better one. A “reward model” learns to predict these preferences.

  • Reinforcement Learning itself: The LLM is then optimized with algorithms like Proximal Policy Optimization (PPO) so it learns to produce responses the reward model scores highly. A KL penalty acts as a safety rail by preventing the policy from drifting too far from the SFT model, which keeps training stable and stops the model from warping its original knowledge.
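As a rough illustration of that third step, here is a simplified sketch of the KL-penalized reward that PPO-style RLHF optimizes. It assumes you already have per-token log-probabilities from the policy and the frozen SFT model plus a scalar reward-model score; real pipelines add clipping, advantage estimation, and many other details.

```python
# Simplified sketch (an assumption-laden illustration, not any lab's exact recipe)
# of the reward used in the RL step of RLHF: the reward model's score minus a
# KL penalty that keeps the policy close to the SFT model.
import torch

def rlhf_reward(policy_logprobs: torch.Tensor,
                sft_logprobs: torch.Tensor,
                reward_model_score: torch.Tensor,
                kl_coef: float = 0.1) -> torch.Tensor:
    """policy_logprobs / sft_logprobs: log-probs of the sampled response tokens
    under the current policy and the frozen SFT model, shape [batch, seq_len].
    reward_model_score: scalar preference score per response, shape [batch]."""
    # Per-token KL estimate between the policy and the SFT reference
    kl_per_token = policy_logprobs - sft_logprobs
    kl_penalty = kl_coef * kl_per_token.sum(dim=-1)   # one penalty per response
    # Final reward handed to PPO: preferred responses score high,
    # but drifting far from the SFT model is discouraged.
    return reward_model_score - kl_penalty

# Example with dummy tensors
pol = -torch.rand(2, 8)                 # fake log-probs (negative values)
ref = -torch.rand(2, 8)
scores = torch.tensor([0.7, -0.2])      # fake reward-model scores
print(rlhf_reward(pol, ref, scores))
```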

However, human feedback comes with difficulties: it is expensive and slow to collect. That's why, in our era of AI-powered optimization, Reinforcement Learning from AI Feedback (RLAIF) has also become popular.

Image Credit: “RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback” paper

From human to AI judgement: Reinforcement Learning from AI Feedback (RLAIF)

RLAIF follows the same steps as RLHF, but instead of collecting human preference data, it replaces or supplements human evaluations with AI evaluators that judge model outputs. These evaluators may be:

  • powerful general-purpose models

  • specialized classifiers trained to detect specific kinds of problems, like toxicity, bias, factuality, etc.

  • ensembles of smaller, targeted models

The reward model is then trained to match the AI feedback, and different optimization algorithms are used to fine-tune the LLM just like in RLHF.
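Here is a schematic sketch of that labeling step, with an invented ai_judge stub standing in for the AI evaluator and a standard pairwise (Bradley-Terry style) loss for the reward model. This is an illustration of the idea, not any specific lab's implementation.

```python
# Sketch of the RLAIF labeling step (illustrative only): an AI evaluator picks the
# preferred response, and a reward model is trained on those AI-generated labels.
import torch
import torch.nn.functional as F

def ai_judge(prompt: str, response_a: str, response_b: str) -> int:
    """Hypothetical evaluator stub: in practice this would call a strong
    general-purpose model or a specialized classifier.
    Returns 0 if A is preferred, 1 if B is preferred."""
    return 0 if len(response_a) >= len(response_b) else 1   # placeholder heuristic

def pairwise_reward_loss(score_chosen: torch.Tensor,
                         score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward model to score the
    AI-preferred ('chosen') response above the rejected one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy usage: pretend the reward model produced these scores for a batch of pairs
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.8])
print(pairwise_reward_loss(chosen, rejected))
```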

Benefits of RLAIF:

  • Far more scalable than human labeling

  • Cheaper and faster

  • Often more consistent than human annotators

But there are some limitations too:

  • AI evaluators may not perfectly reflect human values.

  • Evaluator biases can be passed on to the trained model.

  • Feedback loops may amplify errors over time.

One interesting example of RLAIF is Curriculum-RLAIF, which helps the reward model generalize better and produce stronger alignment during RL. It trains reward models the same way you’d teach a student: start with simple, clear-cut examples and gradually introduce harder, noisier ones. Easy pairs are generated with guided prompts, hard pairs with random sampling, and medium-difficulty “bridge” pairs sit in between. These pairs are sorted from easy to hard and fed to the reward model in that order.

Image Credit: Curriculum-RLAIF original paper
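As a toy sketch of the curriculum ordering described above (example pairs and difficulty labels are invented; the actual method generates and grades pairs as described in the paper):

```python
# Toy sketch of the curriculum idea: preference pairs are tagged with a
# difficulty level and fed to the reward model from easy to hard.
EASY, MEDIUM, HARD = 0, 1, 2

preference_pairs = [
    {"chosen": "clear good answer",   "rejected": "clearly bad answer",    "difficulty": EASY},
    {"chosen": "subtly better answer","rejected": "plausible but flawed",  "difficulty": HARD},
    {"chosen": "mostly better answer","rejected": "partially wrong answer","difficulty": MEDIUM},
]

# Curriculum ordering: clear-cut pairs first, noisy hard pairs last
curriculum = sorted(preference_pairs, key=lambda pair: pair["difficulty"])

for pair in curriculum:
    # train_reward_model_step(pair)  # placeholder for one reward-model update
    print(pair["difficulty"], pair["chosen"], ">", pair["rejected"])
```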

Beyond alignment, RL has also been used to boost the reasoning abilities of models, leading to the emergence of an entire class of Reasoning Models, with early examples like DeepSeek-R1 and Kimi-1.5.

Reinforcement Learning with Verifiable Rewards (RLVR) – The new promise?

The beginning of 2025 shocked everyone with the release of DeepSeek-R1 on January 20. It was the first model to deliver advanced reasoning on par with OpenAI’s o1 (at that time) and make it available to the public. For the first time, we could actually watch the reasoning unfold, since the open model produced long, detailed chains of thought that showed how it arrived at its answers. The backbone for this was the Group Relative Policy Optimization (GRPO) algorithm, which sits inside a broader training approach, Reinforcement Learning with Verifiable Rewards (RLVR), that strengthens the model’s reasoning ability.
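At its core, GRPO replaces a learned value network with a group-relative baseline: sample several responses to the same prompt, score them, and normalize each reward within the group. Here is a minimal sketch of that advantage computation (simplified, not the full algorithm):

```python
# Minimal sketch of GRPO's core idea: group-normalized rewards act as advantages,
# so no separate value network is needed.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape [group_size], one scalar reward per sampled response
    to the same prompt. Returns one advantage per response."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one prompt, only two were verified correct
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))   # correct answers get positive advantage
```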

RLVR is a part of the current trend of moving from subjective alignment, like humans labeling outputs, to objective alignment, using rewards that can be proven correct or automatically checked, like math or code.

Here is how RLVR works:

  • A model tries different reasoning paths and generates its final output (answer, code, plan, tool call, reasoning steps).

  • A verifier checks it via unit tests, math solvers, consistency rules, tool execution, logic constraints.

  • Then the verifier returns a binary or numeric reward.

  • The model is optimized (often via GRPO/PPO-style RL) using this reward.
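As a toy example of the verifier step, here is a sketch of a simple math-style checker that returns a binary reward. Real RLVR setups rely on unit tests, solvers, or execution environments rather than this kind of string match; the function name and heuristic here are invented for illustration.

```python
# Toy verifier in the RLVR spirit (illustrative, not a production checker):
# the reward is 1.0 only if the model's final answer matches the ground truth.
import re

def math_verifier_reward(model_output: str, ground_truth: str) -> float:
    """Extract the last number in the output and compare it exactly."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(math_verifier_reward("... so 17 + 25 = 42. The answer is 42", "42"))  # 1.0
print(math_verifier_reward("The answer is 41", "42"))                       # 0.0
```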

Researchers from LeapLab (Tsinghua University) and Shanghai Jiao Tong University decided to test whether RLVR is actually as impressive as many claim.

A perfect roundup for you to refresh your understanding and see how far RL has evolved this year. A quick reset before we dive into the new wave of ideas shaping 2026.

