This is part of our AI 101 series – and what could be more 101 than reinforcement learning? It’s everywhere in the conversation right now, so we put together a clear guide on what RL is and where it’s headed. Enjoy.
Reinforcement Learning (RL) – the idea of agents learning through trial and error – was shaping real-world systems long before it became the backbone of modern AI. Richard Sutton, often called the father of RL, recently gave an interview that pushed us to think bigger: what could RL unlock for the global AI industry if we truly commit to trial-and-error learning for agents?
But to analyze the future, we need to stand firmly on the foundation of the past and present.
So today we’ll walk you through RL’s milestones from the very beginning: the math foundations and first approaches like Temporal-Difference (TD) learning, actor-critic methods, and Monte Carlo techniques such as REINFORCE (widely revisited today because of its similarity to modern GRPO), then the Deep RL revolution of the 2010s, and finally RL’s widespread applications today via Reinforcement Learning from Human Feedback (RLHF) and its main approaches, PPO and GRPO.
And regarding the future of RL, let’s analyze how Andrej Karpathy and Richard Sutton see it.
In today’s episode, we will cover:
Earliest Foundations of RL: Psychology and Math
Dynamic Programming and MDP
Samuel’s Checkers program and MENACE machine
The birth of reinforcement learning
Temporal-Difference learning
Monte Carlo methods: learning from complete returns
REINFORCE: The policy gradient method everyone is revisiting now
Actor-critic methods
Deep Reinforcement Learning Revolution in 2010s
Google DeepMind’s Alpha milestones: What do they mean for RL?
RL for LLMs
Reinforcement Learning from Human Feedback (RLHF)
From PPO to GRPO
Conclusion: Why today’s LLMs and RL aren’t enough
Sources and further reading
Earliest Foundations of RL: Psychology and Math
Reinforcement learning (RL) grew out of two separate areas that later merged into one big field that developers keep enriching today:
Trial-and-error learning (the psychology side), inspired by two main studies:
Edward Thorndike’s law of effect (1898) which says: "Responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation."
Operant conditioning (B.F. Skinner, 1930s–1950s): Building on Thorndike, Skinner showed how voluntary behaviors are shaped by their consequences. Actions followed by rewards (reinforcements) become more likely to be repeated, while those followed by negative outcomes (punishments) tend to decrease.

Image Credit: Wikipedia
Thorndike showed this with his puzzle-box cats learning to escape through trial and error, while Skinner advanced this with his “Skinner Box,” using controlled experiments with animals to study how different reinforcement schedules affect behavior.
Optimal control (the math and engineering side), focused on how to control systems in the best possible way, like guiding a rocket or managing resources.
The second area laid the groundwork for many early and current RL methods. Let’s break down where it all began.
Dynamic Programming and MDP
To start its long journey, RL needed a strong mathematical and computational foundation, which appeared in the 1950s with Richard Bellman’s work on Dynamic Programming and the Bellman equations for optimal control. This provided the “principle of optimality” – a way to compute optimal policies by breaking problems into a sequence of subproblems.
When combined with the concept of Markov chains (introduced by Russian mathematician Andrey Markov in the early 1900s), researchers later developed the framework of Markov Decision Processes (MDPs). An MDP captures decision-making in uncertain situations:
the Markov property means the next state depends only on the current one
while the “decision process” means that actions influence what happens next.

Image Credit: Markov decision process (MDP), Wikipedia
In RL, MDP is a go-to framework for modeling how an agent interacts with its environment. The setup boils down to three things:
States (where the agent is),
Actions (what it does), and
Rewards (what it gets as feedback, ending up in a new state).

Image Credit: Adaptive RL-based Routing Protocol for Wireless Multihop Networks
Here we can form a definition: reinforcement learning is a branch of machine learning that focuses on teaching models how to make decisions by trial and error. The goal is to find a policy (a way of choosing actions) that works well in an environment where the rules are not known in advance, and the agent has nothing to rely on but its own experience. Everything is uncertain, and the challenge is to learn the best policy using as few trials as possible to maximize rewards.
This is how we understand the core idea of RL today. Do you see it? It mixes trial-and-error psychology with optimal control math, creating the foundation for the learning process.
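To make this concrete, here is a minimal sketch of the agent–environment loop that everything below builds on (Python, with a hypothetical env and policy rather than any specific library):

```python
# Minimal sketch of the agent–environment loop in an MDP.
# `env` and `policy` are hypothetical placeholders, not a specific library API.

def run_episode(env, policy, max_steps=1000):
    state = env.reset()                 # start in some initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)          # the policy maps states to actions
        next_state, reward, done = env.step(action)  # the environment reacts
        total_reward += reward          # rewards are the only learning signal
        state = next_state              # Markov property: only the current state matters
        if done:
            break
    return total_reward
```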
What were the early RL algorithms that first followed this concept?
Samuel’s Checkers program and MENACE machine
Arthur Samuel’s checkers program (1950–60s) was one of the first true demonstrations of machine learning, and also of RL. Samuel introduced search techniques like alpha-beta pruning to evaluate promising positions in checkers and gave the program the ability to learn from experience.
Alpha = the best score the maximizing player (say, you) is guaranteed so far.
Beta = the best score the minimizing player (the opponent) is guaranteed so far.
By recording board positions and whether they led to wins or losses, the program updated its evaluations over time, steadily improving its play through thousands of self-play games. Samuel even coined the term machine learning in 1959 to describe this process, and in 1962 his program famously defeated a checkers master.

Image Credit: “Some Studies in Machine Learning Using the Game of Checkers” by A.L. Samuel
Another example comes from 1961. AI researcher Donald Michie built a mechanical “computer” out of 304 matchboxes to play tic-tac-toe (also called noughts and crosses). This machine was known as the Matchbox Educable Noughts and Crosses Engine (MENACE).

Image Credit: Menace: the Machine Educable Noughts And Crosses Engine by Oliver Child, chalkdust magazine
To make a move, the operator picked the matchbox for the current board state. Inside were colored beads, each marking a possible move. The box was shaken, and the bead that rolled out chose MENACE’s play. The bead’s color showed where to place MENACE’s “O.” The game went on like this, box by box, until it ended.
After each game, MENACE’s beads were updated: winning moves got extra beads, making them more likely, while losing moves had beads removed, making them less likely. This simple feedback loop was an early demonstration of RL in action.
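As a rough illustration (our own simplification – the real machine used specific initial bead counts and reward amounts), MENACE’s learning rule looks roughly like this in code:

```python
import random

# Rough sketch of MENACE's bead mechanism (simplified; the initial bead counts
# and the reward/penalty amounts below are assumptions, not Michie's exact values).
boxes = {}  # board state -> {move: bead count}

def choose_move(state, legal_moves):
    beads = boxes.setdefault(state, {m: 4 for m in legal_moves})  # assumed initial count
    moves, counts = zip(*beads.items())
    return random.choices(moves, weights=counts)[0]   # shake the box, draw a bead

def update_after_game(history, won):
    # history: list of (state, move) pairs MENACE played this game
    for state, move in history:
        if won:
            boxes[state][move] += 3                               # reinforce winning moves
        else:
            boxes[state][move] = max(1, boxes[state][move] - 1)   # discourage losing moves
```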
However, the emergence of RL as an independent field happened a little bit later.
The birth of reinforcement learning
Richard Sutton, who is often called one of the fathers of reinforcement learning, together with Andrew Barto formalized reinforcement learning in their famous book “Reinforcement Learning: An Introduction”, first published in 1998.
But before that, a huge amount of work was done to create the entire new field with different methods and ideas. The very moment of RL birth happened when the three main threads were finally merged:
trial-and-error psychology,
optimal control, and
temporal-difference methods.
And there is a lot to say about the last one.
Temporal-Difference Learning
In 1988, Richard Sutton introduced Temporal-Difference (TD) learning – a method for computers or agents to learn to make predictions about the future by using past experience even when the system they are in isn’t fully known. Unlike traditional prediction methods that compare predictions to actual outcomes (for example, supervised learning), TD learning uses intermediate hints. It doesn’t wait for the final outcome, it compares successive predictions and updates incrementally, learning a little bit every step. An easy example: Monday’s weather guess for Saturday can already be adjusted on Tuesday, using Tuesday’s improved guess.
You can see TD as minimizing prediction error through gradient steps. The key difference is the target used:
For supervised learning, the target is the actual final outcome.
TD’s target is the next prediction.
The TD learning approach turns out to be more accurate, because it avoids being misled by chance. Imagine a board game where a new position that leads into a usually bad state might still end in a lucky win. Taking this “win” result into account, supervised learning would wrongly label the position as good, while TD learning links it only to the bad state and so learns correctly.
So TD learning has notable advantages:
It saves memory and computation, because it doesn’t have to store all predictions until the end.
It makes faster and more accurate predictions.
It works well in real-world settings where waiting for final outcomes takes too much time.
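For the curious, here is a minimal sketch of the tabular TD(0) update described above, with placeholder states and typical hyperparameter values:

```python
from collections import defaultdict

# Tabular TD(0): move the value of the current state toward the
# bootstrapped target r + gamma * V(next_state), one step at a time.
V = defaultdict(float)    # value estimates, default 0
alpha, gamma = 0.1, 0.99  # learning rate and discount factor (illustrative values)

def td0_update(state, reward, next_state, done):
    target = reward + (0.0 if done else gamma * V[next_state])  # next prediction, not the final outcome
    V[state] += alpha * (target - V[state])                     # nudge the estimate toward the target
```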
In 1992 the concept of temporal-difference (TD) learning was used in an important RL milestone – TD-Gammon, a neural network created by IBM researcher Gerald Tesauro and trained to play backgammon. It played about 300,000 games against itself, where every game generated experience: board states, moves, outcomes. With this “knowledge”, TD-Gammon steadily improved its judgment of moves and reached near-professional strength.

Image Credit: IBM blog “The games that helped AI evolve”
Also, many years later, in 2004, Richard Sutton and Brian Tanner extended the TD learning idea into TD networks, connecting multiple predictions – each prediction can depend on others in the future. These systems allow time-based predictions (e.g. what will happen 5 steps later) and can build predictive state representations when the environment isn’t fully observable.
Soon after TD learning, Christopher Watkins introduced the model-free Q-learning algorithm in his 1989 thesis (a convergence proof with Peter Dayan followed in 1992). Like Sutton’s TD learning, it learns purely from experience and doesn’t need a full model or map of the world.
In Q-learning, the agent keeps track of the “quality” (the Q-value) of taking certain actions in certain states, updating it from experience. Over many episodes, as the agent repeatedly explores all state-action pairs, the Q-values gradually converge to the optimal action-values. Once they converge, the agent can act greedily, choosing the action with the highest Q-value in each state.
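The core of tabular Q-learning fits in a few lines – move the Q-value toward the reward plus the best estimated value of the next state. A minimal sketch (hyperparameters and the action set are placeholders):

```python
from collections import defaultdict

Q = defaultdict(float)    # Q[(state, action)] -> estimated action-value
alpha, gamma = 0.1, 0.99  # illustrative values

def q_learning_update(state, action, reward, next_state, actions, done):
    # Off-policy target: the best action in the next state,
    # regardless of what the agent will actually do next.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```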
Later, in 1994, the SARSA algorithm, proposed by G.A. Rummery and M. Niranjan, came out. It is an on-policy TD control method that learns action-values (Q) using the action taken under the current policy. It uses the full sequence of experience – (State, Action, Reward, Next State, Next Action) – which gives the name S-A-R-S-A. The process goes like this:
Start in a state and pick an action using the current policy, for example an ε-greedy strategy that balances exploration and exploitation.
Take the action, see the reward and the next state.
From that new state, pick the next action (again using the same policy).
Update the Q-value for the first state–action pair using the reward and the predicted value of the new state–action pair.
Repeat until the episode ends.
SARSA keeps learning while also improving the policy bit by bit. Over time, if every state–action pair is explored enough, the algorithm converges to the optimal policy.
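A minimal SARSA sketch, where the update uses the action actually chosen by the current ε-greedy policy (hyperparameter values are illustrative, not from the original paper):

```python
import random
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative values

def epsilon_greedy(state, actions):
    # Explore with probability epsilon, otherwise exploit the current Q-values.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(state, action, reward, next_state, next_action, done):
    # On-policy target: uses the action the agent will actually take next.
    target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```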
SARSA and Q-learning are illustrative examples of on-policy and off-policy RL algorithms:
On-policy (like SARSA) → An agent learns about the policy that it is currently following. So it uses one policy for both acting and learning.
Off-policy (Q-learning) → An agent acts with one policy (behavior policy), but learns about a different one (target policy) that is usually a better one.
Apart from the TD learning branch, another important model-free path formalized the “learn from experience” concept: Monte Carlo methods.
Monte Carlo methods: learning from complete returns
While TD methods update predictions step by step using other predictions (bootstrapping), Monte Carlo (MC) methods learn directly from the actual return – the sum of rewards collected once an episode (the sequence of steps until the agent reaches the goal) finishes.
A typical MC method follows the generalized policy iteration (GPI) framework:
It generates an episode using the current (exploring) policy.
Computes returns for each time step by summing rewards to the end of the episode.
Accumulates averages to update state and action value estimates.
As the MC algorithm learns which choices lead to better outcomes, it shifts its decision rule toward them.
To avoid getting stuck, the system keeps trying out different moves from time to time.
So, as a result, we repeat the GPI loop: evaluate with returns → improve policy → evaluate again.
MC algorithms’ strong point is that they only need samples and learn purely from real outcomes, averaging returns from actual interaction with the environment.
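Here is a minimal sketch of first-visit Monte Carlo value estimation, the evaluation half of the GPI loop above (the episode format is our own assumption):

```python
from collections import defaultdict

# First-visit Monte Carlo evaluation: average full returns per state.
V = defaultdict(float)
returns_count = defaultdict(int)
gamma = 0.99

def mc_update(episode):
    # episode: list of (state, reward_received_after_leaving_that_state) pairs, in time order
    G = 0.0
    returns = []
    for state, reward in reversed(episode):   # walk backwards so G accumulates the full return
        G = reward + gamma * G
        returns.append((state, G))
    returns.reverse()                         # restore time order for the first-visit check
    visited = set()
    for state, G in returns:
        if state not in visited:              # only the first visit to a state counts
            visited.add(state)
            returns_count[state] += 1
            V[state] += (G - V[state]) / returns_count[state]  # incremental average of returns
```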
MC naturally led to the next step in the RL story, but this time training the policy directly. That’s where REINFORCE comes in.
REINFORCE: The policy gradient method everyone is revisiting now
Everyone is now revisiting the REINFORCE algorithm (Ronald Williams, 1992), because today’s most popular method, Group Relative Policy Optimization (GRPO), is very close to it in spirit. (That’s why at Turing Post we explore history and earlier methods – they’re the key to today’s machine learning.)
So what is this REINFORCE?
It is a Monte Carlo policy-gradient method that also runs episodes, uses full-episode returns as the learning signal, and nudges policy parameters to make rewarding action choices more likely. More precisely, REINFORCE is a family of RL algorithms for neural networks with randomness built in. They adjust the network’s weights after each trial (episode) based on:
The received reward (r).
A baseline value (b), which acts like an average reward for comparison.
An eligibility term, which identifies how much a given weight influenced the chosen action.
The update rule is: Weight change = learning rate × (reward – baseline) × eligibility.
This formula explains the name REINFORCE: REward INcrement = Nonnegative factor × (reward – baseline) × Characteristic Eligibility.
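In code, one REINFORCE update can be sketched like this (PyTorch-style; log_probs, returns, and the baseline are placeholders you would collect while running the episode):

```python
import torch

# One REINFORCE update from a finished episode (sketch, not a full training loop).
# log_probs: list of log pi(a_t | s_t) tensors collected while acting
# returns:   discounted return from each step to the end of the episode
# baseline:  e.g. the average return, used only to reduce variance
def reinforce_update(optimizer, log_probs, returns, baseline):
    loss = torch.zeros(1)
    for log_p, G in zip(log_probs, returns):
        loss = loss - log_p * (G - baseline)   # (reward – baseline) × eligibility
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```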
REINFORCE is an important basic method because:
It’s simple and unbiased.
Doesn’t need a critic model.
Remains a conceptual foundation for many modern algorithms, including GRPO.
Monte Carlo and TD learning methods focus either on learning value functions – of states (how good it is to be there) or of state-action pairs (how good it is to take that move in that spot) – or on direct policy updates. But there is another way to perform RL, one that uses critic models.
Actor-critic methods
Researchers in the late 1970s–1980s (notably Barto, Sutton, and Anderson) developed actor–critic architectures, which combined two parts:
The actor as the decision-maker selecting actions according to a policy.
The critic that evaluates those actions by estimating value functions, using TD learning.
The two work together: the critic provides feedback (a temporal-difference error) that tells the actor how to adjust its policy parameters – strengthen or weaken the choice of action. Over time, this back-and-forth helps the agent improve its behavior.
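A tabular sketch of this loop (a simplification of the classic architecture, with made-up learning rates): the critic learns V(s) with TD, and the TD error nudges the actor’s action preferences:

```python
import math, random
from collections import defaultdict

# Simplified tabular actor–critic: the critic estimates V(s) with TD learning,
# the actor keeps action preferences H[(s, a)] and turns them into a softmax policy.
V = defaultdict(float)
H = defaultdict(float)
alpha_v, alpha_h, gamma = 0.1, 0.05, 0.99  # illustrative values

def softmax_policy(state, actions):
    prefs = [math.exp(H[(state, a)]) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]

def actor_critic_update(state, action, reward, next_state, done):
    # Critic: the one-step TD error says whether the action went better or worse than expected.
    td_error = reward + (0.0 if done else gamma * V[next_state]) - V[state]
    V[state] += alpha_v * td_error
    # Actor: strengthen or weaken the chosen action in proportion to the TD error.
    H[(state, action)] += alpha_h * td_error
```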

Image Credit: “Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto
This approach overcomes the high variance of policy gradients and the limits of value-based methods, which don’t directly optimize policies.
Actor-critic methods work well when there are too many possible actions to check one by one. Plus, the actor doesn’t always pick the same move but assigns probabilities to actions.
Actor-critic algorithms became the foundation for many modern methods, such as Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C), Deep Deterministic Policy Gradient (DDPG), and also one of the most popular methods now – Proximal Policy Optimization (PPO).
The RL history of the 20th century laid the foundation, and the 21st century brings not only breakthroughs but also reveals limitations: what works and what doesn’t in the race for efficiency. Now we move to this new 21st century stage.
Deep Reinforcement Learning Revolution in 2010s
In the 2000s, RL research continued steadily, though without the fanfare of later decades. A major focus was how to scale RL to larger, more realistic problems with function approximation, stabilizing policy optimization, and exploring hierarchy. Some examples are Natural Policy Gradient (2001), Least-Squares Policy Iteration (LSPI, 2001), R-MAX (2002), and Neural Fitted Q Iteration (2005).
By the late 2000s, advances in GPUs and large datasets transformed supervised learning with deep neural networks, laying the foundation for the new stage of RL – deep reinforcement learning (fascinating story of ImageNet+GPU+AlexNet is covered here).
This phase began with the founding of DeepMind in 2010 by Demis Hassabis, Shane Legg, and Mustafa Suleyman. Their breakthrough came with the Deep Q-Network (DQN), first presented in 2013 and made famous by the 2015 Nature paper. DQN combined Q-learning with convolutional neural networks to approximate the Q-value function and play Atari games directly from raw pixels. Here is how it works:
DQN uses a variant of Q-learning with stochastic gradient descent.
It employs experience replay, storing past experiences (state, action, reward, next state) and sampling them randomly during training to break correlations and stabilize learning.
A target network helps avoid the instabilities caused by bootstrapping.
Overall, DQN learns end-to-end from video frames and reward signals, without handcrafted features.
DQN was the first deep RL system to reach human-level play across Atari games, surpassing earlier methods and even expert humans. It proved that deep networks could power modern RL.
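A condensed sketch of one DQN training step showing the two stabilizing tricks, experience replay and the target network (PyTorch-style; q_net, target_net, and the buffer format are assumptions, not DeepMind’s exact code):

```python
import random
import torch
import torch.nn.functional as F

def dqn_train_step(q_net, target_net, optimizer, replay_buffer, batch_size=32, gamma=0.99):
    # Experience replay: sample past transitions at random to break correlations.
    # replay_buffer is assumed to be a list of (state, action, reward, next_state, done)
    # tuples of tensors, with actions stored as int64 and dones as floats.
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))

    # Q-values of the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets come from the frozen target network, not q_net,
    # so the network is not chasing a moving target.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```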
Also, around 2015, policy gradient and actor–critic methods started to be updated with new, more efficient ideas. Approaches like Trust Region Policy Optimization (TRPO), Deep Deterministic Policy Gradient (DDPG), and Asynchronous Advantage Actor–Critic (A3C) appeared, and in 2017 the outstanding Proximal Policy Optimization (PPO), introduced by OpenAI, came out.
Why is PPO so popular?
Its main feature is stability. It updates policies in small, controlled steps by clipping probability ratios, preventing large and risky changes. PPO combines policy improvement, value estimation, and an entropy bonus for balanced exploration, and uses Generalized Advantage Estimation (GAE) to smooth reward signals. This approach is simple, robust, and effective across diverse tasks, which is why it became one of the most widely used algorithms in the recent history of RL.
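The clipped objective at PPO’s core is short enough to show in full. A sketch (PyTorch-style, assuming the log-probabilities and advantages are precomputed):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Clipping keeps each update inside a small, trusted region around the old policy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```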
Meanwhile, DeepMind was producing a string of breakthroughs, showing the huge potential of RL across different application fields.
Google DeepMind’s Alpha milestones: What do they mean for RL?
Strict RL breakthroughs
AlphaGo (2016) – policy + value networks with MCTS; trained on human expert games then refined by reinforcement learning via self-play. Defeated Go champion Lee Sedol 4–1.
AlphaZero (2017) – same core design, but trained purely through self-play with no human data. Generalized to Go, chess, and shogi.
MuZero (2019) – extended AlphaZero by learning its own model of the environment’s rules and dynamics, still planning with MCTS.
AlphaTensor (2022) – applied AlphaZero-style RL to algorithm discovery, finding faster matrix-multiplication algorithms.
AlphaDev (2023) – used RL to uncover faster sorting and hashing algorithms, now integrated into widely used software libraries.
RL-inspired and adjacent systems
AlphaFold (2020) – deep learning, not technically RL, but its iterative refinement loop echoed trial-and-error principles to predict protein structures with unprecedented accuracy.
AlphaProof (2024) – brought AlphaZero-style self-play to mathematical theorem proving, learning proof strategies and refining them through feedback.
AlphaGeometry (2024, 2025) – combined symbolic reasoning with neural networks and reinforcement-style search to solve Olympiad-level geometry problems, reaching gold-medalist performance in its second version.
AlphaEvolve (2025) – blended reinforcement learning with evolutionary methods to iteratively discover and refine new algorithms, part of DeepMind’s “general-purpose science AI” push.
AlphaDrive (2025) – applied RL together with vision–language models to real-world autonomous driving, integrating perception, planning, and decision-making.
These breakthroughs attracted a lot of attention over the decade and show the progression of RL capabilities over the years. But what about reinforcement learning in Large Language Models?
RL for Language Models
The dominance of LLMs opened up remarkable generative capabilities, but also exposed a challenge: can they predict not just text, but text that’s useful, safe, and aligned with intent? That is where RL entered center stage. Ideas developed for games, control, and simulated environments were applied to human preference alignment.
Reinforcement Learning from Human Feedback (RLHF)
The concept of Reinforcement Learning from Human Feedback (RLHF) became a defining moment for aligning LLMs with human values. The strategy:
Pretrain a large model on internet-scale text (so it can generate fluent language).
Collect human preference data – annotators rank which outputs are better, safer, or more relevant.
Train a reward model on this ranking data to approximate human judgment.
Fine-tune the base model with RL (usually PPO), maximizing the learned reward while keeping the model close to the pretrained distribution.
This pipeline was popularized by OpenAI in the training of InstructGPT (2022) and became the industry’s go-to method. We have a detailed overview of RLHF here.
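Step 4 usually maximizes the learned reward minus a KL-style penalty that keeps the fine-tuned model close to the pretrained one. A toy sketch of that reward shaping (names are placeholders, not any specific library’s API):

```python
import torch

def rlhf_reward(reward_model_score, logprobs_policy, logprobs_ref, kl_coef=0.1):
    # Reward-model score for the generated answer, minus a penalty proportional to
    # how far the fine-tuned policy's log-probabilities drift from the reference
    # (pretrained) model on the same tokens.
    kl_penalty = (logprobs_policy - logprobs_ref).sum()
    return reward_model_score - kl_coef * kl_penalty
```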
The fact that “RL is usually PPO” hints at the next wave of developments. PPO relies on a separate critic network, which adds memory and compute overhead during training. This spurred researchers to explore lighter alternatives – critic-free or baseline-efficient methods like DPO, GRPO, and RLOO – to make RLHF training more efficient and scalable.
From PPO to GRPO
One of the most popular right now is DeepSeek’s Group Relative Policy Optimization (GRPO), which returns to policy-gradient-style thinking.
GRPO can be thought of as a modern REINFORCE-style algorithm, tailored for LLMs. Instead of updating based on a single trajectory and reward, GRPO compares groups of model outputs against each other. The group-relative advantage provides a variance-reduced learning signal, closer in spirit to the “reward minus baseline” update rule of REINFORCE. This approach:
Removes the need for complex critic networks.
Provides a simpler, more scalable recipe than PPO.
Stabilizes training for large-scale reasoning models like DeepSeek-R1.
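The group-relative trick itself is tiny: sample several answers to the same prompt, score them, and judge each one against its group. A simplified sketch (following the mean-and-standard-deviation normalization described in the DeepSeekMath paper):

```python
import torch

def group_relative_advantages(rewards):
    # rewards: tensor of scores for a group of sampled answers to the same prompt.
    # Each answer's advantage is its reward relative to its siblings,
    # which plays the role of "reward minus baseline" in REINFORCE.
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)
```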
In one of our previous episodes we covered the GRPO workflow in full detail, so you can explore it here.

Image Credit: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
However, last weekend GRPO sparked a huge debate in the community, with some expressing doubts that it actually works properly. We also explored why this happened in our last newsletter.
Both PPO and GRPO are part of a huge family of policy optimization techniques that expands almost every week with newcomers. We gathered some of them in our post on Hugging Face:

Other areas of RL developing now are:
Reinforcement Learning with Verifiable Rewards (RLVR): It’s part of the current trend of moving from subjective alignment (humans labeling outputs) to objective alignment, using rewards that can be proven correct or automatically checked, like math or code. However, it’s still a fairly early approach.
Multi-objective RL: It introduces balancing acts like helpfulness vs. harmlessness vs. efficiency, rather than a single scalar reward.
So what about the future?
Concluding thoughts: Why today’s LLMs and RL aren’t enough
From Thorndike’s puzzle-box cats and Bellman’s dynamic programming to PPO, GRPO, and beyond, reinforcement learning has always returned to the same principle: intelligence emerges through trial, feedback, and adaptation. But where to move further?
Andrej Karpathy, a leading voice in deep learning, ex-OpenAI founding member and now building Eureka Labs, remains optimistic about agent-based learning but is cautious about current RL methods. He points out that reward functions – the basis for RLHF and policy gradients – are weak and hackable, 'super sus,' as he put it. They don’t generalize to open-ended reasoning and conversation. He suggests that RLHF can feel more like preference matching than true problem-solving.
Still, Karpathy notes that reinforcement fine-tuning has made models more practical than supervised training alone, and he expects its role to grow, though he questions how far the classic reward-model + policy-gradient framework can go.
Looking ahead, he believes breakthroughs will require new paradigms: interactive environments and experimental ideas like system-prompt learning, where models adapt by rewriting their own context instead of just updating weights.
Richard Sutton’s critique of LLMs in a recent conversation with Dwarkesh Patel also highlights a fundamental divide between supervised prediction and RL. LLMs optimize next-word prediction on static text, but lack explicit states, actions, and rewards, and so can’t adapt through consequences. Sutton contrasts this with RL agents, which learn by interacting with environments. His analogy of “reading every cookbook vs. actually cooking” illustrates the real gap between imitation and grounded understanding. He warns that LLMs as “offline learners” may plateau on human-written data, while RL agents, as “online learners,” can continue to adapt and generalize.
This raises a central research question: will intelligence emerge from scaling static predictors, or from agents that learn through interaction? What about Causal AI and Symbolic AI? And what might the next stage of RL look like as we move further into continual trial-and-error learning?
An absolute golden time for researchers to finally combine it all.
Sources and further reading
Some Studies in Machine Learning Using the Game of Checkers by A.L. Samuel (pdf)
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
Learning to Predict by the Methods of Temporal Differences by Richard S. Sutton (1988)
Temporal-Difference Networks by Richard S. Sutton and Brian Tanner
Q-Learning by Christopher J. C. H. Watkins and Peter Dayan
The games that helped AI evolve (IBM blog)
Reinforcement Learning from Human Feedback by Nathan Lambert
Resources from Turing Post
How did you like it?


