AI 101: Reinforcement Learning: The Ultimate Guide to Past, Present, and Future
From the early trial-and-error concepts to today’s breakthroughs with RLHF, PPO, and GRPO, and where to go next, according to Andrej Karpathy and Richard Sutton.
This is part of our AI 101 series – and what could be more 101 than reinforcement learning? It’s everywhere in the conversation right now, so we put together a clear guide on what RL is and where it’s headed. Enjoy.
Reinforcement Learning (RL) – the idea of agents learning through trial and error – was shaping real-world systems long before it became the backbone of modern AI. Richard Sutton, often called the father of RL, recently gave an interview that pushed us to think bigger: what could RL unlock for the global AI industry if we truly commit to trial-and-error learning for agents?
But to analyze the future, we need to stand firmly on the foundation of the past and present.
So today we’ll walk you through RL’s milestones from the very beginning: the mathematical foundations and the first approaches – Temporal-Difference (TD) learning, actor-critic methods, and Monte Carlo techniques such as REINFORCE (widely revisited today because of its similarity to modern GRPO). From there we move to the Deep RL revolution of the 2010s and RL’s widespread use today through Reinforcement Learning from Human Feedback (RLHF) and the main policy-optimization approaches, PPO and GRPO.
As for the future of RL, we’ll look at how Andrej Karpathy and Richard Sutton see it.
In today’s episode, we will cover:
Earliest Foundations of RL: Psychology and Math
Dynamic Programming and MDP
Samuel’s Checkers program and MENACE machine
Temporal-Difference (TD) learning
Monte Carlo methods: learning from complete returns
REINFORCE: The policy gradient method everyone is revisiting now
Actor-critic methods
Deep Reinforcement Learning Revolution in 2010s
Google DeepMind’s Alpha milestones: What do they mean for RL?
RL for LLMs
Reinforcement Learning from Human Feedback (RLHF)
From PPO to GRPO
Conclusion: Why today’s LLMs and RL aren’t enough
Sources and further reading
Earliest Foundations of RL: Psychology and Math
Reinforcement learning (RL) grew out of two separate areas that later merged into one big field that researchers keep enriching today:
Trial-and-error learning (the psychology side), inspired by two main lines of work:
Edward Thorndike’s law of effect (1898), which says: "Responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation."
Operant conditioning (B.F. Skinner, 1930s–1950s): Building on Thorndike, Skinner showed how voluntary behaviors are shaped by their consequences. Actions followed by rewards (reinforcements) become more likely to be repeated, while those followed by negative outcomes (punishments) tend to decrease.
Image Credit: Wikipedia
Thorndike showed this with his puzzle-box cats learning to escape through trial and error, while Skinner advanced this with his “Skinner Box,” using controlled experiments with animals to study how different reinforcement schedules affect behavior.
Optimal control (the math and engineering side), focused on how to control systems in the best possible way, like guiding a rocket or managing resources.
The second area laid the groundwork for many early and modern RL methods. Let’s break down where it all began.
Dynamic Programming and MDP
To start its long journey, RL needed a strong mathematical and computational foundation, which appeared in the 1950s with Richard Bellman’s work on Dynamic Programming and the Bellman equations for optimal control. This provided the “principle of optimality” – a way to compute optimal policies by breaking problems into a sequence of subproblems.
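For readers who want the math, the principle of optimality can be written as the Bellman optimality equation. One caveat on notation: the symbols below (value V, reward r, transition probabilities P, discount factor γ) are the modern MDP shorthand introduced just after this paragraph, not Bellman’s original notation.

$$
V^{*}(s) = \max_{a}\Big[\, r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s') \,\Big]
$$

In words: the best achievable value of a state is the best immediate reward plus the discounted best value of wherever that action tends to lead.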
When combined with the concept of Markov chains (introduced by Russian mathematician Andrey Markov in the early 1900s), researchers later developed the framework of Markov Decision Processes (MDPs). An MDP captures decision-making in uncertain situations:
the Markov property means the next state depends only on the current one
while the “decision process” means that actions influence what happens next.

Image Credit: Markov decision process (MDP), Wikipedia
In RL, the MDP is the go-to framework for modeling how an agent interacts with its environment. The setup boils down to three things:
States (where the system was),
Actions (what it did), and
Rewards (what it got as feedback, ending up in a new state).

Image Credit: Adaptive RL-based Routing Protocol for Wireless Multihop Networks
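To make the states–actions–rewards setup concrete, here is a minimal sketch of a made-up two-state MDP in Python. Every name and number below is invented for illustration, not taken from any particular system.

```python
import random

# A tiny, invented MDP with two states and two actions.
# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "cold": {
        "wait": [(1.0, "cold", 0.0)],
        "heat": [(0.8, "warm", 1.0), (0.2, "cold", 0.0)],
    },
    "warm": {
        "wait": [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
        "heat": [(1.0, "warm", 0.5)],
    },
}

def step(state, action):
    """Sample (next_state, reward) from the MDP's dynamics."""
    outcomes = transitions[state][action]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

# One short rollout with a random policy: state -> action -> reward -> new state.
state = "cold"
for t in range(5):
    action = random.choice(list(transitions[state]))
    next_state, reward = step(state, action)
    print(t, state, action, reward, next_state)
    state = next_state
```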
Here we can form a definition: reinforcement learning is a branch of machine learning that focuses on teaching models how to make decisions by trial and error. The goal is to find a policy (a way of choosing actions) that works well in an environment where the rules are not known in advance, and the agent has nothing to rely on but its own experience. Everything is uncertain, and the challenge is to learn the best policy in as few trials as possible while maximizing rewards.
This is how we understand the core idea of RL today. Do you see it? It mixes trial-and-error psychology with optimal-control math, creating the foundation for the learning process.
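And here is a minimal, hypothetical sketch of what “learning a policy by trial and error” can look like in code: the agent keeps a running average of the reward it has observed for each state–action pair, mostly picks the best-looking action, and still explores occasionally. The environment, action names, and numbers are all invented, and the sketch deliberately ignores long-term returns, which the value-based methods discussed later handle properly.

```python
import random
from collections import defaultdict

def step(state, action):
    """A made-up environment: the 'good' action usually pays off more."""
    reward = 1.0 if action == "good" else 0.2
    reward += random.uniform(-0.1, 0.1)      # a little noise, so nothing is certain
    next_state = random.choice(["A", "B"])   # toy state transition
    return next_state, reward

value = defaultdict(float)   # running average reward for each (state, action)
counts = defaultdict(int)
actions = ["good", "bad"]
epsilon = 0.1                # how often the agent explores a random action

state = "A"
for _ in range(1000):
    # Trial: mostly exploit the best-looking action, sometimes explore.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: value[(state, a)])
    next_state, reward = step(state, action)
    # Error correction: nudge the running average toward what was actually observed.
    counts[(state, action)] += 1
    value[(state, action)] += (reward - value[(state, action)]) / counts[(state, action)]
    state = next_state

print({k: round(v, 2) for k, v in value.items()})   # "good" should end up rated higher
```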
What were the early RL algorithms that first followed this concept?
Samuel’s Checkers program and MENACE machine
Arthur Samuel’s checkers program (1950–60s) was one of the first true demonstrations of machine learning, and also of RL. Samuel introduced search techniques like alpha-beta pruning to evaluate promising positions in checkers and gave the program the ability to learn from experience.
Alpha = the best score the maximizing player (say, you) is guaranteed so far.
Beta = the best score the minimizing player (the opponent) is guaranteed so far.
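As a rough sketch of the idea (not Samuel’s actual code, and written for a generic game tree rather than checkers), minimax search with alpha-beta pruning looks roughly like this:

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Minimax with alpha-beta pruning on a generic game tree.

    `children(node)` and `evaluate(node)` are placeholders you supply
    for whatever game you are searching (checkers, tic-tac-toe, ...).
    """
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        best = float("-inf")
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False, children, evaluate))
            alpha = max(alpha, best)
            if alpha >= beta:      # the opponent would never allow this line: prune
                break
        return best
    else:
        best = float("inf")
        for child in kids:
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True, children, evaluate))
            beta = min(beta, best)
            if alpha >= beta:      # we already have a better option elsewhere: prune
                break
        return best

# Tiny made-up tree: dict of node -> children, leaves scored by hand.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
scores = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}
value = alphabeta("root", 2, float("-inf"), float("inf"), True,
                  children=lambda n: tree.get(n, []),
                  evaluate=lambda n: scores.get(n, 0))
print(value)   # prints 3: branch "a" wins, and leaf b2 is pruned without being evaluated
```

The pruning is the point: whole branches that cannot change the final decision are skipped, which is what let Samuel search deeper with the hardware of his time.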
By recording board positions and whether they led to wins or losses, the program updated its evaluations over time, steadily improving its play through thousands of self-play games. Samuel even coined the term machine learning in 1959 to describe this process, and in 1962 his program famously defeated a checkers master.

Image Credit: “Some Studies in Machine Learning Using the Game of Checkers” by A.L. Samuel
Another example comes from 1961. AI researcher Donald Michie built a mechanical “computer” out of 304 matchboxes to play tic-tac-toe (also called noughts and crosses). This machine was known as the Matchbox Educable Noughts and Crosses Engine (MENACE).

Image Credit: Menace: the Machine Educable Noughts And Crosses Engine by Oliver Child, chalkdust magazine
To make a move, the operator picked the matchbox for the current board state. Inside were colored beads, each marking a possible move. The box was shaken, and the bead that rolled out chose MENACE’s play. The bead’s color showed where to place MENACE’s “O.” The game went on like this, box by box, until it ended.
After each game, MENACE’s beads were updated: winning moves got extra beads, making them more likely, while losing moves had beads removed, making them less likely. This simple feedback loop was an early demonstration of RL in action.
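Here is a hypothetical re-creation of MENACE’s bead mechanics in Python. The board encoding, initial bead counts, and reward sizes are invented for illustration; the original, of course, used physical matchboxes and beads.

```python
import random
from collections import Counter

# One "matchbox" per board state: a bag of beads, one colour per legal move.
# Board states are encoded here as plain strings, e.g. "X.O|.X.|..." (invented encoding).
matchboxes = {}

def choose_move(board, legal_moves):
    """Shake the box for this board and let a randomly drawn bead pick the move."""
    box = matchboxes.setdefault(board, Counter({m: 3 for m in legal_moves}))
    moves, beads = zip(*box.items())
    return random.choices(moves, weights=beads, k=1)[0]

def reinforce(game_history, won):
    """After the game, add beads to winning moves and remove them from losing ones."""
    for board, move in game_history:
        box = matchboxes[board]
        if won:
            box[move] += 3                      # reward: this move becomes more likely
        else:
            box[move] = max(1, box[move] - 1)   # punishment, but never empty the box here

# Usage sketch: record (board, move) pairs during a game, then reinforce the whole game.
history = []
board = "...|...|..."                           # empty board in the invented encoding
move = choose_move(board, ["top-left", "centre", "bottom-right"])
history.append((board, move))
reinforce(history, won=True)
print(matchboxes[board])                        # the chosen move now has extra beads
```

The key point sits in `reinforce`: a win adds beads for every move played, a loss removes them, so good moves become progressively more likely to be drawn.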
However, the emergence of RL as an independent field happened a little bit later.
The birth of reinforcement learning
Richard Sutton, who is often called one of the fathers of reinforcement learning, together with Andrew Barto formalized the field in their famous book “Reinforcement Learning: An Introduction”, first published in 1998.
But before that, a huge amount of work went into creating an entirely new field with its own methods and ideas. The moment of RL’s birth came when three main threads were finally merged:
trial-and-error psychology,
optimal control, and
temporal-difference methods.
And there is a lot to say about the last one.
Temporal-Difference Learning