What is Natural Language Reinforcement Learning (NLRL)?

We recently posted about Natural Language Reinforcement Learning (NLRL) on social media, and the post drew a lot of feedback on Twitter, so we decided to expand the discussion of this interesting approach. We’ll cover it in more detail and address the questions our readers raised.

So what is NLRL about and why should you know about it? NLRL is about adapting Reinforcement Learning (RL) concepts to work in a space where the key element is natural language. In NLRL, the core parts of RL like goals, strategies, and evaluation methods are redefined using natural language. Combined with LLMs, NLRL becomes practical and can be implemented either through simple prompts or by tweaking the model’s parameters. Let’s dive into what’s so special and revolutionary in NLRL and why it could be better than traditional RL.

In today’s episode, we will cover:

Why isn't reinforcement learning always enough?
Here comes NLRL
How does NLRL work?
- Redefining key concepts of RL with natural language
- Methods to evaluate language policy
LLMs as a good fit for NLRL
How good is NLRL in practice?
Advantages of NLRL
The main question - limitations
Conclusion
Bonus: Resources to dive deeper

Limitations of Traditional RL

Reinforcement Learning is a way of teaching machines to make decisions by framing problems as mathematical tasks, using a system called the Markov Decision Process (MDP). This method has led to breakthroughs in areas like gaming and robotics, but it has some issues. Traditional RL often struggles because it:

lacks prior knowledge: it doesn’t start with helpful information about the task and requires a lot of trial and error to learn how things work;
is hard to interpret: even advanced RL models like AlphaZero make decisions that are difficult to explain;
has unstable training: RL relies on a single numeric reward for feedback, a thin signal that can make training unstable, especially in real-world tasks where richer feedback, like text or visuals, is available.

Simply put, RL’s “rigid math” lacks the flexibility and interpretability of natural language. RL can help models reason in language (as with Chain-of-Thought), but a key puzzle is how to let the model judge its own progress in natural language. So the main questions are:

How do you measure if the model is on the right path in its reasoning, using only words?
How can this evaluation be done unsupervised, meaning without human-provided labels or examples?

Natural Language Reinforcement Learning Explained

To address these issues, a group of researchers from University College London, Shanghai Jiao Tong University, Brown University, National University of Singapore, University of Bristol, and University of Surrey (what a varied list!) proposed an approach called Natural Language Reinforcement Learning (NLRL). As we’ve already mentioned, instead of relying purely on math, NLRL draws inspiration from how humans use language to understand tasks, plan strategies, and explain their reasoning.


The idea is to take the core concepts of RL – such as strategies (language policy), objectives, and evaluations – and reinterpret them in a natural language setting. This allows LLMs to learn directly from interactions with their environment, without needing pre-existing human data or labeled data, and to provide intuitive, language-based explanations for their decisions and richer feedback.

Why did the researchers decide to work specifically with RL?

As one of the researchers, Xidong Feng, wrote on Twitter, he was driven by the idea of making LLMs truly understand games through language: not just the rules, but also strategies and evaluation. His previous attempt, the ChessGPT model, showed that relying on human gameplay data available online is too costly, messy (because the data is inconsistent), and not scalable to new games and tasks. In other words, human data isn’t a practical foundation here.

So he turned to RL, which has already shown good results when combined with natural language, as seen in CoT. Now, let’s dive deeper into the natural language “upgrade” of reinforcement learning.

How does NLRL work?

Redefining key concepts of RL with natural language

Text-based MDP (Markov Decision Process)

In traditional RL, a Markov Decision Process (MDP) represents the decision-making environment using states, actions, transitions, and rewards. NLRL redefines these elements in terms of text:

State (S): Represents the current situation or environment, described in natural language, like “You are at the cross.”
Action (A): Represented as a language-based decision, such as “Go straight” or “Move up.”
Feedback (transitions and rewards): Provided as a textual description rather than a numerical value, for example “You reach the goal.”
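
To make the three elements above concrete, here is a minimal sketch of a text-based MDP in Python. The `TextTransition` and `CorridorEnv` names and the three-cell corridor are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextTransition:
    """One step of a text-based MDP: every field except `done` is plain text."""
    state: str       # e.g. "You are at the cross."
    action: str      # e.g. "Go straight"
    feedback: str    # textual description that replaces the numeric reward
    next_state: str
    done: bool


class CorridorEnv:
    """Hypothetical toy environment: a three-cell corridor with the goal in the last cell."""

    def __init__(self) -> None:
        self.position = 0

    def describe(self) -> str:
        return f"You are in cell {self.position} of a three-cell corridor; the goal is in cell 2."

    def step(self, action: str) -> TextTransition:
        state = self.describe()
        # Only "Go straight" moves the agent; any other action leaves it in place.
        if action == "Go straight":
            self.position = min(self.position + 1, 2)
        done = self.position == 2
        feedback = "You reach the goal." if done else "You have not reached the goal yet."
        return TextTransition(state, action, feedback, self.describe(), done)


env = CorridorEnv()
first = env.step("Go straight")
second = env.step("Go straight")
```

Here `first.feedback` and `second.feedback` are sentences rather than numbers, which is the only thing that distinguishes this loop from a conventional environment interface.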

Image Credit: Original paper

Language task instruction

NLRL uses a task instruction written in natural language, such as “reach the goal” or “open the door.” It acts as a reference point for evaluating the agent's performance. The goal is to measure how well the agent’s actions align with the task. This is done by evaluating the trajectory description, which turns the agent's actions into a language summary. The objective is to optimize the agent’s policy so that the trajectory description aligns with the task instruction.

Now, let’s talk about what this policy is.

Language policy

The goal of traditional RL is to learn a policy, which is like a strategy that tells the agent the best action to take in a given state. The optimal policy maximizes the expected long-term reward.

Instead of directly choosing an action based on probabilities, NLRL integrates a Chain-of-Thought process, so a language policy includes:

Strategic reasoning, logical steps, and planning, written in natural language.
Actions generated in two steps:
- Generation of the reasoning process.
- Generation of the action based on the reasoning.
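
The two-step generation above could be sketched as follows. This is only an illustration: the `LLM` type, the prompt wording, and `stub_llm` (a canned stand-in for a real model) are assumptions, not the authors' implementation.

```python
from typing import Callable

# Hypothetical interface: a function that takes a prompt and returns a completion.
LLM = Callable[[str], str]


def language_policy(llm: LLM, task: str, state: str, actions: list[str]) -> tuple[str, str]:
    """Return (reasoning, action): reasoning is generated first, then the action."""
    # Step 1: generate the reasoning process.
    reasoning = llm(
        f"Task: {task}\nState: {state}\nLegal actions: {', '.join(actions)}\n"
        "Think step by step about which action best serves the task."
    )
    # Step 2: generate the action, conditioned on that reasoning.
    answer = llm(
        f"Task: {task}\nState: {state}\nReasoning: {reasoning}\n"
        f"Reply with exactly one of: {', '.join(actions)}"
    ).strip()
    # Assumed fallback: use the first legal action if the reply is not a legal action.
    action = answer if answer in actions else actions[0]
    return reasoning, action


def stub_llm(prompt: str) -> str:
    """Canned responses so the sketch runs offline; a real model would go here."""
    if "Reply with exactly one of" in prompt:
        return "Go straight"
    return "The goal is ahead, so moving forward shortens the path."


reasoning, action = language_policy(
    stub_llm, "reach the goal", "You are at the cross.", ["Turn left", "Go straight"]
)
```

Keeping the reasoning as a separate return value is what makes the decision inspectable afterwards.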

Language value function

Instead of traditional RL’s state value and state-action value, NLRL introduces language value functions that evaluate states and actions in natural language:

Language State Value: Evaluates how effective a state is for achieving the task.
Language State-Action Value: Evaluates a state-action pair’s effectiveness.

These value functions provide more interpretable and detailed feedback compared to traditional numerical scores. For example, they can include logical reasoning behind decisions, predictions about future outcomes, and comparisons of different actions.

Language Bellman Equation

The Bellman equation is a way to calculate the value of a state based on its immediate reward and the value of future states. Because it is recursive, values can be computed step by step.

NLRL adapts this idea into the language Bellman equation, which uses:

Language descriptions of intermediate transitions.
Information aggregation functions (G1 and G2):
- G1: Combines feedback across different actions and transitions (like averaging in traditional RL).
- G2: Combines immediate feedback with the future evaluation, mimicking the summation step in the original Bellman equation.
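
For reference, the correspondence can be written side by side. The first line is the standard Bellman expectation equation for a state value; the second is a language analog built from the G1 and G2 aggregators described above. The symbols d (the language description of a transition) and V^L (the language value) are notation assumed here for illustration, and the paper's exact notation may differ.

```latex
% Traditional Bellman expectation equation (r: reward, \gamma: discount factor, P: transition dynamics)
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi,\; s_{t+1} \sim P}
  \big[\, r(s_t, a_t) + \gamma\, V^{\pi}(s_{t+1}) \,\big]

% Language analog: G_1 plays the role of the expectation, G_2 the role of the sum
V^{L}_{\pi}(s_t) = G_1\Big( \big\{\, G_2\big( d(a_t, s_{t+1} \mid s_t),\; V^{L}_{\pi}(s_{t+1}) \big)
  \,\big\}_{a_t \sim \pi,\; s_{t+1} \sim P} \Big)
```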

Methods to evaluate language policy

NLRL is all about adapting RL methods to work in the natural language domain. So, to measure how well a language policy (or strategy) performs, the researchers redefined standard RL evaluation methods. Here we have:

Language Monte Carlo (MC) Estimate:

Monte Carlo (MC) methods are a class of algorithms that rely on repeated random sampling to estimate numerical results, often used in reinforcement learning (RL) to evaluate outcomes of complete trajectories and provide direct, unbiased estimates of returns. These methods explore the space of possible outcomes using randomness, making them versatile and effective.

In NLRL, MC methods carry over to text: starting from a state, the agent's current policy is used to simulate multiple complete text-based trajectories, or "rollouts." Each trajectory is a sequence of actions and outcomes expressed in natural language. In traditional RL the numerical returns can simply be averaged; text cannot, so NLRL uses an aggregator function G1 to combine the natural language descriptions into a single summary evaluation.

Strength: MC estimates avoid "bias" since they use complete trajectories to make direct evaluations.
Weakness: They can have high variance because future steps can vary a lot, making it harder for G1 to combine information effectively.
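
A minimal sketch of the language MC estimate described above, assuming the rollout and the aggregator are both supplied as functions. In practice G1 would be an LLM prompt; here `stub_g1` is a hypothetical keyword counter and `make_stub_rollout` returns canned trajectory descriptions so the example runs offline.

```python
import itertools
from typing import Callable

Rollout = Callable[[str], str]           # state -> text description of one complete trajectory
Aggregator = Callable[[list[str]], str]  # G1: many descriptions -> one language evaluation


def language_mc_estimate(state: str, rollout: Rollout, g1: Aggregator, n_rollouts: int = 4) -> str:
    """Sample complete trajectories from `state` and aggregate their descriptions with G1."""
    trajectories = [rollout(state) for _ in range(n_rollouts)]
    return g1(trajectories)


def make_stub_rollout(outcomes: list[str]) -> Rollout:
    """Hypothetical rollout that cycles through fixed outcomes instead of sampling a policy."""
    cycle = itertools.cycle(outcomes)

    def rollout(state: str) -> str:
        return f"From '{state}': {next(cycle)}"

    return rollout


def stub_g1(descriptions: list[str]) -> str:
    """Stand-in for an LLM aggregator: summarizes how many rollouts reach the goal."""
    wins = sum("reach the goal" in d for d in descriptions)
    return f"{wins} of {len(descriptions)} rollouts reach the goal."


rollout = make_stub_rollout(["you reach the goal.", "you hit a wall."])
mc_value = language_mc_estimate("You are at the cross.", rollout, stub_g1, n_rollouts=4)
```

The more the rollouts differ from each other, the harder the aggregation becomes, which is the variance problem noted above.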

Image Credit: Original paper

Language Temporal Difference (TD) Estimate:

Instead of using full trajectories, TD looks just one step ahead. It uses the language Bellman equation to break down the value of a state into two parts: the immediate feedback and the value of the next state. In other words, G2 combines each one-step language description with the evaluation of the next state, and G1 aggregates the results across the sampled steps.

Strength: TD reduces variance by focusing on short-term variations rather than full trajectories. It’s also a fast method.
Weakness: It can introduce "bias" because the next state's value is only an estimate, which might not be accurate.
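
The one-step backup can be sketched in the same style. Everything below is an assumption made for illustration: `sample_step`, `value_fn`, and the string-joining `stub_g1` and `stub_g2` stand in for the environment, a learned language value function, and LLM-based aggregators respectively.

```python
import itertools
from typing import Callable

Step = Callable[[str], tuple[str, str]]  # state -> (text description of one step, next state)
ValueFn = Callable[[str], str]           # state -> current language evaluation of that state


def language_td_estimate(
    state: str,
    sample_step: Step,
    value_fn: ValueFn,
    g1: Callable[[list[str]], str],
    g2: Callable[[str, str], str],
    n_samples: int = 3,
) -> str:
    """One-step language TD: G2 merges each step with the next state's value, G1 aggregates."""
    backed_up = []
    for _ in range(n_samples):
        description, next_state = sample_step(state)
        backed_up.append(g2(description, value_fn(next_state)))
    return g1(backed_up)


def stub_g2(description: str, next_value: str) -> str:
    """Stand-in for an LLM that combines immediate feedback with a future evaluation."""
    return f"{description} Then: {next_value}"


def stub_g1(evaluations: list[str]) -> str:
    """Stand-in for an LLM that summarizes several one-step evaluations."""
    return " | ".join(evaluations)


# Hypothetical canned transitions and value estimates for two next states.
steps = itertools.cycle([("You go straight.", "corridor"), ("You turn left.", "dead end")])
values = {"corridor": "The goal is one step away.", "dead end": "No path to the goal."}

td_value = language_td_estimate(
    "cross", lambda s: next(steps), values.__getitem__, stub_g1, stub_g2, n_samples=2
)
```

Note that the result depends on `values`, which is itself only an estimate; that dependence is the source of the bias mentioned above.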

Image Credit: Original paper

LLMs in NLRL

To make all the NLRL concepts work, we need a key component, LLMs, because they can understand, process, and generate language. Within the NLRL framework, LLMs take on the roles of decision-makers, evaluators, and trainers:

LLMs as the language policy: LLMs act as the decision-making agent, generating actions based on natural language reasoning (like chain-of-thought processes). For example, "Move forward to reduce distance to the goal."
LLMs as the language value function: Similar to how traditional RL uses numerical value functions to guide decisions, LLMs generate insightful, text-based evaluations of states and actions, like "This move opens up future possibilities."
LLMs as language Monte Carlo (MC) and Temporal Difference (TD) operators: They combine immediate feedback, like an action’s result, with future evaluations to estimate the value of a state or action.
Distilling MC and TD estimates: LLMs learn to approximate language-based value functions through training on unsupervised interaction data.
LLMs as policy improvement operators: LLMs analyze different actions and their evaluations to select the best action for a given state.
Training the language policy: Instead of using unstable RL techniques like policy gradients, NLRL uses supervised training for stability and precision.
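
As a rough sketch of how the last two roles above fit together, the snippet below evaluates each candidate action in language, lets an improvement operator pick one, and stores the result as a prompt/target pair that could later be used for supervised training. The function names, the record layout, and both stubs are hypothetical; a real system would call an LLM in their place.

```python
from typing import Callable

QValue = Callable[[str, str], str]             # (state, action) -> language evaluation
Improve = Callable[[str, dict[str, str]], str]  # (state, evaluations) -> chosen action


def improve_policy(state: str, actions: list[str], q_value: QValue, improve: Improve) -> dict:
    """Evaluate every action in language, pick one, and return a supervised training record."""
    evaluations = {action: q_value(state, action) for action in actions}
    chosen = improve(state, evaluations)
    # Assumed record layout: the policy is later fine-tuned to output `target` given `prompt`.
    return {"prompt": state, "target": chosen, "evaluations": evaluations}


def stub_q_value(state: str, action: str) -> str:
    """Stand-in for an LLM-based language state-action value function."""
    notes = {
        "Go straight": "This move leads toward the goal.",
        "Turn left": "This move leads to a dead end.",
    }
    return notes[action]


def stub_improve(state: str, evaluations: dict[str, str]) -> str:
    """Stand-in for an LLM that compares evaluations; picks the first favorable one."""
    for action, text in evaluations.items():
        if "toward the goal" in text:
            return action
    return next(iter(evaluations))


record = improve_policy(
    "You are at the cross.", ["Turn left", "Go straight"], stub_q_value, stub_improve
)
```

Because the output is an ordinary prompt/target pair, the policy update reduces to standard supervised fine-tuning rather than a policy-gradient step.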

Image Credit: Original paper

NLRL uses LLMs instead of traditional RL mechanisms to make decisions, evaluate them in a feedback loop, improve the policy, plan and critique via prompts, and explain their choices. Unlike traditional RL models that produce numbers, LLMs generate human-readable explanations, making their decisions understandable. They can also be fine-tuned or prompted to fit specific tasks, which makes them flexible.

NLRL Performance Results

To measure how the NLRL approach improves performance, the researchers tested it in the following three scenarios:

Maze game experiments with T and Medium mazes to evaluate how language Temporal Difference (TD) and policy improvement enhance decision-making in mazes.
Breakthrough board game to train a language value function to evaluate states in the 5x5 Breakthrough board game.
Tic-Tac-Toe game with natural language actor-critic to validate the complete natural language actor-critic framework in a simpler game environment.

Here are the results of these experiments:

Maze game experiments:
- Language TD significantly outperformed basic prompt-based policies.
- More variations and longer look-ahead steps improved performance but had diminishing returns.
- Language policy alone achieved around -27 reward (poor performance), while language TD with 6 variations and 3 look-ahead steps got -12 reward (significantly better).
  Image Credit: Original paper
5x5 Breakthrough board game:
- Baseline comparisons: Prompt-based evaluations with pre-trained LLMs like GPT-4 performed poorly with accuracy ~61%.
- Language TD impact: The trained language value function achieved over 85% accuracy.
- Varying rollout steps and dataset sizes showed scalability and robustness but highlighted risks of overfitting with too much look-ahead.
Image Credit: Original paper

Tic-Tac-Toe game with natural language actor-critic:
- NLRL achieved higher win rates than the baselines against both deterministic and random opponents. It even outperformed PPO (Proximal Policy Optimization), a traditional RL algorithm.
Image Credit: Original paper
- Increasing training epochs improved stability.
- Using larger buffers to retain past experiences mitigated forgetting.
- More rollout trajectories per iteration led to robust learning curves.

These experiments highlight the potential of NLRL to combine the interpretability of natural language with the rigor of reinforcement learning.

As usual, let’s now summarize the benefits of NLRL.

Advantages of NLRL

We’ve already mentioned many good aspects, but here’s a list of NLRL’s key advantages:

Use of prior knowledge: By tapping into the vast information stored in large language models (LLMs), NLRL can quickly adapt to new tasks.
Better interpretability: NLRL uses natural language to make decisions and provide explanations of the agent’s reasoning that are easier to understand. All the outputs are human-readable.
Richer feedback: It can incorporate textual feedback, making training more stable and effective.
Synergy with LLMs: By leveraging the strengths of LLMs (reasoning, aggregation, and language generation), NLRL creates systems that are both effective and interpretable in language-based environments.
Better planning and critique: NLRL can enhance a language model’s ability to plan and critique its own actions using simple instructions.
Evaluations: It trains models to assess situations through a natural language value function, offering reliable evaluations.
Clear reasoning: A training pipeline similar to the traditional “actor-critic” approach allows the system to learn entirely from text-based feedback, producing clear and interpretable reasoning.
Scalability: The NLRL framework can handle diverse environments, from mazes to board games, and showed potential for generalization.

However, most of the questions about NLRL concerned its limitations. So let’s look at them more closely, as they may not be obvious.

Current Limitations of NLRL

The first serious limitation of NLRL is its bounded action space.

Game experiments demonstrate its performance in discrete spaces with actions like "move up," "go straight," or choosing one of nine positions on the grid. However, continuous action spaces or high-dimensional tasks, such as robotics, remain unexplored. Therefore, we don’t know if NLRL could work on such tasks.

The second significant limitation is the high computational cost due to LLMs, which require substantial resources for decision-making. As a result, NLRL is restricted to small-scale applications.

A further limitation that comes with LLMs is that NLRL is less time-efficient than traditional RL methods that use smaller networks.

Conclusion

Reimagining classical machine learning concepts could be a promising path to more efficient systems. NLRL marks the beginning of a new chapter, where RL principles are integrated with the richness of natural language. Some may ask why we need natural language interpretations if everything already works well enough. But why not? AI is changing rapidly, and in some cases interpretability and a human-friendly approach may be crucial. NLRL offers both by inheriting the flexibility of natural language.

Bonus: Resources to dive deeper
