Turing Post

Topic 46: RLHF variations: DPO, RRHF, RLAIF

We discuss three methods for aligning LLMs with human preferences that offer alternatives to the widely used RLHF

Alyona Vert.
June 25, 2025

Our readers showed a huge interest in the human-in-the-loop concept, and the next important topic is how to align models with human preferences. One of the most popular training techniques here is Reinforcement Learning from Human Feedback (RLHF), which is now explicitly used in the most advanced Reasoning Models. Traditional RLHF relies on complex reinforcement learning (RL) and a learned reward function. However, the whole pipeline can sometimes be unstable or hard to tune.

In this huge field of human alignment optimization, we can’t rely on just one method as a one-size-fits-all solution. That’s why the family of methods for trust calibration has many members. But today we’ll focus on only three of them – very interesting variations of RLHF: Direct Preference Optimization (DPO), Reward-Rank Hindsight Fine-Tuning (RRHF), and RL from AI Feedback (RLAIF). Each of them brings a specific feature to make alignment optimization more efficient – some avoid using RL, some skip reward models, and some avoid both or introduce a different approach to what human alignment is.

We’re here to clarify what these methods can offer, their benefits and limitations, and which one to choose in different cases. So let’s uncover what DPO, RRHF, and RLAIF (including d-RLAIF) are all about (it will also help with navigating all these acronyms).

By the way, if you’re surprised we didn’t include GRPO – the alignment technique that shone in DeepSeek’s models – it’s just that we already covered it here.

In today’s episode, we will cover:

A little bit more about RLHF
Direct Preference Optimization (DPO)
- How does DPO work step-by-step?
- Why DPO wins over RLHF?
- DPO’s current limitations
RRHF: Reward-Rank Hindsight Fine-Tuning
- RRHF workflow
- Advantages of RRHF
- Not without limitations
Reinforcement Learning from AI Feedback (RLAIF): Replacing Humans with AI
- How to actually replace human feedback with AI?
- Does RLAIF win over RLHF?
- RLAIF’s issues
Conclusion
Sources and further reading

A little bit more about RLHF

Introduced in 2017 and developed more actively between 2020 and 2022, RLHF, as proposed by OpenAI, may have seemed like a minor technical improvement, but it fundamentally transformed how AI learns to generate responses. Before RLHF, LMs were kind of stuck. They were trained on everything humans had written – the good, the bad, and even the utterly unhelpful. They became very good at mimicking human text patterns but bad at understanding what humans actually need from their responses. RLHF was a turning point – the moment when AI learned to truly serve humanity.

RLHF helped models begin producing responses that better matched what people actually wanted. In other words, human preferences started to be used to train a reward model. Instead of telling LLMs to "write like humans," researchers from OpenAI asked "which response do humans prefer?"

They created a smartly working loop, with 3 steps:

Collecting human preferences: People compared AI-generated content, like short video clips of an agent’s behavior or text summaries of posts, and simply picked the better one. These comparisons are used as labels to train a reward function.
Training a reward model: A neural network is trained to predict human preferences by learning a reward function that explains the human’s choices.
Reinforcement learning (RL): The original model is trained to optimize for human approval, not just to imitate humans and the training data. The trained reward model guides the original model through standard RL algorithms, like policy gradient methods, as if the reward came from the environment – even though it comes from human preferences.

Image Credit: “Deep Reinforcement Learning from Human Preferences” paper

Repeating this loop makes it possible to gather more preferences and retrain, gradually improving the model.
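To make the first two steps of the loop more concrete, here is a minimal sketch of how pairwise comparisons can be turned into a training signal for a reward model. It assumes a Bradley-Terry style preference loss, which is a common choice for this step, and a toy linear reward function instead of a neural network. The feature names, the data, and the helper functions are all hypothetical and only serve as an illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(weights, features):
    # Toy linear reward model (assumption): r(x) = w . x
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    # Bradley-Terry negative log-likelihood of the human's choice:
    # -log sigmoid(r(chosen) - r(rejected))
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    # Plain stochastic gradient descent on the pairwise loss
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            p = sigmoid(margin)  # model's probability that "chosen" wins
            for i in range(n_features):
                # Gradient step: push the chosen response's reward up
                weights[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return weights

# Hypothetical comparisons: each response is described by two made-up
# features, (helpfulness, verbosity); the first item is the human's pick.
pairs = [
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.8, 0.5], [0.3, 0.5]),
    ([0.7, 0.1], [0.2, 0.9]),
]
weights = train_reward_model(pairs, n_features=2)

for chosen, rejected in pairs:
    print(reward(weights, chosen) > reward(weights, rejected))
```

After training, the learned reward ranks every chosen response above its rejected counterpart. In real RLHF the reward model is a neural network that scores full prompts and responses, but the loss has the same pairwise shape.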

RLHF fundamentally changed the relationship between humans and machines, replacing hardcoded rewards with learned, human-aligned objectives.

The most widespread RL algorithm used in the RLHF approach is Proximal Policy Optimization (PPO). Here’s a quick reminder of what PPO is, because we need it for some later comparisons: PPO’s idea is to improve the policy steadily without collapsing. It uses feedback from a reward model, which scores the output’s quality, and a reference model, which is used to monitor deviations via KL (Kullback-Leibler) divergence. A separate value model predicts the expected reward. PPO calculates the advantage, which is actual vs. expected performance, and then updates the policy using a clipped objective, preventing large shifts in behavior.

Image Credit: Alyona Vert, Turing Post
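The PPO ingredients described above can be sketched for a single sample in a few lines. This is a simplified illustration, not a full training loop: the function names are hypothetical, the advantage is taken as a plain difference between the actual and the predicted reward (practical implementations usually use smoother estimators), and the KL term is applied as a per-sample penalty on the reward, which is one common formulation in RLHF pipelines.

```python
import math

def kl_penalized_reward(reward_score, logp_policy, logp_reference, beta=0.1):
    # Reward model score minus a penalty for drifting from the reference
    # model; (logp_policy - logp_reference) is a per-sample KL estimate.
    return reward_score - beta * (logp_policy - logp_reference)

def advantage(actual_reward, expected_reward):
    # Actual vs. expected performance; expected_reward would come
    # from the value model.
    return actual_reward - expected_reward

def clipped_objective(logp_new, logp_old, adv, clip_eps=0.2):
    # Probability ratio between the updated policy and the old policy
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - eps, 1 + eps]
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive for overly large updates
    return min(ratio * adv, clipped_ratio * adv)

# Hypothetical numbers for one generated response
shaped = kl_penalized_reward(reward_score=1.0, logp_policy=-2.0, logp_reference=-2.5)
adv = advantage(actual_reward=shaped, expected_reward=0.5)
objective = clipped_objective(logp_new=math.log(1.5), logp_old=0.0, adv=adv)
print(shaped, adv, objective)
```

In this example the new policy makes the response 1.5 times more likely than the old one, but with a positive advantage the objective only gives credit up to a ratio of 1.2. This is how the clipped objective prevents large shifts in behavior.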

Over the past few years, RLHF has surged from niche curiosity to the steel spine underpinning today’s most advanced Reasoning Language Models (RLMs). But it’s not the one and only option for effective training. Many other techniques were created as better variants or alternatives to RLHF.

DPO, RRHF and RLAIF – what is behind these mysterious acronyms? Let’s see.

Direct Preference Optimization (DPO)

The main idea of Direct Preference Optimization (DPO) is to introduce a much easier alternative to RLHF. Instead of going through all the RL steps, Stanford University and CZ Biohub researchers found a way to:
