Our readers showed a huge interest in the human-in-the-loop concept, and the next important topic is how to align models with human preferences. One of the most popular training techniques here is Reinforcement Learning from Human Feedback (RLHF), which is now explicitly used in the most advanced Reasoning Models. We first covered RLHF and PPO in action when Llama 2 was released, in our weekly digest FOD#12: Meta's Llama 2 Training Process. We also covered RLHF applied in a released model (Alfred-40B by LightOn, trained on Falcon RLHF) in our weekly digest FOD#14: Does AI Alphabet Also Start with A?

Traditional RLHF relies on complex reinforcement learning (RL) and a learned reward function. However, this entire pipeline can be unstable or hard to tune.

In this huge field of human alignment optimization, we can’t rely on just one method as a one-size-fits-all solution. That’s why the family of methods for trust calibration has many members. But today we’ll focus only on three of them – very interesting variations of RLHF: Direct Preference Optimization (DPO), Reward-Rank Hindsight Fine-Tuning (RRHF), and RL from AI Feedback (RLAIF). Each of them brings a specific feature to make alignment optimization more efficient – some avoid using RL, some skip reward models, some avoid both or introduce a different approach to what human alignment is.

We’re here to clarify what these methods can offer, their benefits and limitations, and what’s better to choose in different cases. So let’s uncover what DPO, RRHF, and RLAIF (including d-RLAIF) are all about (it will also help with navigating around all these acronyms).

By the way, if you’re surprised we didn’t include GRPO – the alignment technique that shone in DeepSeek’s models – it’s just that we already covered it here.

In today’s episode, we will cover:

  • A little bit more about RLHF

  • Direct Preference Optimization (DPO)

    • How does DPO work step-by-step?

    • Why does DPO win over RLHF?

    • DPO’s current limitations

  • RRHF: Reward-Rank Hindsight Fine-Tuning

    • RRHF workflow

    • Advantages of RRHF

    • Not without limitations

  • Reinforcement Learning from AI Feedback (RLAIF): Replacing Humans with AI

    • How to actually replace human feedback with AI?

    • Does RLAIF win over RLHF?

    • RLAIF’s issues

  • Conclusion

  • Sources and further reading

A little bit more about RLHF

RLHF was proposed by OpenAI in 2017 and developed more actively between 2020 and 2022. It may have seemed like a minor technical improvement, but it fundamentally transformed how AI learns to generate responses. Before RLHF, LMs were kind of stuck. They were trained on everything humans had written – the good, the bad, and even the utterly unhelpful. They became super successful at mimicking human text patterns but really bad at understanding what humans actually need from their responses. RLHF was a turning point – the moment when AI learned to truly serve humanity.

RLHF helped models begin producing responses that better matched what people actually wanted. In other words, human preferences started to be used to train a reward model. Instead of telling LLMs "write like humans," researchers from OpenAI asked "which response do humans prefer?"

They created a smartly working loop, with 3 steps:

  • Collecting human preferences: People compared AI-generated content, like short video clips of an agent's behavior or text summaries of posts, and simply picked the better one. These comparisons are used as labels to train a reward function.

  • Training a reward model: A neural network is trained to predict human preferences by learning a reward function that explains the human's choices.

  • Reinforcement learning (RL): The original model is trained to optimize for human approval, not just to imitate humans and its training data. The trained reward model guides the original model through standard RL algorithms, like policy gradient methods, as if the reward came from the environment – even though it comes from human preferences.

Image Credit: “Deep Reinforcement Learning from Human Preferences” paper

Repeating this loop makes it possible to gather more preferences and retrain, gradually improving the model.
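To make the reward-model step more concrete, here is a minimal sketch of how a single human comparison can become a training signal. It assumes a logistic (Bradley-Terry-style) preference model over two scalar rewards, which is a common choice but not necessarily the exact setup of every RLHF system. The function names are ours, purely for illustration.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred, given two scalar rewards.

    Assumption: a logistic model on the reward difference.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def reward_model_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice for one comparison."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


# The loss is small when the reward model agrees with the human label
# and large when it ranks the rejected response higher.
agrees = reward_model_loss(reward_preferred=2.0, reward_rejected=0.0)
disagrees = reward_model_loss(reward_preferred=0.0, reward_rejected=2.0)
```

Minimizing this loss over many comparisons pushes the reward model to score human-preferred responses higher than rejected ones.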

RLHF fundamentally changed the relationship between humans and machines, replacing hardcoded rewards with learned, human-aligned objectives.

The most widespread RL algorithm used in the RLHF approach is Proximal Policy Optimization (PPO). Here’s just a quick reminder of what PPO is, because we need this for some future comparisons: PPO’s idea is to improve the policy steadily without collapsing. It uses feedback from a reward model, which scores the output’s quality, and a reference model, which is used to monitor deviations via KL divergence (Kullback-Leibler divergence). A separate value model predicts the expected reward. PPO calculates the advantage, which is actual vs. expected performance, and then updates the policy using a clipped objective, preventing large shifts in behavior.

Image Credit: Alyona Vert, Turing Post
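The PPO ingredients described above can be sketched in a few lines. This is a simplified, single-sample illustration, not a full PPO implementation: the clip range of 0.2 and the KL coefficient of 0.1 are assumed defaults, and the per-sample KL estimate (log-probability difference) is one common approximation. All function names are hypothetical.

```python
def kl_penalized_reward(reward: float, logp_policy: float, logp_ref: float,
                        kl_coef: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the reference model."""
    return reward - kl_coef * (logp_policy - logp_ref)


def advantage(actual_return: float, predicted_value: float) -> float:
    """Actual vs. expected performance, with the value model as the baseline."""
    return actual_return - predicted_value


def clipped_surrogate(ratio: float, adv: float, clip_eps: float = 0.2) -> float:
    """PPO's clipped objective for one sample (to be maximized).

    ratio = new_policy_prob / old_policy_prob for the sampled action.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * adv, clipped_ratio * adv)


# A large ratio with a positive advantage is capped, so one lucky sample
# cannot push the policy too far in a single update.
capped = clipped_surrogate(ratio=1.5, adv=1.0)
```

Note how the four roles map to the four models PPO-based RLHF keeps around: the policy (the ratio), the reference model (the KL term), the reward model (the reward), and the value model (the baseline).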

Over the past few years, RLHF has surged from niche curiosity to the steel spine underpinning today’s most advanced Reasoning Language Models (RLMs). But it’s not the one and only option for effective training. Many other techniques were created as better variants or alternatives to RLHF.

DPO, RRHF and RLAIF – what is behind these mysterious acronyms? Let’s see.

Direct Preference Optimization (DPO)

The main idea of Direct Preference Optimization (DPO) is to introduce a much easier alternative to RLHF. Instead of going through all the RL steps, Stanford University and CZ Biohub researchers found a way to:

  • Skip the separate reward model

  • Avoid using RL at all

How is this possible? Direct Preference Optimization means that this method trains the model directly from human preferences, using a simpler loss function.

DPO uses a change of variables to transform a loss over reward functions into a loss over the language model itself, called the policy model. In simple words, it compares good vs. bad answers and encourages the model to prefer the better ones.

How does DPO work step-by-step?

  1. You start with a dataset of human feedback. Each data point includes a prompt, two model responses y₁ and y₂, and a label showing which response humans preferred: yw is the “winner” and yl is the “loser”.

  2. Setting up two models:

    • Reference model (πref): A frozen version of the model before fine-tuning that serves as a baseline.

    • Trainable model (πθ): A model that will be updated to better follow human preferences.

    The goal is to teach πθ to prefer the answers that humans also choose, but without drifting too far from πref.

  3. Training the model with the DPO loss:

    DPO uses a reparameterization method to rewrite the preference model in terms of policies directly. Here is what happens in this algorithm:

    • Computing the log odds (relative likelihood) of the model preferring yw over yl relative to the reference model.

    • Applying a sigmoid to turn this into a preference probability.

    • Taking the negative log-likelihood turns this into a binary classification loss, pushing the model to prefer the “winner” answer over the “loser” one.

    • β is a temperature-like hyperparameter that controls how sharply the model is pushed toward the preferred answers and, in effect, how far it may drift from the reference model.

Image Credit: DPO original paper
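The steps above can be sketched for a single preference pair. This is a minimal illustration in plain Python, assuming you already have the summed log-probabilities of each response under both models. In practice these come from batched forward passes in a deep learning framework, and β = 0.1 here is just an assumed example value. The function and argument names are ours.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, winner, loser) triple.

    Each argument is log pi(y | x) for the winner (w) or loser (l)
    under the trainable policy or the frozen reference model.
    """
    # Log-ratios of the policy against the reference model.
    winner_log_ratio = policy_logp_w - ref_logp_w
    loser_log_ratio = policy_logp_l - ref_logp_l
    # Scaled difference of log-ratios, then sigmoid + negative log-likelihood.
    logits = beta * (winner_log_ratio - loser_log_ratio)
    return -log_sigmoid(logits)


# Before any training the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-5.0, -6.0, -5.0, -6.0)
# Once the policy raises the winner and lowers the loser, the loss drops.
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)
```

Averaging this quantity over the whole preference dataset gives the training objective, with only the policy model receiving gradient updates.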

Overall, the gradient of the DPO loss encourages:

  • Increasing the likelihood of the preferred response

  • Decreasing the likelihood of the less preferred response

  • Scaling the strength of each update by how wrong the model is. In DPO, the model learns more from examples where it clearly ranked the pair incorrectly, because those provide the most useful training signal.

This approach directly optimizes the policy to match the human preferences, with no need to define or fit a reward model. Overall, you can treat the language model itself as if it's implicitly learning a reward function just by optimizing a classification-style loss.
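Here is a small sketch of that "implicit reward" idea and of why badly ranked pairs get larger updates. It assumes the implicit reward is β times the log-ratio between the policy and the reference model, and that the per-example update is weighted by a sigmoid of the reward gap. The names and the β value are illustrative assumptions.

```python
import math


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Reward the policy implicitly assigns to a response: beta * log(pi / pi_ref)."""
    return beta * (policy_logp - ref_logp)


def update_weight(reward_winner: float, reward_loser: float) -> float:
    """sigmoid(r_loser - r_winner): close to 1 when the pair is ranked wrongly,
    close to 0 when the model already prefers the winner by a wide margin."""
    return 1.0 / (1.0 + math.exp(-(reward_loser - reward_winner)))


# The model wrongly favors the loser here, so this pair gets a large weight.
wrong = update_weight(reward_winner=-2.0, reward_loser=2.0)
# The model already favors the winner here, so this pair gets a small weight.
right = update_weight(reward_winner=2.0, reward_loser=-2.0)
```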

Why does DPO win over RLHF?

Well, DPO shows that a direct mathematical reformulation can simplify model training, with the following advantages:

  • No need for RL loops.

  • No separate reward model.

  • It optimizes the same objective as reward-based RLHF, but is easier and more stable.

  • Efficiency: DPO just needs a dataset of preferences and one model to train.

  • And best of all, it performs as well as or better than RLHF on many tasks like summarization, dialogue, and sentiment control.

As for the performance results, DPO:

  • Achieved a 61% win rate in summarization, beating PPO’s 57%.

  • Outperforms PPO when tested on new data it wasn’t trained on before.

  • In the sentiment generation task, it achieved the best trade-off between reward and closeness to the reference model.

  • And the main thing – humans preferred DPO’s outputs over PPO’s by 58% to 17%.

Even though DPO simplifies the entire process of human alignment optimization, there are some things it doesn’t handle well (yet).

DPO’s current limitations

  • It can’t collect new feedback during training: DPO only learns from existing preference data – no exploration, no interactive updates.

  • It doesn’t learn an explicit reward model, so you can’t reuse the “reward” logic across tasks or analyze it separately.

  • If the preferences DPO learns from are noisy or inconsistent, it can struggle.

  • It’s easier to use than PPO, but the temperature (β) and sampling setup still affect results.

Next up is another fascinating approach, one that brings a different “philosophy” of what human alignment is.

RRHF: Reward-Rank Hindsight Fine-Tuning

Reward-Rank Hindsight Fine-Tuning (RRHF) is another alternative to the classic RLHF pipeline that avoids explicit RL. Proposed by Alibaba DAMO Academy, it formulates alignment as a ranking problem over model-generated responses and, unlike PPO, avoids juggling multiple models. Here is how it works in practice.

RRHF workflow

Image Credit: RRHF original paper

  1. Collecting multiple responses: For each prompt or question (x), RRHF gathers k different answers (y₁, y₂, ..., yₖ). Unlike PPO, RRHF can use any kind of response, not just the ones from its own model. So these answers can come from:

    • the original model

    • the training model

    • other large language models like ChatGPT or GPT-4

    • human-written responses

  2. Scoring the responses

    Each answer receives two scores:

    • A reward score (rᵢ) from human feedback or a reward model.

    • A model score (pᵢ) from the training model – this is how likely the model thinks that answer is, based on its internal probabilities. To calculate it, the model sums the log probabilities of each token in the response and divides by the response length, so short answers aren’t unfairly favored.

  3. Training the model to prefer better answers: RRHF’s goal is to teach the model to prefer better answers, so if reward rⱼ > rᵢ, then the model should score pⱼ > pᵢ. This is achieved using a ranking loss (Lrank), whose core idea is: "If the model gives a higher score to a worse response, we penalize it."

    This encourages the model to learn the correct ranking of responses based on human feedback.

  4. A bit of Supervised Fine-Tuning (SFT): To make the model learn more deeply from the best answer, cross-entropy loss is used. It pushes the model to imitate that top answer, and the Total loss = Ranking loss + SFT loss.
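The whole RRHF workflow above fits in a short sketch. It assumes you already have per-token log-probabilities of each candidate response under the training model and a reward for each response. The ranking loss is written as a simple pairwise hinge, max(0, pᵢ − pⱼ) whenever rᵢ < rⱼ, and the function names are ours, for illustration only.

```python
from typing import List, Sequence


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """Model score p_i: average per-token log-probability of a response."""
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(model_scores: Sequence[float], rewards: Sequence[float]) -> float:
    """Penalize every pair where a worse response gets a higher model score."""
    loss = 0.0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            if rewards[i] < rewards[j]:
                loss += max(0.0, model_scores[i] - model_scores[j])
    return loss


def sft_loss(best_token_logprobs: Sequence[float]) -> float:
    """Cross-entropy on the best response: negative sum of its token log-probs."""
    return -sum(best_token_logprobs)


def rrhf_loss(all_token_logprobs: List[List[float]], rewards: List[float]) -> float:
    """Total loss = ranking loss + SFT loss on the highest-reward response."""
    scores = [length_normalized_score(lp) for lp in all_token_logprobs]
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return ranking_loss(scores, rewards) + sft_loss(all_token_logprobs[best])


# Two candidate responses: the first has the higher reward and is also the one
# the model finds more likely, so only the SFT term contributes.
total = rrhf_loss([[-0.5, -0.5], [-1.0, -1.0, -1.0]], rewards=[1.0, 0.0])
```

Because the candidate responses are sampled and scored before training, everything here is ordinary supervised computation over a fixed dataset.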

Overall, RRHF can be seen as an extension of RLHF or a variation of other methods, for example:

  • If we use only one human-written response, RRHF becomes regular SFT.

  • The model can act like a reward model using probabilities instead of a separate classifier.

  • It can be seen as a much simpler counterpart to PPO, because it also trains a model to match human preferences.

Advantages of RRHF

Like DPO, RRHF is usually compared to PPO. So, in this battle with PPO, RRHF is:

  • Easier to implement – needs only about 30 extra lines of code compared to standard fine-tuning.

  • Lighter – requires just 1 or 2 models (not 4), so it’s less memory-intensive.

  • Free of a KL-divergence penalty, because sampling is done before training, not during.

Also, RRHF shows significant performance gains:

  • The best reward score of -0.96 on the single-turn dialogue task.

  • 61.75% accuracy when ranking human-preferred responses, which is well above vanilla LLMs and PPO.

  • RRHF-trained models were consistently preferred by human evaluators in testing.

Image Credit: RRHF original paper

However…

Not without limitations

  • RRHF relies heavily on the quality and integrity of the reward model, which may not fully reflect real human preferences.

  • Requires more GPU memory during training because it needs multiple responses for the proper workflow.

  • It becomes more complex and fragile when extended to online settings.

  • Can fall into the same traps of reward hacking as other RLHF approaches.

Finally, we’re moving to the last method. It introduces the idea that human annotators can be replaced by AI for greater efficiency, even when the goal is still alignment with human preferences. How can human data not be involved here? Let’s see.

Reinforcement Learning from AI Feedback (RLAIF): Replacing Humans with AI

We all already know (especially after our recent overview of synthetic data generation with human-in-the-loop) that gathering high-quality human data is quite slow and expensive, which is why many turn to AI for help. Can we also use AI to give feedback instead of people in methods like RLHF? That’s exactly what the Reinforcement Learning from AI Feedback (RLAIF) approach by Google DeepMind does.

Image Credit: RLAIF original paper

How to actually replace human feedback with AI?

  1. Proper prompting

To replace human annotators in RL pipelines, the researchers use a general-purpose “off-the-shelf” LLM not specialized for the target task. This model is prompted to decide which of two responses is better for a given input. But the effectiveness of this method depends on how well the prompt is structured and how reliably the model can express its preferences. So the AI is shown a prompt with these parts:

  • Preamble: Instructions like “Which response is better?”

  • Few-shot examples (optional): Examples of past responses with an explanation and a “winner” answer.

  • Actual task: A new input and two responses to compare.

  • Ending cue: A line like “Preferred Response =” to signal the model to choose.

The model then guesses “1” or “2,” and the probabilities of these guesses are used to create a soft preference score, for example, 60% for Response 1, 40% for Response 2, to understand how much the model preferred one answer over the other.

To reduce the positional bias, the researchers ran each comparison twice, switching the order of responses. Then they averaged the results to get a fair preference score.
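Here is a rough sketch of this labeling step. The prompt layout follows the four parts listed above, but the exact wording is our assumption. `score_fn` is a hypothetical stand-in for a call to the labeler LLM that returns the log-probabilities of the tokens "1" and "2" for a given ordering of the two responses. No real LLM API is used here.

```python
import math
from typing import Callable, Tuple


def build_prompt(preamble: str, task_input: str, response_1: str,
                 response_2: str, few_shot: str = "") -> str:
    """Assemble preamble, optional few-shot examples, the task, and the ending cue."""
    parts = [preamble]
    if few_shot:
        parts.append(few_shot)
    parts.append(f"Text: {task_input}\nResponse 1: {response_1}\nResponse 2: {response_2}")
    parts.append("Preferred Response =")
    return "\n\n".join(parts)


def soft_preference(logprob_1: float, logprob_2: float) -> Tuple[float, float]:
    """Softmax over the log-probabilities of the tokens "1" and "2"."""
    m = max(logprob_1, logprob_2)
    e1, e2 = math.exp(logprob_1 - m), math.exp(logprob_2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)


def debiased_preference(score_fn: Callable[[str, str], Tuple[float, float]],
                        response_a: str, response_b: str) -> float:
    """Preference for response A, averaged over both presentation orders."""
    p_a_shown_first, _ = soft_preference(*score_fn(response_a, response_b))
    _, p_a_shown_second = soft_preference(*score_fn(response_b, response_a))
    return 0.5 * (p_a_shown_first + p_a_shown_second)


# A fake labeler that always favors whatever is shown first (pure position bias).
def biased_labeler(first: str, second: str) -> Tuple[float, float]:
    return math.log(0.7), math.log(0.3)


pref_a = debiased_preference(biased_labeler, "answer A", "answer B")
```

With the purely position-biased fake labeler, the two orderings cancel out and the averaged preference lands at 0.5, which is exactly the point of running each comparison twice.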

  2. Encouraging AI to explain its choice via Chain-of-Thought (CoT) Reasoning

If models slow down and explain their reasoning, this results in more accurate answers. Here in RLAIF, CoT reasoning makes the model give better preferences and more human-like judgments as it explains its choice:

  • First, it writes a rationale, for example “Response A is more accurate because…”

  • Then the original task + rationale is fed back into the model to generate a preference score.

Image Credit: RLAIF original paper

  3. Final step: Training the Policy Model

Once the preference data is ready, it’s used to train a policy model. There are two ways to do this:

  • Canonical RLAIF

    It’s a standard setup that mimics traditional RLHF workflow but with AI feedback:

    • The AI-generated preferences (soft labels) are used to train a Reward Model (RM) that predicts which responses are good or bad. It learns from the probability-weighted scores, like [0.7, 0.3], using a cross-entropy loss.

    • Then this RM is used to train the policy model via RL. The RM acts as a reward function, and the policy model’s goal is to achieve a high reward from it.

  • Direct-RLAIF (d-RLAIF)

    It’s a faster, simpler approach which is also reward-model-free to avoid the difficulty of constant retraining of the RM. The name of this method speaks for itself:

    • The policy model gets scores directly from an off-the-shelf LLM at every training step, so the outputs can be scored on the fly during training.

    • The LLM is asked to rate responses on a scale from 1 to 10.

    • These scores are turned into a probability distribution, then converted into a normalized reward between –1 and 1.

Image Credit: d-RLAIF, RLAIF original paper
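Both training routes can be illustrated with a small sketch. The first function shows a cross-entropy loss against AI soft labels for the canonical route. The second turns the labeler's likelihoods for the ratings 1 to 10 into a reward for the direct route. The linear rescaling to [–1, 1] is our assumption about how the normalization could be done, and all names are hypothetical.

```python
import math
from typing import Sequence


def soft_label_cross_entropy(rm_score_1: float, rm_score_2: float,
                             ai_pref_1: float) -> float:
    """Canonical RLAIF: cross-entropy between the AI soft label
    [ai_pref_1, 1 - ai_pref_1] and the reward model's predicted preference."""
    diff = rm_score_1 - rm_score_2
    log_p1 = -math.log1p(math.exp(-diff))  # log P(response 1 is better)
    log_p2 = -math.log1p(math.exp(diff))   # log P(response 2 is better)
    return -(ai_pref_1 * log_p1 + (1.0 - ai_pref_1) * log_p2)


def direct_reward(rating_logprobs: Sequence[float]) -> float:
    """d-RLAIF: rating_logprobs[k] is the labeler's log-likelihood of rating k + 1.

    Returns the expected rating, linearly rescaled to [-1, 1] (assumed mapping).
    """
    m = max(rating_logprobs)
    weights = [math.exp(lp - m) for lp in rating_logprobs]
    total = sum(weights)
    expected = sum((k + 1) * w / total for k, w in enumerate(weights))
    lowest, highest = 1, len(rating_logprobs)
    return 2.0 * (expected - lowest) / (highest - lowest) - 1.0


# A labeler that is completely unsure spreads probability evenly over 1..10,
# which gives an expected rating of 5.5 and a neutral reward.
neutral = direct_reward([0.0] * 10)
```

In the direct route this reward would be computed on the fly for every sampled response, which is why d-RLAIF trades reward-model training for extra labeler inference.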

As we can see there are many optimizations in RLAIF compared to RLHF, but…

Does RLAIF win over RLHF?

Experiments show that RLAIF can match RLHF performance, outperform it on harmlessness, and reduce costs by over 10×. Just look at the results below:

Image Credit: RLAIF original paper

However, things get a lot more interesting with the d-RLAIF method. On summarization, d-RLAIF beat Supervised Fine-Tuning (SFT) 74% of the time, compared to 71% for canonical RLAIF and 73% for standard RLHF. d-RLAIF is also preferred over canonical RLAIF 60% of the time.

What does this mean?

As we see it, d-RLAIF is like a mixture of RLHF with AI feedback and DPO concepts, which gives it an advantage from both “styles” of human alignment optimization.

Another notable thing is that it doesn’t need a huge separate model for feedback – RLAIF works well even when the trainer and trainee are the same size. This gives a great opportunity for small models, because they can effectively help themselves get better.

While RLAIF and d-RLAIF offer scalable alternatives to human feedback, this human-free approach also meets some limitations.

RLAIF’s issues

  • LLM-generated preferences may misalign with human judgment, especially on specific tasks.

  • Performance depends heavily on how prompts are phrased. Careful manual prompt engineering is needed to really get the benefits from CoT and few-shot learning.

  • RLAIF reward models can become stale, while d-RLAIF requires expensive inference.

  • While small models can benefit from RLAIF-style training, big models still give better-quality labels.

Do you have a sense of which method might suit your goals?

Conclusion: Comparison of the methods

In this overview, we’ve explored the landscape of three methods for aligning AI models with human values and preferences: Direct Preference Optimization (DPO), Reward-Rank Hindsight Fine-Tuning (RRHF), and Reinforcement Learning from AI Feedback (RLAIF + d-RLAIF). And here’s the main question – what to choose and when?

  1. DPO demonstrates top overall efficiency, showing that simplicity matters. With no need for a reward model or RL, it’s the best option when you want fast, stable fine-tuning on quality preference data. However, it may lack flexibility and doesn’t encourage exploration.

  2. RRHF trains only on comparisons and top-choice imitation. It’s highly flexible, as it can work with multiple types of responses. RRHF is best for cases where you want alignment but don’t want PPO overhead with 4 models. However, this method may struggle in online learning.

  3. RLAIF and d-RLAIF fully replace human annotators. They are especially strong with CoT prompting, perfectly aligning with the current trend of Reasoning Models. RLAIF is best for automating reward generation and training at scale, and what’s more, d-RLAIF performs surprisingly close to RLHF with much less human input. However, like all approaches that explicitly use AI, RLAIF still needs human curation.

There’s no one-size-fits-all solution. All these methods contribute to the huge family of human preference alignment methods, and we just need to know what we’re working with to use them where they fit best. Maybe these methods in particular will encourage building something even closer to our human preferences in the future. Overall, this topic will always remain relevant because AI works with us, humans, and we need to speak the same language and share the same values.

Sources and further reading

Resources from Turing Post
