RLHF, or Reinforcement Learning from Human Feedback, is a training method that aligns AI models with human preferences. It collects human judgments, trains a reward model, and uses reinforcement learning to make outputs more helpful, safe, and relevant, and better at following instructions.
In early 2022, OpenAI introduced InstructGPT, a significant advancement over GPT-3. The model kept GPT-3's architecture and language abilities but was fine-tuned to follow instructions more reliably, setting a new benchmark in language model development. Its refinement relied on Reinforcement Learning from Human Feedback (RLHF), which played a crucial role in improving its ability to understand and follow written instructions and formed the basis for ChatGPT's capabilities.
RLHF is a fascinating blend of human intuition and machine learning. Its roots trace back to the early experiments in reinforcement learning (RL), where the idea was simple yet profound: teach machines to learn from their environment, much like a child learns from experience. But RLHF took a groundbreaking turn by adding a critical ingredient – human feedback.
Since then, RLHF has increasingly caught the attention of experts and practitioners. They are keen to understand why it works, how it functions, and what its future applications might be. RLHF is a story of how AI started to 'listen' and 'learn' from us, opening new avenues where machines could perform complex tasks with a level of understanding previously thought impossible. Let’s take a closer look at this important technique for fine-tuning and alignment. In today’s Token:
RLHF origins – did you know it started with robotics?
Why RLHF is gaining traction
How RLHF works
What are probable use cases for RLHF?
RLHF's problems and limitations
Conclusion
Bonus resources for further reading
RLHF Origins: From Robotics to LLM Alignment
Did you know that initially, RLHF was used to guide the learning process in complex tasks like playing Atari games or simulating robot locomotion, and it was only much later that it was applied to improve text generation results by LLMs?
Its origins are closely tied to preference-based reinforcement learning (PbRL). The key innovation was deriving objectives from qualitative feedback, like pairwise preferences, rather than relying solely on numerical rewards. At first, RLHF was essentially a synonym for this concept, focusing on learning from relative feedback. Since then, it has evolved into a more comprehensive framework that covers feedback types beyond PbRL's scope: binary trajectory comparisons, rankings, state and action preferences, critique, scalar feedback, corrections, action advice, implicit feedback, and natural language inputs. This expanded scope positions RLHF as a generalized approach, adaptable to diverse scenarios where human feedback shapes reinforcement learning goals.
A pivotal moment in RLHF's development was the introduction of methods that train reinforcement learning systems using human preferences instead of predefined reward functions. Notably, joint research in 2017 by OpenAI and DeepMind, with contributions from Dario Amodei, who later co-founded Anthropic, exemplified this advancement. Their work showed that human feedback on less than 1% of the agent's interactions with its environment was enough to guide complex tasks like playing Atari games and simulating robot locomotion.
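For readers who want the math behind learning from comparisons, here is a sketch of the preference model commonly used in this line of work (a Bradley-Terry form). The symbols are assumptions introduced here for illustration, not notation from the text above: \hat{r} is the learned reward estimate, \sigma^1 and \sigma^2 are two trajectory segments shown to a human, (o_t, a_t) are observations and actions, \mathcal{D} is the set of collected comparisons, and \mu is the human's label saying which segment is preferred.

```latex
\hat{P}\left[\sigma^1 \succ \sigma^2\right]
  = \frac{\exp \sum_t \hat{r}\left(o_t^1, a_t^1\right)}
         {\exp \sum_t \hat{r}\left(o_t^1, a_t^1\right)
          + \exp \sum_t \hat{r}\left(o_t^2, a_t^2\right)}

\mathcal{L}(\hat{r})
  = -\sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal{D}}
    \Big[ \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right]
        + \mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right] \Big]
```

In words: the segment with the higher summed reward estimate should be the one the human preferred, and the reward estimate is fit by minimizing a cross-entropy loss against the human labels.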
RLHF gained wide attention when it enabled the transition from GPT-3 to ChatGPT. It was first implemented in InstructGPT and explained in the paper Training language models to follow instructions with human feedback.
Why RLHF Matters: Role in LLM Training
Modern transformer-based LLMs like ChatGPT undergo three main training stages:
Pretraining: Learning from vast, unlabeled text datasets using next-word prediction.
Supervised fine-tuning: Refining responses with specific instruction-output pairs.
Alignment: Using RLHF to align the model with human preferences and safety guidelines.
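To make the first stage in the list above concrete, here is a minimal sketch of a next-word (next-token) prediction loss. It uses plain Python rather than a deep learning framework, and the function names and the toy four-token vocabulary are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Average negative log-likelihood of the true next token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy usage: one position, vocabulary of 4 tokens, the true next token is id 2.
uninformed = next_token_loss([[0.0, 0.0, 0.0, 0.0]], [2])
confident = next_token_loss([[0.0, 0.0, 5.0, 0.0]], [2])
```

The loss is lower when the model puts more probability on the token that actually came next, which is all that pretraining optimizes for. The later stages exist because this objective alone says nothing about what people find helpful.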
RLHF is effective because it tunes a model toward the responses people actually prefer and lets that tuning be repeated as new feedback arrives, so AI models can stay updated with changing conditions and user feedback. This is especially beneficial in dynamic fields like fraud detection, where new patterns must be identified and adapted to quickly.
However, RLHF faces challenges such as ensuring consistent and unbiased human feedback, and the potential for biases inherent in human judgments. Additionally, RLHF models may have limitations in generalizing beyond specific scenarios encountered during training, which can affect their performance in new situations.
Despite these challenges, RLHF is poised for growth, particularly in its combination with LLMs. Ongoing research aims to address its limitations, enhancing the reliability and effectiveness of RLHF in various applications.
How RLHF works
For simplicity, Sebastian Raschka suggests looking at the RLHF pipeline in three separate steps:
RLHF Step 1: Supervised Fine-tuning of the Pretrained Model
Objective: The first step involves adapting a pretrained model to more closely align with specific tasks or objectives. This is done through supervised learning, where the model is trained with a dataset consisting of input-output pairs that represent desired responses.
Process: The pretrained model, which has already learned general language patterns and context from extensive datasets, is further trained on a smaller, task-specific dataset. This dataset consists of examples that are more closely related to the final application of the model.
Outcome: The result is a fine-tuned model that is better at handling specific types of prompts or tasks, as it has been trained on examples that are representative of the kind of output desired in its eventual application.
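Step 1 can be sketched in a few lines. The example below assumes one common convention that the text above does not specify: the prompt and the desired response are concatenated, and only the response tokens count toward the loss. The function names and the toy log-probabilities are hypothetical.

```python
def build_sft_example(prompt_tokens, response_tokens):
    """Concatenate prompt and response, and mark which tokens count toward the loss."""
    tokens = list(prompt_tokens) + list(response_tokens)
    # Assumption: prompt tokens are masked out, response tokens are trained on.
    loss_mask = [False] * len(prompt_tokens) + [True] * len(response_tokens)
    return tokens, loss_mask


def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over response tokens only."""
    kept = [-logprob for logprob, keep in zip(token_logprobs, loss_mask) if keep]
    if not kept:
        raise ValueError("example has no response tokens")
    return sum(kept) / len(kept)


# Toy usage: a two-token prompt followed by a one-token desired response.
tokens, mask = build_sft_example(["Translate", ":"], ["Bonjour"])
loss = sft_loss([-5.0, -4.0, -0.5], mask)  # only the last log-probability is used
```

Masking the prompt is a design choice, not a requirement: it keeps the model from spending capacity on reproducing instructions and focuses the update on the desired output.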
RLHF Step 2: Creating a Reward Model
Objective: This step involves developing a model that can assign a reward (or score) to the outputs generated by the fine-tuned model. The reward reflects how well an output aligns with human preferences or objectives.
Process: To create this model, human feedback is collected. This could involve humans rating the quality of outputs, choosing between pairs of outputs, or providing more nuanced feedback. The collected feedback is then used to train the reward model, teaching it to predict the quality of model outputs based on human preferences.
Outcome: The reward model serves as a critical component that guides the AI in generating outputs that are more aligned with human expectations and preferences.
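When the human feedback comes as a choice between two outputs, a widely used way to train the reward model is a pairwise loss: the preferred output should receive a higher score than the rejected one. Here is a minimal sketch under that assumption. The function name is hypothetical, and the scores would come from the reward model itself in a real pipeline.

```python
import math


def pairwise_reward_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed without overflow."""
    margin = score_chosen - score_rejected
    if margin >= 0:
        # log(1 + exp(-margin)) is safe when margin is non-negative
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten so the exponent stays non-positive
    return -margin + math.log1p(math.exp(margin))


# Toy usage: the loss shrinks as the preferred output is scored further ahead.
tied = pairwise_reward_loss(1.0, 1.0)
correct_order = pairwise_reward_loss(3.0, 0.0)
wrong_order = pairwise_reward_loss(0.0, 3.0)
```

Only the difference between the two scores matters here, so the reward model learns a relative ranking rather than an absolute quality scale.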
RLHF Step 3: Fine-tuning via Proximal Policy Optimization (PPO)
Objective: This step updates the fine-tuned model from Step 1 so that its outputs earn higher scores from the reward model built in Step 2.
Process: PPO is an algorithm that updates the model's parameters to maximize expected reward while keeping each update relatively stable. The model generates outputs, receives scores from the reward model, and uses these scores to adjust its parameters so that high-reward outputs become more likely in the future.
Outcome: PPO helps the model align with human feedback while balancing exploitation of known successful strategies against exploration of new ones. The intended result is a robust model that generates high-quality, human-aligned outputs.
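The "stable updates" idea in Step 3 comes from PPO's clipped objective, which can be sketched for a single action as below. The second function adds a penalty for drifting away from a frozen reference model. That penalty is not described in the text above and is included as an assumption because it is common in RLHF setups. All names and the default values for clip_eps and beta are illustrative.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # how much more likely the action became
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to move the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_shaped_reward(reward_model_score, logp_policy, logp_reference, beta=0.1):
    """Assumed reward shaping: subtract a penalty for diverging from a reference model."""
    return reward_model_score - beta * (logp_policy - logp_reference)


# Toy usage: the action became twice as likely and had a positive advantage,
# so the objective is capped at (1 + clip_eps) * advantage.
capped = clipped_surrogate(math.log(2.0), 0.0, advantage=1.0)
```

The clipping is what keeps a single batch of high reward scores from pushing the model too far in one step.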
That scheme follows the InstructGPT recipe used for GPT models. Llama-2-chat starts with the same supervised fine-tuning approach, but it differs in RLHF Step 2 by training two reward models (one for helpfulness and one for safety), and it refines the model over several iterative rounds that use rejection sampling for continuous improvement.
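Rejection sampling, mentioned above, is simple to sketch: generate several candidate responses, score each with a reward model, and keep the best one. The version below is a generic best-of-N illustration, not Llama-2-chat's actual implementation. The generate and score callables stand in for a language model and a reward model, and every name here is hypothetical.

```python
def rejection_sample(prompt, generate, score, num_candidates=4):
    """Sample several responses and return the one the reward function scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [generate(prompt) for _ in range(num_candidates)]
    return max(candidates, key=lambda response: score(prompt, response))


# Toy usage with stand-ins: a "model" that replays canned answers and a
# "reward model" that simply prefers longer answers.
canned = iter(["ok", "a detailed answer", "short"])
best = rejection_sample(
    "Explain RLHF",
    generate=lambda prompt: next(canned),
    score=lambda prompt, response: len(response),
    num_candidates=3,
)
```

In an iterative pipeline, the selected responses can then serve as new fine-tuning targets for the next round.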
RLHF Use Cases: NLP, Robotics, Multimodal AI
According to Nathan Lambert (one of our favorite experts on RLHF), RLHF helps steer large models in a human-aligned way when vanilla supervised learning fails, but scaling it remains challenging. Nonetheless, RLHF offers a diverse range of applications, enhancing AI capabilities in various domains. Here's a summary of where the RLHF concept might make a difference:
Robotics: RLHF is used to teach robots human-like behavior, supporting complex tasks through interactive, playful exchanges with people.
Video Game Development: It's instrumental in enhancing game agents’ capabilities and improving their actions with human feedback.
Natural Language Processing (NLP): RLHF refines email response generation, text summarization, and conversational agents, making language models more attuned to complex human values.
Multimodal RLHF Applications:
Data Balancing: Effectively integrates language and image data to avoid overfitting to text.
Multimodal Chat and Visual Language Models (VLMs): Adapts existing datasets for VLMs, exploring their future use.
Addressing Challenges in Multimodal Datasets: Focuses on the complexity of creating these datasets, including safety considerations.
Multi-Objective RLHF: Investigates multi-objective optimization in multimodal contexts, balancing factors like helpfulness, factuality, and safety.
Research and Resources: Utilizes insights from studies on multimodal interactive agents and preference distillation in VLMs.
Synthetic Data Usage: Innovatively employs synthetic data for cost-effective multimodal system development.
Domain-Specific Model Improvement:
Fine-tuning LLMs: RLHF refines models like ChatGPT, enhancing their ability to capture nuances in dialogue style.
Domain-Specific Tasks: It improves models for specific tasks, such as code generation, by defining computational reward functions.
LLM Evaluation: Assesses models like GPT-4 for conversational quality, though challenges in training signals remain.
Quality Training Data Collection: Iteratively collects human assessments to build and evolve a reward model for dialogue.
Customized Self-Hosted Models: Enables the fine-tuning of general LLMs for niche applications, balancing against large commercial models.
Overcoming Instruction Fine-Tuning Limitations: Helps better sample potential interaction spaces, inducing varied model behaviors.
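The "computational reward functions" mentioned in the domain-specific tasks item above can be illustrated with code generation, where a reward can be computed by running tests instead of asking a person. This is a toy sketch under stated assumptions: the function name and the fraction-of-tests-passed scoring are hypothetical, and real systems would run generated code in a sandbox.

```python
def unit_test_reward(candidate_source, function_name, test_cases):
    """Reward = fraction of test cases the generated function passes (0.0 to 1.0)."""
    if not test_cases:
        raise ValueError("at least one test case is required")
    namespace = {}
    try:
        # Toy only: never exec untrusted model output outside a sandbox.
        exec(candidate_source, namespace)
        func = namespace[function_name]
    except Exception:
        return 0.0  # code that fails to load or define the function earns nothing
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this input counts as a failed test
    return passed / len(test_cases)


# Toy usage: two generated candidates for an "add" function.
cases = [((1, 2), 3), ((0, 0), 0)]
good = unit_test_reward("def add(a, b):\n    return a + b\n", "add", cases)
buggy = unit_test_reward("def add(a, b):\n    return a - b\n", "add", cases)
```

A reward like this can be used in place of, or alongside, a learned reward model when correctness can be checked automatically.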
These use cases highlight RLHF's versatility in not only improving AI performance in language processing and interactive robotics but also in pioneering multimodal applications and specific domain enhancements.
RLHF's problems and limitations
A key challenge in RLHF is its impact on users, including potential biases and moral implications. Additionally, gathering high-quality human feedback data is costly and labor-intensive, often requiring external human resources. The paper Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback groups RLHF's open problems into three main categories:
Human Feedback: Challenges include misaligned human evaluators, difficulty in oversight, evaluating complex tasks, potential for misleading evaluations, data quality issues, and limitations in feedback types.
Reward Model: Challenges include problem misspecification, reward misgeneralization, and evaluating reward models.
Policy: Challenges include robust reinforcement learning, misgeneralization, power-seeking by agents, and distributional issues.
Conclusion
RLHF is like a powerful tool that's still being fine-tuned. So far it has been crucial for the success of current LLMs, but it's far from perfect. The big question now is how RLHF will grow, especially in areas beyond just text. We're seeing new methods that might train these models more efficiently. Yet, as RLHF evolves, it faces challenges like privacy, biases, and security risks.
Right now, making RLHF work for multimodal applications (like combining text and images) is still in its early stage. It needs more development before it can match what we've achieved with text. So, RLHF is a key part of AI's future, but there's a lot of room to grow and improve.
We will look into RLHF alternatives in one of the next Tokens. Stay tuned.
Bonus resources for further reading
Turing Post
FAQ
What is the purpose of RLHF?
The purpose of RLHF is to make AI models better aligned with human expectations. RLHF teaches models to prefer responses that humans judge as more helpful, accurate, safe, natural, or useful for a specific task, instead of optimizing only for next-token prediction.
What is RLHF in ChatGPT?
In ChatGPT, RLHF is one of the alignment methods used to make a pretrained language model follow instructions and produce more human-preferred answers. Human feedback helps train a reward model, and reinforcement learning then fine-tunes the model toward responses that better match user intent.
What is the difference between RL and RLHF?
Reinforcement learning trains an agent to maximize rewards from an environment. RLHF is a version of reinforcement learning where the reward signal comes from human feedback, such as rankings, comparisons, ratings, corrections, or preferences. So in short, RL learns from rewards; RLHF learns from human-guided rewards.
Can RLHF be automated?
RLHF can be partly automated, but not fully replaced by automation in most high-quality alignment pipelines. AI systems can generate feedback, rank outputs, or assist evaluation, but human judgment is still important for complex preferences, safety, nuance, and values that are hard to capture with automatic rules alone.










