Token 1.16: Understanding RLHF and its use cases

Deep dive into a groundbreaking alignment technique with a lot of potential

Ksenia Se
January 10, 2024

In early 2022, OpenAI introduced InstructGPT, a significant advancement over GPT-3. Rather than changing GPT-3's architecture, it fine-tuned the model to follow instructions more reliably, combining GPT-3's linguistic skills with better task compliance and setting a new benchmark in language model development. That fine-tuning relied on Reinforcement Learning from Human Feedback (RLHF), which played a crucial role in improving the model's ability to understand and follow written instructions and formed the basis for ChatGPT's capabilities.

RLHF is a fascinating blend of human intuition and machine learning. Its roots trace back to the early experiments in reinforcement learning (RL), where the idea was simple yet profound: teach machines to learn from their environment, much like a child learns from experience. But RLHF took a groundbreaking turn by adding a critical ingredient – human feedback.

Since then, RLHF has increasingly caught the attention of experts and practitioners. They are keen to understand why it works, how it functions, and what its future applications might be. RLHF is a story of how AI started to 'listen' and 'learn' from us, opening new avenues where machines could perform complex tasks with a level of understanding previously thought impossible. Let’s take a closer look at this important technique for fine-tuning and alignment. In today’s Token:

RLHF origins – did you know it started with robotics?
Why RLHF is gaining traction
How RLHF works
What are probable use cases for RLHF?
RLHF's problems and limitations
Conclusion
Bonus resources for further reading

RLHF origins

Did you know that initially, RLHF was used to guide the learning process in complex tasks like playing Atari games or simulating robot locomotion, and it was only much later that it was applied to improve text generation results by LLMs?

Its origins are closely tied to preference-based reinforcement learning (PbRL). The key innovation was deriving objectives from qualitative feedback, like pairwise preferences, rather than relying solely on numerical rewards. At first, RLHF was essentially a synonym for this concept, focusing on learning from relative feedback. Since then, it has evolved into a more comprehensive framework that covers feedback types beyond PbRL's scope: binary trajectory comparisons, rankings, state and action preferences, critique, scalar feedback, corrections, action advice, implicit feedback, and natural language inputs. This expanded scope positions RLHF as a generalized approach, adaptable to diverse scenarios where human feedback shapes reinforcement learning goals.

A pivotal moment in RLHF's development was the introduction of methods that train reinforcement learning systems using human preferences instead of predefined reward functions. Notably, joint research in 2017 by OpenAI and DeepMind, with contributions from Dario Amodei, who later co-founded Anthropic, exemplified this advancement. Their work showed that human feedback on less than 1% of the agent's interactions with its environment was enough to guide complex tasks like playing Atari games and simulating robot locomotion.
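To make the idea of learning an objective from pairwise preferences concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method from any specific paper: the reward is assumed to be linear in two made-up features, preferences are modeled with a Bradley-Terry style probability (a common choice for pairwise comparisons), and the "annotator" is a hypothetical simulated one that always prefers the option with the higher hidden reward. All function names are invented for this example.

```python
import math
import random


def reward(w, features):
    """Linear reward estimate: a stand-in for a learned reward model."""
    return sum(wi * xi for wi, xi in zip(w, features))


def preference_probability(w, a, b):
    """Bradley-Terry probability that option a is preferred over option b."""
    return 1.0 / (1.0 + math.exp(reward(w, b) - reward(w, a)))


def fit_reward_weights(pairs, dim, lr=0.5, epochs=200):
    """Fit reward weights from (preferred, rejected) pairs by gradient descent
    on the average negative log-likelihood of the observed preferences."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for preferred, rejected in pairs:
            p = preference_probability(w, preferred, rejected)
            for i in range(dim):
                # d/dw of -log p is -(1 - p) * (preferred - rejected)
                grad[i] -= (1.0 - p) * (preferred[i] - rejected[i])
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w


def simulate_preferences(hidden_w, n_pairs, seed=0):
    """Hypothetical annotator: always prefers the option with higher hidden reward."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        a = [rng.uniform(-1, 1) for _ in hidden_w]
        b = [rng.uniform(-1, 1) for _ in hidden_w]
        pairs.append((a, b) if reward(hidden_w, a) >= reward(hidden_w, b) else (b, a))
    return pairs


hidden_w = [2.0, -1.0]  # never shown to the learner; only comparisons are
pairs = simulate_preferences(hidden_w, n_pairs=200)
learned_w = fit_reward_weights(pairs, dim=2)
print("learned weights:", learned_w)
print("P([1, 0] preferred over [0, 1]):",
      preference_probability(learned_w, [1, 0], [0, 1]))
```

The learner never sees a numerical reward, only which of two options was preferred, yet it recovers weights that rank options in the same direction as the hidden reward. In a full RLHF setup, a reward estimate of this kind would then serve as the training signal for a reinforcement learning agent.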

RLHF became the talk of the town when it enabled the transition from GPT-3 to ChatGPT. It was first implemented in InstructGPT and explained in the paper Training language models to follow instructions with human feedback.

Why RLHF is gaining traction

