Token 40: What are GRPO and Flow-GRPO?

We explore the very clever algorithm behind DeepSeek-R1's breakthrough and how its latest adaptation, Flow-GRPO, makes RL real for flow models.

Innovation doesn’t always mean starting from scratch – it often means rethinking the fundamentals. That’s the case with DeepSeek’s Group Relative Policy Optimization (GRPO), a novel twist on traditional reinforcement learning methods like PPO. GRPO powers models such as DeepSeek-R1 and DeepSeek-R1-Zero, helping to adapt reinforcement learning (RL) specifically for large language models (LLMs). As a result, it’s become one of the hottest AI topics of 2025.

GRPO was created as a strong alternative to the traditionally used Proximal Policy Optimization (PPO), which requires a value (critic) model that roughly doubles memory and compute. GRPO gets rid of the critic model, letting a model learn from its own outputs. This workflow makes it a faster, more efficient method – GRPO shines in reasoning-heavy tasks like math and coding, and allows models to perform long Chain-of-Thought (CoT) reasoning. After the DeepSeek-R1 breakthrough, we also witnessed a wave of GRPO implementations in many studies, and just recently an interesting development – Flow-GRPO – came out. It finally allows RL to be applied to flow models (meaning – images!).

So let’s break down what makes GRPO so special and efficient, how it works, and how it can be adapted to different modalities, using flow models and Flow-GRPO as an example. Super interesting!


In today’s episode, we will cover:

  • Why do we need GRPO? PPO limitations

  • How does GRPO work?

  • Why does GRPO win?

  • Implementation: DeepSeek-R1 and DeepSeek-R1-Zero

  • What is Flow-GRPO?

  • Limitations of GRPO

  • Conclusion

  • Sources and further reading

Why do we need GRPO? PPO limitations

To thoroughly understand what GRPO is, we need to look at what it rivals. In 2017, OpenAI introduced Proximal Policy Optimization (PPO), an RL algorithm used to teach models and agents how to make good decisions by interacting with an environment. Its general goal is to optimize a model and align its outputs with desired behaviors. PPO later became a classic, widely used in robotics, autonomous systems, training agents to play games like Atari, and more. What’s the secret?

Stability. PPO improves an agent’s behavior, or policy, but it does this in small steps, avoiding big, risky changes:

  • Data Collection
    The agent interacts with the environment and collects data – actions taken, states visited, and rewards received.

  • Advantage Calculation
    PPO uses a value function to estimate how good each action was. Specifically, it calculates the advantage (actual reward – expected reward).
    ➝ Generalized Advantage Estimation (GAE) helps smooth this by combining short- and long-term reward signals.

  • Policy Update via Clipped Objective
    PPO compares the new policy to the old one using a probability ratio. To ensure stability:
    ➝ It clips this ratio to stay within a trusted range (typically 0.8 to 1.2), avoiding large and risky updates. (Clipping is PPO’s main feature to keep the updates stable and under control!) If the change is too large, PPO ignores the extra gain; if the change could hurt performance, PPO keeps the full penalty. (See the code sketch after this list.)
    ➝ It may also monitor KL divergence to ensure the new policy doesn’t drift too far from the old one.

  • Joint Optimization

    PPO updates three objectives together:

    • Policy improvement via the clipped objective

    • Value function for better reward prediction

    • Entropy bonus to maintain exploration

  • Training Over Epochs

    PPO trains for several rounds, also called epochs, splitting the data into mini-batches and improving the policy and value networks using gradient ascent.

  • Repeat

    As the process repeats, the model gradually gets smarter and more effective.
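
To make the advantage and clipping steps above concrete, here is a minimal PyTorch sketch of those two pieces. It is a simplified illustration under stated assumptions, not PPO’s reference implementation: the function names (`compute_gae`, `ppo_clipped_loss`), tensor shapes, and default hyperparameters are ours, and details like terminal-state masking, the value loss, and the entropy bonus are left out.

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: blends short- and long-term reward signals.

    rewards: tensor of shape (T,) with the reward at each step.
    values:  tensor of shape (T + 1,) from the critic, including a bootstrap value for the last state.
    Simplification: a single trajectory with no terminal-state masking.
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better this step turned out than the critic expected
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective, returned as a loss to minimize."""
    # Probability ratio between the new policy and the old one for each action
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    # With clip_eps = 0.2 the ratio is kept inside the 0.8-1.2 trusted range
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum drops extra gain from overly large ratios but keeps the full penalty
    return -torch.min(unclipped, clipped).mean()
```

With `clip_eps=0.2`, the ratio is confined to the 0.8–1.2 range mentioned above, and the `torch.min` is exactly the “ignore the extra gain, keep the penalty” behavior of the clipped objective.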

This PPO workflow works really well and remains a backbone of many systems, but one serious issue makes it memory-inefficient and drives up compute requirements during training. PPO is an actor-critic style algorithm, so it requires training a separate critic (value) network, which is often the same size as the policy model. It also depends heavily on the quality of that critic, which can be slow to learn and poor at generalizing. On top of that, in long language tasks like step-by-step math solutions, PPO’s token-level updates and value estimates can struggle.

Combine all these aspects and you get a bottleneck for PPO. This is exactly the moment when we strive for efficient use of GPU memory: with the trend of spending more compute at inference time, we need to reduce it in other stages to balance overall efficiency.

What if we could get rid of this extra critic network and replace it with something more effective?

This idea encouraged DeepSeek to create their own alternative to PPO and other methods – Group Relative Policy Optimization (GRPO), an RL algorithm that entirely drops the critic network. GRPO was first introduced in the DeepSeekMath paper in April 2024. Its idea is to judge model outputs relative to each other within a group, rather than against an absolute value estimate. This trick yields a strong learning signal for complex reasoning, removes the overhead of training an extra network, and optimizes memory usage.
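
To see how the group-relative trick replaces the critic, here is a minimal sketch following the DeepSeekMath formulation (the function name and the example rewards are hypothetical): the model samples a group of outputs for one prompt, each output gets a scalar reward, and its advantage is simply how far that reward sits from the group mean, normalized by the group’s standard deviation.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages: compare each output to its own group, no critic needed.

    rewards: tensor of shape (G,), one scalar reward per sampled output for the same prompt.
    """
    mean, std = rewards.mean(), rewards.std()
    # Above-average outputs get positive advantages, below-average ones get negative
    return (rewards - mean) / (std + eps)

# Hypothetical rewards for a group of 4 sampled answers to one math prompt
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
print(group_relative_advantages(rewards))
```

These advantages then feed a PPO-style clipped objective (with a KL term toward a reference model in the DeepSeek setup), which is how GRPO keeps PPO’s stability while dropping the value network entirely.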

So let’s break down why GRPO’s design choices are indeed more effective than PPO’s.

How does GRPO work?
