Quick answer: What is GRPO?

GRPO (Group Relative Policy Optimization) is a reinforcement learning approach used to refine model behavior by optimizing relative preference signals rather than only next-token likelihood. The practical question is not the acronym, but what signal you optimize and what breaks when you optimize it. This guide explains GRPO at a high level and gives an operator checklist to judge whether gains are meaningful or mostly benchmark noise.

Innovation doesn’t always mean starting from scratch – it often means rethinking the fundamentals. That’s the case with DeepSeek’s Group Relative Policy Optimization (GRPO), a novel twist on traditional reinforcement learning methods like PPO. GRPO powers models such as DeepSeek-R1 and DeepSeek-R1-Zero, helping to adapt reinforcement learning (RL) specifically for large language models (LLMs). As a result, it’s become one of the hottest AI topics of 2025.

GRPO was created as an alternative to the widely used Proximal Policy Optimization (PPO), which requires a value (critic) model that roughly doubles memory and compute. GRPO gets rid of the critic, letting a model learn from comparisons among its own outputs. This makes training cheaper and faster. GRPO shines in reasoning-heavy tasks like math and coding, and supports training models to perform long Chain-of-Thought (CoT) reasoning. After the DeepSeek-R1 breakthrough, a wave of GRPO variants appeared in many studies, and one recent development, Flow-GRPO, finally allows online RL to be applied to flow models (meaning: images!).

So let’s break down what makes GRPO so special and efficient, how it works, and how it can be adapted to different modalities, using flow models and Flow-GRPO as an example. Super interesting!


In today’s episode, we will cover:

  • Why do we need GRPO? PPO limitations

  • GRPO vs PPO: Key Differences

  • How does GRPO work?

  • Why does GRPO win?

  • Implementation: DeepSeek-R1 and DeepSeek-R1-Zero

  • What is Flow-GRPO?

  • Limitations of GRPO

  • Conclusion

  • Sources and further reading

Why do we need GRPO? PPO limitations

To thoroughly understand what GRPO is, we need to look at what it rivals. In 2017, OpenAI introduced Proximal Policy Optimization (PPO), an RL algorithm used to teach models and agents how to make good decisions by interacting with an environment. Its general goal is to optimize a model and align its outputs with desired behaviors. PPO later became a classic, widely used in robotics, autonomous systems, training agents to play games like Atari, and more. What’s the secret?

Stability. PPO improves an agent’s behavior, or policy, but it does this in small steps, avoiding big, risky changes:

  • Data Collection
    The agent interacts with the environment and collects data – actions taken, states visited, and rewards received.

  • Advantage Calculation
    PPO uses a value function to estimate how good each action was. Specifically, it calculates the advantage (actual reward – expected reward).
    Generalized Advantage Estimation (GAE) helps smooth this by combining short- and long-term reward signals.

  • Policy Update via Clipped Objective
    PPO compares the new policy to the old one using a probability ratio. To ensure stability:
    ➝ It clips this ratio to stay within a trusted range (typically 0.8 to 1.2), avoiding large and risky updates. (Clipping is PPO’s main feature to keep the updates stable and under control!) If the change is too large, PPO ignores the extra gain; and if the change can hurt performance, PPO keeps the penalty.
    ➝ It may also monitor KL divergence to ensure the new policy doesn’t drift too far from the old one.

  • Joint Optimization

    PPO updates three objectives together:

    • Policy improvement via the clipped objective

    • Value function for better reward prediction

    • Entropy bonus to maintain exploration

  • Training Over Epochs

    PPO runs for several rounds, also called epochs, of training, splitting the data into mini-batches and improving the policy and value networks using gradient ascent.

  • Repeat

    As the process repeats, the model gradually gets smarter and more effective.
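The clipping rule from the "Policy Update via Clipped Objective" step can be sketched for a single sample as follows. This is a minimal illustration, not a full PPO implementation: the function name and the scalar inputs are assumptions made here, and `eps=0.2` corresponds to the 0.8 to 1.2 band mentioned above.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective (a quantity to maximize).

    ratio     -- pi_new(action) / pi_old(action), the probability ratio
    advantage -- estimated advantage of the action
    eps       -- clip range; 0.2 gives the 0.8..1.2 band
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum makes the objective pessimistic:
    # gains beyond the band are ignored, penalties are kept in full.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Good action whose probability grew too much: the gain is capped.
    print(clipped_surrogate(1.5, 2.0))
    # Bad action that became more likely: the full penalty is kept.
    print(clipped_surrogate(1.5, -2.0))
    # Bad action already pushed down past the band: no extra credit.
    print(clipped_surrogate(0.5, -2.0))
```

In a real trainer this quantity is averaged over a mini-batch and combined with the value loss and entropy bonus listed under "Joint Optimization".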

This PPO workflow works really well and remains a backbone of many systems, but one serious drawback makes it memory-inefficient and increases compute requirements during training. PPO is an actor-critic algorithm: it requires training a separate critic (value) network that is often the same size as the policy model. It also depends heavily on the quality of the critic, which can be slow to learn and poor at generalizing. In addition, in long language tasks like step-by-step math solutions, PPO’s token-level updates and value estimates can struggle.

Combine all these aspects and you get a bottleneck for PPO. GPU memory is at a premium, and as more compute shifts to inference, training needs to get cheaper to keep overall efficiency in balance.

What if we can get rid of this extra critic network and find a way to replace it with something more effective?

This idea encouraged DeepSeek to create their own alternative to PPO and other methods: Group Relative Policy Optimization (GRPO), an RL algorithm that drops the critic network entirely. GRPO was first introduced in the DeepSeekMath paper in February 2024. Its idea is to judge model outputs relative to each other within a group rather than against an absolute value estimate. This trick yields a strong learning signal for complex reasoning, removes the overhead of training an extra network, and reduces memory usage.

GRPO vs PPO: Key Differences

So the main difference between GRPO and PPO is how they estimate advantage. PPO uses a separate critic, or value model, to estimate how good each action or token is. However, that critic is often large and expensive, which increases memory and compute cost, and it is hard to make reliable for long reasoning traces.

GRPO removes this whole part. It samples multiple answers to the same prompt, scores them within the same group, and checks which answers are better than the group average. That relative comparison becomes the learning signal. This is why GRPO is so attractive for reasoning models: it keeps PPO’s clipped, stable update idea, but avoids training a separate value model, reducing memory pressure and making RL more practical for math, coding, and long Chain-of-Thought generation.
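The group comparison can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's code: the function name, the small `eps` guard against division by zero, and the choice of population standard deviation are assumptions made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Turn the raw rewards of one group (same prompt) into relative advantages.

    Each reward is compared with the group mean and scaled by the group
    standard deviation, so no learned value model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt, scored 1.0 (correct) or 0.0 (wrong).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers above the group average get a positive advantage and are reinforced; answers below it get a negative one.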

So let’s break down what exactly makes GRPO’s design more effective than PPO’s.

How does GRPO work?

Here is what GRPO’s step-by-step process looks like:

  1. An old policy model generates multiple answers for each question.

  2. A reward model gives a score to each answer. These scores are then normalized by subtracting the group’s average and dividing by its standard deviation, so you can see how each answer compares to the others in the group. In other words, it’s a relative reward for each output.

  3. Instead of using a value function, the normalized reward simply becomes each answer’s advantage.

  4. GRPO inherits the clipping mechanism from PPO. To make sure the new model doesn’t drift too far from its starting behavior, GRPO also adds a KL divergence penalty, but directly to the loss rather than to the reward. This simplifies advantage estimation.
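Written out, the four steps correspond to an objective of roughly the following form. It follows the notation of the DeepSeekMath paper; the symbols (G for group size, o_i for the i-th sampled output, ρ for the probability ratio, ε for the clip range, β for the KL weight) do not appear in this article and are introduced here only for illustration.

```latex
\mathcal{J}_{\text{GRPO}}(\theta) =
\mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)}
\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \Big(
    \min\big( \rho_{i,t}\, \hat{A}_{i,t},\
              \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_{i,t} \big)
    - \beta\, \mathbb{D}_{\text{KL}}\big[ \pi_\theta \,\|\, \pi_{\text{ref}} \big]
  \Big)
\right]

\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}
                  {\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
                     {\operatorname{std}(r_1, \dots, r_G)}
```

The min/clip term is the part inherited from PPO, the advantage is the normalized group reward from steps 2 and 3, and the KL term sits inside the objective rather than inside the reward, as step 4 describes.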

Image Credit: DeepSeekMath paper

GRPO can assign rewards in two ways, and adds an iterative variant to stay effective:

  • Outcome supervision: the reward model gives a score only at the end of each output. That normalized score is used as the advantage for every token in the output.

  • Process supervision: the reward model scores each reasoning step in the output. The advantage for each token is the sum of the normalized scores of the current and following steps. This encourages the model to stay on a good reasoning path.

  • Iterative GRPO: During training, the original reward model might fall behind the improved LLM, so it is retrained with new outputs from the updated LLM. 10% of old data is reused to stabilize learning. The reference model, which is used for KL penalty, is also updated along with the policy.
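The two ways of turning scores into token-level advantages (one score at the end of the output versus one score per reasoning step) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names, the `step_ends` representation (index of the last token of each step), and the use of population standard deviation are choices made here, not taken from the paper's code.

```python
from statistics import mean, pstdev


def outcome_advantages(final_rewards, lengths, eps=1e-8):
    """One score per output: every token of output i shares the same
    normalized reward as its advantage."""
    mu, sigma = mean(final_rewards), pstdev(final_rewards)
    return [[(r - mu) / (sigma + eps)] * n
            for r, n in zip(final_rewards, lengths)]


def process_advantages(step_rewards, step_ends, lengths, eps=1e-8):
    """One score per reasoning step.

    step_rewards[i][k] -- score of step k in output i
    step_ends[i][k]    -- index of the last token of that step
    Token t receives the sum of normalized scores of all steps that end
    at or after t, i.e. its own step and the steps that follow.
    """
    flat = [r for rewards in step_rewards for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)  # normalize over the whole group
    result = []
    for rewards, ends, n in zip(step_rewards, step_ends, lengths):
        norm = [(r - mu) / (sigma + eps) for r in rewards]
        result.append([
            sum(x for x, end in zip(norm, ends) if end >= t)
            for t in range(n)
        ])
    return result


if __name__ == "__main__":
    print(outcome_advantages([1.0, 0.0], lengths=[3, 2]))
    print(process_advantages([[1.0, 0.0], [0.0, 0.0]],
                             step_ends=[[1, 3], [0, 2]],
                             lengths=[4, 3]))
```

With process supervision, early tokens accumulate credit from every later step, which is why a good first step followed by weak steps still ends up with a lower advantage than a uniformly good chain.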

Overall, GRPO vs PPO is mostly about a design choice: learn a baseline with a critic (PPO) or estimate a baseline from a group of current policy samples (GRPO). But…

Why Is GRPO More Efficient Than PPO?

When first used to fine-tune the DeepSeekMath-Instruct 7B model on CoT-format math questions only, GRPO lifted the model to 88.2% accuracy on GSM8K and 51.7% on MATH. The resulting DeepSeekMath-RL 7B outperformed even larger open-source models and many closed-source ones.

Image Credit: DeepSeekMath paper

Overall, GRPO shows itself as a more efficient approach than PPO thanks to its smarter workflow.

To summarize, here are GRPO’s main advantages:

  • It doesn’t need a value function.

  • GRPO uses group-based rewards, which are easier to calculate.

  • It adds the KL penalty directly to the loss, while PPO adds it to the reward signal. Because of this, advantage estimation becomes simpler, clearer, and less error-prone.

  • As it reduces memory use and allows faster and easier training, it’s a cheaper approach.

  • GRPO is designed specifically for LLMs, which makes it highly practical.
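To make the "KL penalty in the loss" point concrete, here is a small per-token sketch. It uses the always-non-negative estimator ratio − log(ratio) − 1, with ratio = π_ref / π_θ, which is the form described in the DeepSeekMath paper. The function names, the scalar inputs, and the `beta` value are illustrative assumptions.

```python
import math


def kl_estimate(logp_policy, logp_ref):
    """Per-token KL estimate between the policy and the reference model.

    Computes ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta.
    The result is zero when both models agree and positive otherwise.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_token_loss(ratio, advantage, logp_policy, logp_ref,
                    eps=0.2, beta=0.04):
    """Loss for one token: clipped surrogate minus a KL penalty, negated
    so that it can be minimized. beta is an illustrative value."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # The KL term enters the loss here; the advantage itself is untouched.
    return -(surrogate - beta * kl_estimate(logp_policy, logp_ref))


if __name__ == "__main__":
    print(kl_estimate(-1.0, -1.0))   # identical log-probs: no penalty
    print(kl_estimate(-1.0, -2.0))   # policy drifted from the reference
    print(grpo_token_loss(1.1, 0.5, logp_policy=-1.0, logp_ref=-1.2))
```

Because the penalty is a separate term, the advantage stays a pure function of the group's rewards.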

That’s why DeepSeek continued to use the GRPO algorithm in later work, and now we have DeepSeek-R1, one of the best reasoning models so far, which also employs this learning approach.

Why DeepSeek Chose GRPO for DeepSeek-R1

DeepSeek-R1’s success made GRPO a hot topic, as it showed that a carefully built RL strategy can lead to high-level reasoning capabilities in models.

DeepSeek-R1-Zero, which was trained entirely with RL and with no supervised fine-tuning (SFT), showed emergent behaviors like self-reflection, re-evaluation, and long CoT reasoning. Notably, instead of using a learned reward model, GRPO in DeepSeek-R1-Zero used rule-based rewards, including:

  • Accuracy reward: Evaluates the correctness of the response.

  • Format reward: Evaluates how well the model follows the required answer structure.
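A rule-based reward of this kind needs no neural network at all. The sketch below is a hypothetical illustration: the tag layout (`<think>…</think>` followed by `<answer>…</answer>`), the exact-match answer check, and the simple sum of both rewards are assumptions made here, not DeepSeek's actual scoring rules.

```python
import re

# Assumed response layout: reasoning in <think> tags, then the final answer.
RESPONSE_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(response):
    """1.0 if the response follows the required structure, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response, reference):
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def total_reward(response, reference):
    # Assumption: the two rewards are simply added.
    return accuracy_reward(response, reference) + format_reward(response)


if __name__ == "__main__":
    good = "<think>2 + 2 is 4</think> <answer>4</answer>"
    print(total_reward(good, "4"))
    print(total_reward("The answer is 4", "4"))
```

Real checkers for math or code are stricter (symbolic comparison, running test cases), but the principle is the same: the score comes from a deterministic rule.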

The legendary DeepSeek-R1 overcame DeepSeek-R1-Zero’s issues, such as poor readability and language mixing, with a smarter training scheme: SFT → GRPO → SFT → GRPO. Here is what each stage involves:

  • Stage 1: Cold-start SFT of the model on high-quality long CoT examples with summaries.

  • Stage 2: GRPO for reasoning-focused RL. The reward combines an accuracy reward with a language consistency reward that discourages mixing languages in a response.

  • Stage 3: Rejection sampling on the model’s outputs to collect a broader supervised dataset, followed by a second round of SFT.

  • Stage 4: A final GRPO round across both reasoning and general scenarios to improve the model’s accuracy, safety, and helpfulness.
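The rejection sampling used to build the supervised dataset can be pictured as "sample several candidates, keep only the ones that pass a check". The sketch below is a toy illustration: `generate` and `is_acceptable` stand in for the model and the filter, and both toy implementations are hypothetical.

```python
def rejection_sample(prompts, generate, is_acceptable, samples_per_prompt=4):
    """Collect (prompt, response) pairs, keeping only accepted candidates."""
    dataset = []
    for prompt in prompts:
        for k in range(samples_per_prompt):
            candidate = generate(prompt, k)
            if is_acceptable(prompt, candidate):
                dataset.append({"prompt": prompt, "response": candidate})
    return dataset


def toy_generate(prompt, k):
    """Hypothetical stand-in for a model: odd samples are off by one."""
    a, b = (int(part) for part in prompt.split("+"))
    return str(a + b + (k % 2))


def toy_is_acceptable(prompt, response):
    """Hypothetical filter: accept only the correct sum."""
    a, b = (int(part) for part in prompt.split("+"))
    return response == str(a + b)


if __name__ == "__main__":
    data = rejection_sample(["2+3", "10+5"], toy_generate, toy_is_acceptable)
    for row in data:
        print(row)
```

The surviving pairs form the supervised dataset for the next fine-tuning round.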

Indeed, this recipe produced the top-tier open model DeepSeek-R1, which competes with the closed OpenAI-o1-1217.

Image Credit: DeepSeek-R1 original paper

However, it's not just DeepSeek using their smart GRPO algorithm. After the boom of DeepSeek-R1 (which has continued since January 2025), many other researchers started to adapt GRPO in their own work more actively. Here are some examples:

  • Hybrid GRPO combines the strengths of PPO and GRPO. Like GRPO, it samples multiple actions per state to estimate how good each option is, but it keeps the value function from PPO to provide a stable learning signal. It aims for a balance: less bias than PPO and less variance than GRPO.

  • Multi-Objective GRPO extends GRPO to align LMs with multiple goals, like safety, politeness, and usefulness, by using a reward model that outputs separate scores for each aspect.

  • GRPO-LEAD is more strict than GRPO, as it adds a length-based reward that prefers short, correct answers; an explicit penalty for wrong answers; and a difficulty-aware weighting, so the model learns more from hard problems. It is used specifically to make models better at solving math problems.

  • DanceGRPO adapts GRPO to visual generation tasks, including video creation. It samples multiple images or videos, scores them using human-aligned reward models, like CLIP, computes relative advantages, and updates the model using a clipped objective like PPO. DanceGRPO works even with binary or sparse rewards.

  • UnifiedReward-Think is a unified multimodal reward model that improves visual understanding and generative capabilities using explicit long CoT reasoning and GRPO. It follows a three-stage training pipeline: 1) cold start using GPT-4o-distilled CoT data; 2) rejection sampling to reinforce correct reasoning; and 3) GRPO to fine-tune reasoning on incorrect samples. GRPO here uses verifiable rewards to guide the model through diverse reasoning paths.

However, the most popular recent implementation of GRPO is in flow-matching models.

What is Flow-GRPO?

Along with diffusion models, flow matching models are great at generating high-quality images. However, flow models use a deterministic process with no randomness, which doesn’t work well with RL. Flow-GRPO, developed by a group of researchers from CUHK MMLab, Kuaishou Technology, Nanjing University and others, finally brings online RL to flow models. It works on the principle that the more images the model generates, the more it improves.

By treating denoising as a step-by-step decision process, the researchers could apply RL techniques to help the model learn to generate clearer, more accurate images.

Flow-GRPO makes it possible to fit GRPO into flow models using two strategies:

Image Credit: Flow-GRPO original paper

  1. ODE-to-SDE conversion: It adds controlled randomness to apply RL to flow matching models.

    Generally, flow models use a deterministic process based on an Ordinary Differential Equation (ODE). They follow fixed paths to generate images with no randomness, so you can’t calculate the probabilities that GRPO needs. Flow-GRPO transforms the deterministic ODE into a Stochastic Differential Equation (SDE). Here’s how it works:

    • A noise term is added to each generation step.

    • The noise is controlled and added in such a way that the overall distribution of images at each time step, called the marginal distribution, stays the same, while the flow model can now explore and improve using RL.

    This makes it possible to compute probabilities and KL divergence required for GRPO to work.

  2. Denoising reduction: It’s a special trick to make training faster without hurting performance. As each image requires many denoising steps, here’s a solution:

    • During training, reduce the number of steps (for example, from 40 to 10).

    • During testing and inference, use the full number of steps.

    As a result, models can train faster, using fewer resources.
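Why added noise makes probabilities computable can be seen in a toy one-dimensional sampler. This is a deliberately simplified sketch, not the Flow-GRPO update rule: it uses a scalar state, a hypothetical constant noise level `sigma`, and it omits the drift correction that Flow-GRPO applies to keep the marginal distributions unchanged.

```python
import math
import random


def ode_step(x, velocity, dt):
    """Deterministic Euler step: the next state is fully determined,
    so there is no transition probability to work with."""
    return x + velocity * dt


def sde_step(x, velocity, dt, sigma, rng):
    """Stochastic step: same drift plus Gaussian noise.

    Returns the sampled next state together with the mean and standard
    deviation of the Gaussian it was drawn from.
    """
    mu = x + velocity * dt
    std = sigma * math.sqrt(dt)
    x_next = mu + std * rng.gauss(0.0, 1.0)
    return x_next, mu, std


def gaussian_log_prob(x_next, mu, std):
    """Log-density of the transition, the quantity needed for
    probability ratios and KL terms."""
    return (-0.5 * ((x_next - mu) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


if __name__ == "__main__":
    rng = random.Random(0)
    x_next, mu, std = sde_step(1.0, velocity=2.0, dt=0.1, sigma=0.5, rng=rng)
    print(ode_step(1.0, 2.0, 0.1), x_next, gaussian_log_prob(x_next, mu, std))
```

Each stochastic step now has a well-defined log-probability, which is what a policy-gradient method like GRPO needs in order to compare the new and old policies.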

Flow-GRPO was tested on text-to-image tasks and showed impressive results:

  • On GenEval, the model’s accuracy jumped from 63% to 95%.

    Image Credit: Flow-GRPO original paper

  • On text rendering, it improved from 59% to 92% with KL regularization and 93% without KL, generating text in images much more clearly.

  • The SD-3.5-M model trained with Flow-GRPO even outperformed GPT-4o on some visual benchmarks.

  • Flow-GRPO achieved scores of 23.31 (with KL) and 23.41 (without KL) in preference alignment, compared to 21.7 for the original model.

  • Interestingly, scores with KL regularization are slightly lower than without it. Without KL, models are a little more accurate and align a bit better with human preferences, but this comes at the cost of output diversity.

    Image Credit: Flow-GRPO original paper

  • Denoising reduction strategy led to a 4× speedup in training time.

  • Another notable point is that Flow-GRPO avoided reward hacking, where a model learns to “game” the scoring system at the cost of quality or diversity.

However, scaling Flow-GRPO to video generation raises some questions: which reward models will be effective for videos, how to efficiently optimize for multiple objectives such as realism and smoothness, and how to handle the higher resource requirements. Still, video generation with flow models is the next stage for GRPO.

Flow-GRPO is a cool example of how GRPO can be extended to models that were not originally designed for RL. While GRPO is efficient and has demonstrated huge potential through its success with DeepSeek’s top R1 model, it also has some limitations and areas where it can grow.

Meta's Code World Model also builds on GRPO, modifying it for multi-turn coding and software engineering tasks.

Limitations of Group Relative Policy Optimization

  • Sample inefficiency: GRPO needs a whole group of samples per prompt to establish its baseline. Samples whose reward is close to the group average receive near-zero advantage, so they provide little gradient signal but still require compute.

  • Dependency on reward model: Like other RL-based fine-tuning, GRPO is only as good as its reward function or model. If the reward model is biased, the policy will optimize for those flaws.

  • DeepSeek-R1-Zero’s experience shows that GRPO used alone might not be enough: it may need to be balanced with other training methods to ensure the outputs remain human-friendly.

  • In RL settings where each sample outcome is expensive or time-consuming, such as interacting with a real environment or an external system, GRPO’s requirement for groups of samples per step can be less practical.
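One concrete case of wasted compute follows directly from the mean-centered group advantage: when every answer in a group receives the same reward (all correct or all wrong), every advantage is zero and the group contributes no gradient. The helper below counts such samples. The function name and the binary rewards are illustrative assumptions.

```python
def samples_without_signal(groups):
    """Count samples that sit in groups where all rewards are identical.

    In such a group, reward minus group mean is zero for every sample,
    so none of them produces a learning signal.
    """
    wasted = 0
    for rewards in groups:
        if max(rewards) == min(rewards):
            wasted += len(rewards)
    return wasted


if __name__ == "__main__":
    groups = [
        [1, 1, 1, 1],  # prompt too easy: every answer correct
        [0, 0, 0, 0],  # prompt too hard: every answer wrong
        [1, 0, 0, 1],  # mixed results: the only group with a signal
    ]
    total = sum(len(g) for g in groups)
    print(samples_without_signal(groups), "of", total, "samples carry no signal")
```

Here 8 of 12 generated samples are paid for but never move the policy, which is why prompt difficulty matters so much for GRPO training runs.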

GRPO's Role in the Future of LLM Reasoning

GRPO is a smart algorithm that enables effective learning in models. Today we’ve discussed how it works in different implementations, including the basic version proposed by DeepSeek, and how it was successfully applied to one of the most sensational models, DeepSeek-R1.

The Flow-GRPO example is unique because it demonstrates how to apply reinforcement learning to models that are not naturally suited to it. If developers can do this, then GRPO can be expanded to many more fields, and we’re not at the peak of its potential yet. And one more thing: the DeepSeek-R1 case shows that GRPO works better in combination with SFT. So it’s a good hint that strategically mixing GRPO with other training methods may be the key to future breakthroughs like DeepSeek-R1.

Sources and further reading

Resources from the Turing Post
