In 2026, reinforcement learning (RL) is a whole industry with a wide variety of methods that encourage AI models to reason correctly. The landscape is shaped by GRPO (Group Relative Policy Optimization), RLVR (Reinforcement Learning with Verifiable Rewards), critic-free optimization, DPO (Direct Preference Optimization) variants, agentic policy optimization, and test-time diversity methods.

This guide maps the core baselines and the newest methods you can use to train models that reason, verify, search, and improve more efficiently.

TL;DR: Modern reasoning RL is shifting from expensive PPO-style pipelines toward cheaper, critic-free, group-relative, and preference-based methods. GRPO, DPO, DAPO, GSPO, ARPO, VPO, and newer DPO variants define the 2026 toolkit for RLVR, agentic training, and reasoning optimization.

Now to the list!

Core baselines everyone should know

GRPO

GRPO (Group Relative Policy Optimization) is the foundation of the current wave of RLVR and reasoning RL, and by 2026 it is the central reference point. It samples several responses to the same prompt and compares them within that group, so each response’s advantage is its reward relative to the rest of the group. No separate value critic is needed, which makes it cheaper than classic PPO (Proximal Policy Optimization). → Read more
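The group comparison can be sketched in a few lines. This is a minimal illustration, not a training loop: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other samples for the same prompt.

    `rewards` holds one scalar reward per sampled response. `eps` (an
    assumption of this sketch) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get a positive advantage, incorrect ones a negative one.
```

Because the baseline is the group mean, a group where every response gets the same reward yields zero advantage for all of them, which is the situation several later methods in this list try to handle.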

DPO

DPO (Direct Preference Optimization) is already a classic “RLHF without full RL” method. It trains the model directly on preference pairs: a chosen and a rejected response to the same prompt. The update makes the chosen response more likely than the rejected one while keeping the model close to a reference model, usually the original supervised fine-tuned model. DPO is now the main reference point for offline preference optimization because it is simple, stable, and cheaper than PPO-style RLHF. → Read more
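For a single preference pair, the standard DPO objective is the negative log-sigmoid of a scaled log-ratio margin. The sketch below assumes the four inputs are summed sequence log-probabilities that were already computed elsewhere. The function and argument names are illustrative, and `beta=0.1` is just a common starting value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen log-probability relative to the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

The reference terms are what keep the policy anchored: the loss only rewards preferring the chosen response more than the reference model already does.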

REINFORCE++

It matters as a “simple is strong again” method: it’s critic-free policy optimization that updates the model based on the reward for the full generated response, reinforcing more successful trajectories through a normalized advantage. It’s often placed next to GRPO and RLOO as a simple RLVR/RLHF baseline without PPO-level complexity. → Read more

GRPO/RLVR wave of 2026

DAPO

One of the main GRPO-successor methods. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) keeps the GRPO-style group comparison workflow but fixes several practical issues: it decouples the lower and upper clipping bounds, filters out prompts whose sampled responses carry no learning signal, and tunes several rollout-level details. DAPO is reported to score 50 points on AIME 2024 with Qwen2.5-32B, and it comes with an open-source large-scale RL system. → Read more
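Two of these ideas, separate clipping bounds and filtering uninformative prompts, are easy to illustrate in isolation. The sketch below is a simplified reading rather than the DAPO implementation: the function names, the dictionary layout, and the default thresholds are assumptions chosen for the example.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with independent lower and upper bounds.

    A larger eps_high lets low-probability tokens with positive advantage
    grow more than a symmetric clip would allow.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_informative_groups(groups):
    """Drop prompts whose sampled rewards are all identical.

    `groups` maps a prompt id to the list of rewards of its sampled
    responses. Identical rewards give zero group-relative advantage for
    every sample, so such prompts contribute no gradient.
    """
    return {prompt: rewards for prompt, rewards in groups.items() if len(set(rewards)) > 1}

batch = {"p1": [1.0, 1.0, 1.0], "p2": [1.0, 0.0, 0.0], "p3": [0.0, 0.0, 0.0]}
informative = keep_informative_groups(batch)  # only "p2" remains
```

In a real pipeline the filtered prompts would be replaced by freshly sampled ones so the batch stays full. That resampling step is left out here.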

Dr. GRPO

It matters as “GRPO done right” and a fix for token efficiency. Dr. GRPO removes GRPO’s length-related bias by correcting how advantages are normalized across tokens and responses. Instead of dividing by each response’s own length, it normalizes by a fixed constant such as the maximum completion length, so per-token updates no longer depend on response length: short correct answers are not over-rewarded and long incorrect ones are not under-penalized. → Read more
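The length effect is visible with plain arithmetic. The toy functions below only compute the weight each token receives under the two normalization schemes. The names and the `max_len` value are assumptions for illustration, and other parts of the method are left out.

```python
def length_normalized_token_weights(advantage, length):
    """Per-token weight when the loss is averaged over the response's own length."""
    return [advantage / length] * length

def fixed_normalized_token_weights(advantage, length, max_len):
    """Per-token weight when the loss is divided by a fixed constant instead."""
    return [advantage / max_len] * length

# Two incorrect responses with the same advantage but different lengths.
short_wrong = length_normalized_token_weights(-1.0, 10)    # each token gets -0.1
long_wrong = length_normalized_token_weights(-1.0, 100)    # each token gets -0.01

# With a fixed divisor, every token receives the same weight regardless of length.
short_fixed = fixed_normalized_token_weights(-1.0, 10, max_len=100)
long_fixed = fixed_normalized_token_weights(-1.0, 100, max_len=100)
```

Under per-response averaging, each token of the long wrong answer is penalized ten times less than each token of the short one, which can nudge the model toward longer incorrect outputs. The fixed divisor removes that dependence.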

GSPO

A very important shift to sequence-level likelihood ratios. GSPO (Group Sequence Policy Optimization) computes the importance ratio over the whole generated sequence, then clips and optimizes this sequence-level ratio so the update aligns more directly with the final response-level reward. It is reported to be especially helpful for stabilizing Mixture-of-Experts RL training. → Read more
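The difference between token-level and sequence-level ratios can be sketched as follows. The sequence ratio here is length-normalized (a geometric mean of the token ratios), which is our understanding of the GSPO formulation. The function names and the clipping range are illustrative assumptions.

```python
import math

def token_ratios(new_logps, old_logps):
    """One importance ratio per token, as in token-level GRPO/PPO objectives."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """A single length-normalized importance ratio for the whole response."""
    total_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(total_log_ratio / len(new_logps))

def clipped_sequence_objective(new_logps, old_logps, advantage, eps=0.1):
    """Clip the sequence-level ratio, then apply the usual pessimistic minimum."""
    s = sequence_ratio(new_logps, old_logps)
    s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
    return min(s * advantage, s_clipped * advantage)

old = [-1.0, -2.0, -0.5]
new = [-0.9, -2.3, -0.4]
per_token = token_ratios(new, old)    # three ratios, one noisy outlier
whole_seq = sequence_ratio(new, old)  # one ratio shared by every token
```

Because every token in a response shares one ratio, a single token with an extreme likelihood change cannot dominate the update on its own.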

DHPO

A very fresh 2026 method: DHPO (Dynamic Hybrid Policy Optimization) combines GRPO’s token-level ratios to guide local corrections and GSPO’s sequence-level importance ratio to keep the whole-response optimization aligned with the final reward. In the end, GRPO gives you fine-grained credit assignment, GSPO better matches sequence-level rewards, and DHPO tries to get the best of both. → Read more

EP-GRPO

A fresh GRPO variant. It targets some of GRPO’s credit assignment failures: uniform token-level granularity, wrong polarity on reasoning steps, and zero-variance collapse. EP-GRPO (Entropy-Progress Aligned GRPO) tracks entropy changes across reasoning steps and uses this “progress” signal to reweight token advantages, so updates focus more on tokens that actually move the solution forward instead of treating every token equally. → Read more

TR-GRPO

One more GRPO variant that regulates token contributions. TR-GRPO (Token-Regulated GRPO) assigns different weights to tokens based on their estimated contribution to the final reward. This reduces noisy or unhelpful token updates while preserving stronger learning signals for important reasoning/action tokens. → Read more

DPPO

It is a fresh efficiency-focused method for group-based policy optimization. DPPO (Dynamic Pruning Policy Optimization) makes GRPO-style training faster by pruning low-value or redundant rollouts during group-based training, then applying an importance-sampling correction so the faster update remains an unbiased estimate of the original GRPO-style gradient. → Read more

Agentic / test-time search / diversity

ARPO

Very important for agentic and tool-use models. ARPO (Agentic Reinforced Policy Optimization) proposes an RL algorithm designed specifically for multi-turn LLM agents. ARPO samples and optimizes at the agent-step level – across intermediate tool calls, observations, and decisions – and the model learns which actions improve the whole multi-turn trajectory instead of only rewarding the final answer. → Read more

VPO

This is one of the most interesting new methods. VPO (Vector Policy Optimization) trains the model to produce diverse solution sets under different reward vectors, which is important for test-time search, best@k, and pass@k. → Read more

Preference Optimization 2.0

InSPO

InSPO (Intrinsic Self-reflective Preference Optimization) is conceptually interesting: it brings self-reflection directly into preference optimization by conditioning the policy not only on the context, but also on an alternative response. It is a plug-and-play enhancement for DPO-family algorithms. → Read more

TI-DPO

TI-DPO (Token-Importance Guided DPO) is one of the most notable DPO variants. Its premise is that standard DPO is too coarse-grained because not all tokens matter equally. TI-DPO therefore introduces token-importance weights and a triplet loss, so the model can focus more on the parts of the response that actually drive the preference. → Read more

RAPPO

A good fresh DPO variant that uses order-aware preference learning – “keep the best, forget the rest”. RAPPO (Reliable Alignment for Preference PO) ranks multiple candidate responses by preference order, keeps the strongest one as the main positive signal, and downweights or discards weaker alternatives. → Read more

FAQ

What is GRPO in reinforcement learning?

GRPO, or Group Relative Policy Optimization, is a critic-free reinforcement learning method where multiple responses to the same prompt are compared within a group. Instead of training a separate value model, GRPO uses group-relative rewards to estimate advantage, making reasoning RL and RLVR cheaper than PPO-style training.

What is RLVR?

RLVR means reinforcement learning with verifiable rewards. It trains models on tasks where answers can be checked automatically, such as math, coding, logic, or structured reasoning problems. Instead of relying only on human preference labels, RLVR uses rule-based or programmatic verification to reward correct reasoning outcomes.
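A verifiable reward can be as simple as a rule-based check on the final answer. The sketch below assumes responses end with a line of the form `Answer: <value>`. That output convention and the function name are assumptions of this example, not part of any specific RLVR recipe.

```python
import re

def verifiable_reward(response, reference_answer):
    """Return 1.0 if the last 'Answer: ...' line matches the reference, else 0.0."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        # No parseable final answer, so nothing can be verified.
        return 0.0
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0

correct = verifiable_reward("2 + 2 is 4.\nAnswer: 4", "4")  # 1.0
wrong = verifiable_reward("I think it is 5.\nAnswer: 5", "4")  # 0.0
```

Real verifiers are usually stricter, for example normalizing math expressions or running unit tests for code, but the reward still comes from a programmatic check rather than from a human preference label.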

GRPO vs PPO: what is the difference?

PPO usually relies on a value critic to estimate advantages during reinforcement learning. GRPO removes the separate critic and compares responses within a sampled group instead. This makes GRPO simpler and often cheaper for large language model reasoning training, especially when rewards are verifiable.

What are GRPO, DPO, RLVR, DAPO, GSPO, ARPO, and VPO used for?

GRPO is used for cheaper critic-free reasoning RL; DPO for offline preference alignment; RLVR for tasks with verifiable answers like math or coding; DAPO for more stable GRPO-style training; GSPO for sequence-level rewards; ARPO for multi-turn agents and tool use; and VPO for diverse test-time search.

Why do RLVR methods matter for reasoning models?

RLVR methods matter because they help models improve on tasks with objectively checkable answers. They are central to training stronger reasoning models for math, coding, tool use, and multi-step problem solving, where the model needs not only to sound plausible but to reach a correct result.
