In 2026, reinforcement learning (RL) is a whole industry with a wide variety of methods that encourage AI models to reason correctly. The new landscape is shaped by GRPO (Group Relative Policy Optimization), RLVR (Reinforcement Learning with Verifiable Rewards), critic-free optimization, DPO (Direct Preference Optimization) variants, agentic policy optimization, and test-time diversity methods.
This guide maps the core baselines and the newest methods you can use to train models that reason, verify, search, and improve more efficiently.
TL;DR: Modern reasoning RL is shifting from expensive PPO-style pipelines toward cheaper, critic-free, group-relative, and preference-based methods. GRPO, DPO, DAPO, GSPO, ARPO, VPO, and newer DPO variants define the 2026 toolkit for RLVR, agentic training, and reasoning optimization.
Now to the list!
Core baselines everyone should know
GRPO
The foundation of the current wave of RLVR and reasoning RL: a critic-free method with group-relative advantages that is cheaper than classic PPO (Proximal Policy Optimization). By 2026, it is the central reference point. In GRPO (Group Relative Policy Optimization), several responses to the same prompt are compared within a group, and each response's advantage is computed relative to the group's rewards rather than by a separate value critic, which helps reduce compute costs. → Read more
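To make the group comparison concrete, here is a minimal sketch of a group-relative advantage in plain Python. It assumes one scalar reward per sampled response and normalizes by the group's mean and standard deviation; the function name and the `eps` constant are illustrative choices, not taken from any specific library.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to the other responses in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group's rewards
    # eps (an assumed small constant) avoids division by zero
    # when every response in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt: 1.0 if verified correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct responses end up with positive advantages and incorrect ones with negative advantages, with no value model involved. If all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.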
DPO
DPO (Direct Preference Optimization) is already a classic “RLHF without full RL” method: DPO trains the model directly on preference pairs – a chosen and a rejected response to the same prompt. It updates the model so the chosen response becomes more likely than the rejected one, while keeping the model close to the original supervised fine-tuned model. DPO is now the main reference point for offline preference optimization, because it is simple, stable, and cheaper than PPO-style RLHF. → Read more
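The pairwise objective can be sketched for a single preference pair as follows. This assumes you already have summed log-probabilities of each response under the policy and under the frozen reference model; the function name, argument names, and the `beta=0.1` default are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: the loss is lower.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

The reference log-probabilities are what keep the policy anchored: the loss only rewards increasing the chosen response's likelihood relative to the reference, and `beta` controls how strongly deviations count.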
REINFORCE++
It matters as a “simple is strong again” method: it’s critic-free policy optimization that updates the model based on the reward for the full generated response, reinforcing more successful trajectories through a normalized advantage. It’s often placed next to GRPO and RLOO as a simple RLVR/RLHF baseline without PPO-level complexity. → Read more
GRPO/RLVR wave of 2026
DAPO
One of the main GRPO-successor methods. It fixes several practical issues with GRPO: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) keeps the GRPO-style group comparison workflow, but makes training more stable by decoupling the lower and upper clipping bounds, filtering and resampling so that batches contain more informative prompts, and tuning several rollout-level details. DAPO scores 50 points on AIME 2024 with the Qwen2.5-32B base model and comes with an open-source large-scale RL system. → Read more
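Two of the ideas above, separate clipping bounds and filtering for informative prompts, can be sketched in a few lines. This is a simplified illustration, not DAPO's actual implementation: the function names are hypothetical, the `eps_low` and `eps_high` defaults are assumed example values, and the filter simply drops prompts whose sampled rewards are all identical.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with separate lower and upper bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def keep_informative_groups(groups):
    """Keep only prompts whose sampled responses received differing rewards.

    `groups` maps a prompt to the list of rewards of its sampled responses.
    A group with identical rewards yields zero group-relative advantages.
    """
    return {prompt: rewards for prompt, rewards in groups.items()
            if len(set(rewards)) > 1}

groups = {
    "easy prompt": [1.0, 1.0, 1.0, 1.0],   # all correct: no signal
    "useful prompt": [1.0, 0.0, 1.0, 0.0], # mixed: informative
    "hard prompt": [0.0, 0.0, 0.0, 0.0],   # all wrong: no signal
}
print(keep_informative_groups(groups))
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

A larger upper bound leaves more room to raise the probability of tokens with positive advantage, while the lower bound still limits how far probabilities can be pushed down in a single update.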
Dr. GRPO
It matters as “GRPO done right” and a fix for token efficiency: Dr. GRPO fixes GRPO’s length-related bias by correcting how advantages and normalization are computed across tokens and responses. Instead of dividing each response's loss by that response's own length, it normalizes by a fixed constant such as the maximum completion length, so shorter correct answers don’t get artificially larger updates and longer incorrect answers are not penalized less per token. → Read more
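The length effect is easiest to see as per-token arithmetic. The sketch below compares the weight each token receives when a response's loss is divided by its own length versus by a fixed constant. The function names and the constant `MAX_LEN` are illustrative assumptions, and the real objective also includes the likelihood ratio and clipping, which are left out here.

```python
MAX_LEN = 1000  # assumed fixed normalization constant (e.g. max completion length)

def token_weight_own_length(advantage, length):
    """Per-token weight when the loss is averaged over the response's own length."""
    return advantage / length

def token_weight_fixed(advantage, max_len=MAX_LEN):
    """Per-token weight when every response is divided by the same constant."""
    return advantage / max_len

# Two incorrect responses with the same negative advantage.
short_wrong = token_weight_own_length(-1.0, length=10)
long_wrong = token_weight_own_length(-1.0, length=1000)
print(short_wrong, long_wrong)  # the long wrong answer is penalized far less per token

# With a fixed constant, length no longer changes the per-token weight.
print(token_weight_fixed(-1.0), token_weight_fixed(-1.0))
```

Under own-length normalization, padding out a wrong answer dilutes its penalty, which can push the model toward longer incorrect responses. A fixed constant removes that incentive.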
GSPO
A very important shift to sequence-level likelihood ratios. GSPO (Group Sequence Policy Optimization) computes the importance ratio over the whole generated sequence, then clips and optimizes this sequence-level ratio so the update aligns more directly with the final response-level reward. It is especially stable for Mixture-of-Experts RL training. → Read more
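As a rough sketch of the sequence-level idea, the ratio below is computed once per response from per-token log-probabilities and then clipped. It assumes a length-normalized ratio (the geometric mean of the token ratios); the function names and the `eps=0.2` clip range are illustrative values chosen for readability, not recommended settings.

```python
import math

def sequence_ratio(new_token_logps, old_token_logps):
    """Length-normalized importance ratio for one whole response."""
    diffs = [n - o for n, o in zip(new_token_logps, old_token_logps)]
    # exp(mean log-ratio) is the geometric mean of the per-token ratios.
    return math.exp(sum(diffs) / len(diffs))

def sequence_surrogate(new_token_logps, old_token_logps, advantage, eps=0.2):
    """Clipped objective applied to the single sequence-level ratio."""
    ratio = sequence_ratio(new_token_logps, old_token_logps)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

old_logps = [-1.5, -2.5, -0.5]
new_logps = [-1.0, -2.0, 0.0]  # every token became more likely
print(sequence_ratio(new_logps, old_logps))
print(sequence_surrogate(new_logps, old_logps, advantage=1.0))
```

Because a single ratio is clipped per response, all tokens in that response are either kept or clipped together, which matches a reward that is assigned to the response as a whole.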
DHPO
A very fresh 2026 method: DHPO (Dynamic Hybrid Policy Optimization) combines GRPO’s token-level ratios to guide local corrections and GSPO’s sequence-level importance ratio to keep the whole-response optimization aligned with the final reward. In the end, GRPO gives you fine-grained credit assignment, GSPO better matches sequence-level rewards, and DHPO tries to get the best of both. → Read more
EP-GRPO
A fresh GRPO variant. It targets some of GRPO’s credit assignment failures: uniform token-level granularity, wrong polarity on reasoning steps, and zero-variance collapse. EP-GRPO (Entropy-Progress Aligned GRPO) tracks entropy changes across reasoning steps and uses this “progress” signal to reweight token advantages, so updates focus more on tokens that actually move the solution forward instead of treating every token equally. → Read more
TR-GRPO
One more GRPO variant, this one regulating token contributions. TR-GRPO (Token-Regulated GRPO) assigns different weights to tokens based on their estimated contribution to the final reward. This reduces noisy or unhelpful token updates while preserving stronger learning signals for important reasoning/action tokens. → Read more
DPPO
A fresh efficiency-focused method for group-based policy optimization. DPPO (Dynamic Pruning Policy Optimization) makes GRPO-style training faster through dynamic pruning: it prunes low-value or redundant rollouts during group-based training, then uses importance-sampling correction so the faster update still estimates the original GRPO-style gradient without bias. → Read more
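The general principle behind "prune, then correct" can be illustrated with a toy example. This is not DPPO's actual pruning rule: it assumes each rollout is kept independently with a known probability and reweights kept rollouts by the inverse of that probability. The scalar `contributions` stand in for per-rollout gradient terms, and all names and numbers are hypothetical.

```python
from itertools import product

def pruned_estimate(contributions, keep_probs, keep_mask):
    """Sum the kept contributions, each reweighted by 1 / keep probability."""
    return sum(c / p for c, p, kept in zip(contributions, keep_probs, keep_mask) if kept)

def expected_pruned_estimate(contributions, keep_probs):
    """Exact expectation of the pruned estimate over every keep/drop pattern."""
    total = 0.0
    for mask in product([True, False], repeat=len(contributions)):
        prob = 1.0
        for p, kept in zip(keep_probs, mask):
            prob *= p if kept else (1.0 - p)
        total += prob * pruned_estimate(contributions, keep_probs, mask)
    return total

contributions = [0.9, -0.1, 0.05, -0.8]  # stand-ins for per-rollout gradient terms
keep_probs = [1.0, 0.25, 0.25, 1.0]      # low-value rollouts are usually pruned
print(sum(contributions))
print(expected_pruned_estimate(contributions, keep_probs))
```

The two printed values agree up to floating-point error: on average, the reweighted estimate over the pruned set matches the full sum, even though most updates process fewer rollouts.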
Agentic / test-time search / diversity
ARPO
Very important for agentic and tool-use models. ARPO (Agentic Reinforced Policy Optimization) proposes an RL algorithm designed specifically for multi-turn LLM agents. ARPO samples and optimizes at the agent-step level – across intermediate tool calls, observations, and decisions – and the model learns which actions improve the whole multi-turn trajectory instead of only rewarding the final answer. → Read more
VPO
This is one of the most interesting new methods for test-time diversity. VPO (Vector Policy Optimization) trains the model to produce diverse solution sets under different reward vectors, which is important for test-time search, best@k, and pass@k. → Read more
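For readers less familiar with pass@k, here is the standard unbiased estimator commonly used in code-generation evaluation, written in plain Python. It is included only to make the metric concrete and is not part of VPO itself: given `n` sampled solutions of which `c` are correct, it estimates the chance that at least one of `k` randomly drawn samples is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must contain a correct one.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Since pass@k only needs one success among k attempts, a model that spreads its samples across genuinely different solution strategies can score higher than one that repeats its single most likely answer, which is why diversity matters for test-time search.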
Preference Optimization 2.0
InSPO
InSPO (Intrinsic Self-reflective Preference Optimization) is conceptually interesting: it brings self-reflection directly into preference optimization by conditioning the policy not only on the context, but also on an alternative response. It is a plug-and-play enhancement for DPO-family algorithms. → Read more
TI-DPO
TI-DPO (Token-Importance Guided DPO) is one of the most notable DPO variants. Standard DPO is too coarse-grained because not all tokens matter equally, so TI-DPO introduces token-importance weights and a triplet loss that let the model focus more on the parts of the response that actually drive the preference. → Read more
RAPPO
A good fresh DPO variant that uses order-aware preference learning – “keep the best, forget the rest”. RAPPO (Reliable Alignment for Preference PO) ranks multiple candidate responses by preference order, keeps the strongest one as the main positive signal, and downweights or discards weaker alternatives. → Read more
FAQ
What is GRPO in reinforcement learning?
GRPO, or Group Relative Policy Optimization, is a critic-free reinforcement learning method where multiple responses to the same prompt are compared within a group. Instead of training a separate value model, GRPO uses group-relative rewards to estimate advantage, making reasoning RL and RLVR cheaper than PPO-style training.
What is RLVR?
RLVR means reinforcement learning with verifiable rewards. It trains models on tasks where answers can be checked automatically, such as math, coding, logic, or structured reasoning problems. Instead of relying only on human preference labels, RLVR uses rule-based or programmatic verification to reward correct reasoning outcomes.
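A verifiable reward can be as simple as a rule-based checker. The sketch below assumes a math-style task where the final answer is the last number in the response; the function names and the extraction rule are illustrative, and real verifiers are usually stricter about answer formats.

```python
import re

def extract_final_answer(response):
    """Return the last number that appears in the response, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None

def verifiable_reward(response, expected):
    """1.0 if the extracted final answer equals the expected value, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(expected) else 0.0

print(verifiable_reward("2 + 2 = 4, so the answer is 4.", "4"))
print(verifiable_reward("I think the answer is 5.", "4"))
print(verifiable_reward("I am not sure.", "4"))
```

The reward depends only on whether the result checks out, not on how convincing the reasoning sounds, so it can be computed automatically for every sampled response.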
GRPO vs PPO: what is the difference?
PPO usually relies on a value critic to estimate advantages during reinforcement learning. GRPO removes the separate critic and compares responses within a sampled group instead. This makes GRPO simpler and often cheaper for large language model reasoning training, especially when rewards are verifiable.
What are GRPO, DPO, RLVR, DAPO, GSPO, ARPO, and VPO used for?
GRPO is used for cheaper critic-free reasoning RL; DPO for offline preference alignment; RLVR for tasks with verifiable answers like math or coding; DAPO for more stable GRPO-style training; GSPO for sequence-level optimization that matches response-level rewards; ARPO for multi-turn agents and tool use; and VPO for diverse test-time search.
Why do RLVR methods matter for reasoning models?
RLVR methods matter because they help models improve on tasks with objectively checkable answers. They are central to training stronger reasoning models for math, coding, tool use, and multi-step problem solving, where the model needs not only to sound plausible but to reach a correct result.

