11 New Interesting Policy Optimization Techniques
Policy optimization is one of the most exciting topics for the AI community right now. Why? Mostly because of the popularity of reinforcement learning (RL). Policy optimization is how RL is applied in practice: you directly train the policy (the model’s behavior) using rewards. The most widely used methods are PPO (Proximal Policy Optimization), GRPO (Group Relative Policy Optimization), and their extensions, and there is always more to learn because new approaches appear almost every week.
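Most of the methods below extend the same two building blocks: PPO’s clipped surrogate loss and GRPO’s group-normalized advantage. Here is a minimal sketch of both, using only the standard textbook formulas (variable names and toy numbers are illustrative):

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO surrogate: clip the probability ratio so a single update can't move the policy too far."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    return -np.minimum(ratio * advantages, clipped * advantages).mean()

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO: sample several completions per prompt and normalize rewards within the group,
    so no learned value model is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy usage: 4 completions for one prompt, scalar rewards from some reward model.
adv = grpo_advantages([0.1, 0.7, 0.4, 0.9])
loss = ppo_clipped_loss(logp_new=np.array([-1.1, -0.9, -1.0, -0.8]),
                        logp_old=np.array([-1.2, -1.0, -1.0, -0.9]),
                        advantages=adv)
print(adv, loss)
```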
So let’s look at 11 quite recent policy optimization methods:
Group reward-Decoupled Normalization Policy Optimization (GDPO) by NVIDIA keeps multiple reward signals separate during learning, preventing them from collapsing into a single blended signal, which helps models learn user preferences more accurately and stably. → Read more
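A rough sketch of the decoupling idea as I read it: normalize each reward channel within the group separately, then combine, instead of summing raw rewards into one scalar first (the exact GDPO formulation and weighting are in the paper):

```python
import numpy as np

def decoupled_group_advantages(rewards, eps=1e-8):
    """rewards: (group_size, num_channels), e.g. helpfulness, brevity, safety scores.
    Normalize each channel within the group on its own, then average the normalized
    channels, so one high-variance reward can't drown out the others."""
    r = np.asarray(rewards, dtype=float)
    per_channel = (r - r.mean(axis=0)) / (r.std(axis=0) + eps)   # (G, C)
    return per_channel.mean(axis=1)                              # (G,)

# Toy group of 4 completions: channel 1 has a much larger scale than channel 0.
rewards = [[0.9, 10.0], [0.8, 90.0], [0.1, 50.0], [0.2, 70.0]]
print(decoupled_group_advantages(rewards))
```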
Agentic Turn-based Policy Optimization via Tree Search (AT²PO) from Tencent is a way to train AI agents on multi-step tasks by organizing decisions into turns, exploring options with tree search, and assigning rewards per turn so learning becomes more stable and effective. → Read more
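A toy sketch of the turn-level tree idea: propose a few candidate actions per turn, expand each branch, and credit every turn with its best downstream return (the paper’s actual selection and credit-assignment rules will differ; the helper names here are made up):

```python
import random

def turn_tree_rollout(env_step, propose_actions, init_state, depth, branching):
    """Expand a small tree over turns and return the best path with per-turn rewards."""
    def expand(state, turn):
        if turn == depth:
            return 0.0, []
        best_return, best_path = float("-inf"), []
        for action in propose_actions(state, branching):
            next_state, reward = env_step(state, action)
            future, path = expand(next_state, turn + 1)
            if reward + future > best_return:
                best_return, best_path = reward + future, [(turn, action, reward)] + path
        return best_return, best_path
    return expand(init_state, 0)

# Toy environment: the state is an integer, an action adds to it, the reward is the increment.
env_step = lambda s, a: (s + a, float(a))
propose = lambda s, k: random.sample(range(1, 6), k)
total, per_turn = turn_tree_rollout(env_step, propose, init_state=0, depth=3, branching=2)
print(total, per_turn)   # (turn, action, reward) records give turn-level credit for the update
```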
Bottom-up Policy Optimization (BuPO) applies reinforcement learning directly to lower Transformer layers, using their intermediate token distributions to shape early exploration and reasoning, rather than only optimizing final outputs. → Read more
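A minimal sketch of reading a policy out of a lower layer, assuming a logit-lens-style projection through the output head (the paper may define the intermediate distributions differently):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def intermediate_layer_logprobs(hidden_states, unembed, layer, token_ids):
    """Project an intermediate layer's hidden states through the unembedding matrix,
    then take the log-probs of the sampled tokens; these could feed an RL objective
    instead of (or alongside) the final-layer log-probs."""
    logits = hidden_states[layer] @ unembed                      # (seq, vocab)
    logp = np.log(softmax(logits) + 1e-12)
    return logp[np.arange(len(token_ids)), token_ids]

# Toy shapes: 4 layers, a 5-token sequence, hidden size 8, vocab size 11.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 5, 8))
unembed = rng.normal(size=(8, 11))
tokens = rng.integers(0, 11, size=5)
print(intermediate_layer_logprobs(hidden, unembed, layer=1, token_ids=tokens))
```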
VA-π (Variational Policy Alignment for Pixel-Aware Autoregressive Generation) is a post-training method that aligns autoregressive image generators with pixel quality. It treats generation as a policy and uses image reconstruction quality as a direct reinforcement learning reward. → Read more
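The core loop can be sketched with a generic policy-gradient update driven by a pixel-level reconstruction reward (the paper’s variational objective is more involved; this only shows the reward-as-pixel-quality idea):

```python
import numpy as np

def reconstruction_reward(generated, target):
    """Pixel-aware reward: negative mean squared error between the decoded image and the target."""
    return -float(np.mean((generated - target) ** 2))

def policy_gradient_loss(token_logps, reward, baseline=0.0):
    """Treat the autoregressive generator as a policy: weight the sequence log-prob
    by the (baselined) reconstruction reward, REINFORCE-style."""
    return -(reward - baseline) * float(np.sum(token_logps))

# Toy 8x8 "images" and 16 token log-probs for the generated sequence.
rng = np.random.default_rng(1)
target = rng.random((8, 8))
generated = target + 0.05 * rng.standard_normal((8, 8))
reward = reconstruction_reward(generated, target)
print(reward, policy_gradient_loss(rng.normal(-1.0, 0.1, size=16), reward))
```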
Puzzle Curriculum GRPO (PC-GRPO) is a way to train vision-language models using self-supervised visual puzzles instead of labels, with a difficulty-based curriculum that gives more stable rewards and helps models reason more consistently. → Read more
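A sketch of the self-supervised puzzle task and a simple difficulty curriculum (the grid sizes, reward definition, and schedule are all assumptions; the rewards would feed a standard GRPO update):

```python
import numpy as np

def make_puzzle(image, grid):
    """Cut an image into grid x grid patches and shuffle them; the task is to predict
    the permutation that restores the original order, so no human labels are needed."""
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[i*h:(i+1)*h, j*w:(j+1)*w] for i in range(grid) for j in range(grid)]
    perm = np.random.permutation(len(patches))
    return [patches[p] for p in perm], perm

def puzzle_reward(predicted_perm, true_perm):
    """Dense, verifiable reward: fraction of patches placed in the correct slot."""
    return float(np.mean(np.asarray(predicted_perm) == np.asarray(true_perm)))

# Difficulty-based curriculum: easy 2x2 puzzles first, then 3x3 and 4x4.
curriculum = [2, 2, 3, 3, 4]
image = np.random.default_rng(2).random((24, 24))
for step, grid in enumerate(curriculum):
    _, true_perm = make_puzzle(image, grid)
    guess = np.random.permutation(len(true_perm))        # stand-in for the VLM's answer
    print(step, grid, puzzle_reward(guess, true_perm))
```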
Turn-PPO (Turn Proximal Policy Optimization) by Amazon – a PPO-based training method for multi-turn AI agents. It treats each interaction turn as a decision step, improving stability and long-horizon reasoning compared to token-level or GRPO-based approaches. → Read more
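A minimal sketch of treating a turn as one action: sum token log-probs within each turn and apply the usual PPO clipping at turn granularity (how Turn-PPO estimates the turn-level advantages is in the paper):

```python
import numpy as np

def turn_logprobs(token_logps, turn_ids):
    """Sum token log-probs within each turn so a whole turn acts as one decision step."""
    turn_ids = np.asarray(turn_ids)
    return np.array([token_logps[turn_ids == t].sum() for t in np.unique(turn_ids)])

def turn_ppo_loss(logp_new, logp_old, turn_advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, applied per turn instead of per token."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    return -np.minimum(ratio * turn_advantages, clipped * turn_advantages).mean()

# Toy dialogue: 9 tokens spread over 3 turns, one advantage per turn.
token_new = np.random.default_rng(3).normal(-1.0, 0.1, size=9)
token_old = token_new + 0.02
turns = [0, 0, 0, 1, 1, 2, 2, 2, 2]
advantages = np.array([0.5, -0.2, 1.0])
print(turn_ppo_loss(turn_logprobs(token_new, turns), turn_logprobs(token_old, turns), advantages))
```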
Momentum-Anchored Group Relative Policy Optimization (M-GRPO) stabilizes self-supervised RL by anchoring the policy update with a momentum model and filtering low-entropy trajectories to avoid premature collapse. → Read more
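Two small pieces sketch the idea: an exponential-moving-average copy of the policy as the anchor, and an entropy filter that drops trajectories that look collapsed (the momentum coefficient and entropy floor are placeholder values):

```python
import numpy as np

def ema_update(anchor_params, policy_params, momentum=0.99):
    """Momentum anchor: a slow-moving average of the policy's parameters, used as the
    reference the update is regularized toward instead of a frozen initial model."""
    return {k: momentum * anchor_params[k] + (1 - momentum) * policy_params[k]
            for k in anchor_params}

def keep_trajectory(token_probs, entropy_floor=0.5):
    """Drop trajectories whose average token entropy is too low, a sign the policy is
    collapsing onto a narrow set of outputs."""
    p = np.clip(np.asarray(token_probs), 1e-12, 1.0)
    mean_entropy = -(p * np.log(p)).sum(axis=-1).mean()
    return mean_entropy >= entropy_floor

# Toy: two 3-token trajectories over a 4-symbol vocabulary.
confident = np.array([[0.97, 0.01, 0.01, 0.01]] * 3)   # filtered out
diverse = np.array([[0.4, 0.3, 0.2, 0.1]] * 3)         # kept
print(keep_trajectory(confident), keep_trajectory(diverse))
print(ema_update({"w": np.zeros(3)}, {"w": np.ones(3)})["w"])
```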
Tool-Augmented Policy Optimization (TAPO) integrates reasoning and adaptive tool usage in one RL framework, enabling models to interleave reasoning tokens with tool calls during optimization. → Read more
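A rollout-loop sketch of the interleaving: whenever the model emits a tool-call marker, run the tool, append its output, and mask that output out of the loss so only the model’s own tokens are trained on (the marker format and character-level masking here are simplifications):

```python
import re

def rollout_with_tools(generate, tools, prompt, max_calls=3):
    """Alternate model text and tool results, building a loss mask as we go."""
    text, loss_mask = prompt, []
    for _ in range(max_calls):
        chunk = generate(text)
        text += chunk
        loss_mask += [1] * len(chunk)                  # model-written text: kept in the loss
        call = re.search(r"<tool>(\w+)\((.*?)\)</tool>", chunk)
        if not call:
            break
        result = str(tools[call.group(1)](call.group(2)))
        text += result
        loss_mask += [0] * len(result)                 # tool output: masked out of the loss
    return text, loss_mask

# Toy model that asks for one calculation, then finishes its answer.
replies = iter(["Let me check. <tool>add(2,3)</tool>", " So the answer is 5."])
out, mask = rollout_with_tools(lambda _context: next(replies),
                               {"add": lambda s: sum(int(x) for x in s.split(","))},
                               prompt="Q: what is 2+3? ")
print(out)
print(sum(mask), len(mask))
```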
Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO) combine dense curriculum rewards that guide models from correct tool usage to high-quality answers with value-based sampling, which stabilizes training by focusing updates on informative samples and avoiding flat or collapsed rewards. → Read more
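A sketch of both halves: a reward schedule that shifts weight from tool correctness to answer quality, and a sampler that keeps the rollouts whose reward deviates most from the value estimate (the schedule, weights, and selection rule are assumptions):

```python
import numpy as np

def progressive_reward(tool_call_ok, answer_quality, stage, warmup_stages=10):
    """Early in training, reward correct tool usage; later, shift weight to answer quality."""
    w = min(stage / warmup_stages, 1.0)
    return (1 - w) * float(tool_call_ok) + w * answer_quality

def value_based_sample(rewards, values, k):
    """Keep the k rollouts whose reward deviates most from the value estimate, i.e. the
    most informative updates, dropping flat or redundant ones."""
    gap = np.abs(np.asarray(rewards) - np.asarray(values))
    return np.argsort(gap)[-k:]

rewards = [0.9, 0.5, 0.52, 0.1]
values = [0.5, 0.5, 0.5, 0.5]
print(progressive_reward(True, 0.8, stage=2))
print(value_based_sample(rewards, values, k=2))   # indices of the most informative rollouts
```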
Distributional Value Modeling-based Policy Optimization (DVPO) uses distributional token-level value estimates and risk-aware regularization to improve robustness and generalization under noisy supervision. → Read more
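A sketch of the distributional part, using quantile regression for the token-level value head and a worst-case (CVaR-style) summary as the risk-aware piece (both choices are assumptions about how DVPO might be instantiated):

```python
import numpy as np

def quantile_loss(predicted_quantiles, target_return, taus):
    """Quantile-regression loss: each quantile tau is penalized asymmetrically, so the set
    of quantiles models the whole return distribution, not just its mean."""
    u = target_return - np.asarray(predicted_quantiles)
    return float(np.mean(np.abs(np.asarray(taus) - (u < 0)) * np.abs(u)))

def risk_aware_value(predicted_quantiles, alpha=0.25):
    """Average only the worst alpha-fraction of quantiles, discouraging updates that look
    good on average but risky in the tail under noisy supervision."""
    q = np.sort(np.asarray(predicted_quantiles))
    k = max(1, int(np.ceil(alpha * len(q))))
    return float(q[:k].mean())

taus = np.linspace(0.05, 0.95, 10)
quantiles = np.linspace(-1.0, 2.0, 10)     # a toy token-level return distribution
print(quantile_loss(quantiles, target_return=0.7, taus=taus))
print(risk_aware_value(quantiles))
```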
Instruction-Policy co-evolution (INSPO) optimizes both the agent’s policy and the instructions it uses, evolving instructions dynamically with policy learning for better multi-turn reasoning. → Read more
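A toy co-evolution loop: score a pool of instructions by rollout reward, keep and mutate the best ones, and run the policy update against the current best in the same loop (the mutation and selection rules here are made up; the paper defines its own):

```python
import random

def mutate(instruction):
    """Toy mutation operator: append a random behavioral hint."""
    extras = [" Think step by step.", " Cite the tool output.", " Answer concisely."]
    return instruction + random.choice(extras)

def co_evolve(instructions, rollout_reward, policy_update, rounds=5, pool_size=4):
    """Alternate between instruction selection/mutation and policy updates."""
    pool = list(instructions)
    for _ in range(rounds):
        survivors = sorted(pool, key=rollout_reward, reverse=True)[: pool_size // 2]
        pool = survivors + [mutate(i) for i in survivors]   # keep the best, add variants
        policy_update(pool[0])                              # train the policy on the current best
    return pool[0]

# Toy reward: longer, more specific instructions score higher.
best = co_evolve(["Solve the task.", "Use the available tools."],
                 rollout_reward=lambda ins: len(ins),
                 policy_update=lambda ins: None)
print(best)
```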