Reinforcement Learning (RL) remains the main strategy for post-training, which is why advancing RL research matters so much. The field is now converging toward rich-feedback RL instead of scalar-reward RL: modern methods use a whole spectrum of signals as rewards – critiques, comparisons, checklists, verifier signals, community signals, multi-agent judgments. There are three widespread RL approaches that almost everyone knows:

  • RLHF (Reinforcement Learning from Human Feedback) – classic alignment method where humans rate model outputs, and the model is trained to match human preferences using a learned reward model.

  • RLAIF (Reinforcement Learning from AI Feedback) – same pipeline as RLHF, but replaces human judgments with AI evaluators for faster, cheaper and more scalable training.

  • RLVR (Reinforcement Learning with Verifiable Rewards) – uses objective, automatically checkable rewards (like math correctness or unit tests) instead of preferences, helping models improve reasoning by optimizing for provably correct outcomes.
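
The verifiable-reward idea can be sketched in a few lines. This is a toy illustration (the function name and the last-token parsing are ours, not from any specific RLVR implementation); real systems use symbolic checkers or unit-test suites rather than float parsing:

```python
def verifiable_reward(generated_answer: str, expected: float) -> float:
    """Reward is 1.0 iff the model's final answer is provably correct.

    Toy RLVR-style check for a math task: parse the model's last
    token as a number and compare it against the known ground truth.
    """
    try:
        # Take the last whitespace-separated token as the final answer.
        value = float(generated_answer.split()[-1])
    except (ValueError, IndexError):
        return 0.0  # unparseable answers earn no reward
    return 1.0 if abs(value - expected) < 1e-9 else 0.0

# A chain of thought ending in the right number gets full reward:
print(verifiable_reward("2 * 21 = 42", 42.0))  # -> 1.0
print(verifiable_reward("no idea", 42.0))      # -> 0.0
```

Because the check is objective and automatic, the reward cannot drift the way a learned preference model can – which is exactly why RLVR works well for math and code.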

But beyond them, there are equally interesting approaches, and some of them even have similar names. To help you not get lost in all these abbreviations, today we’ll go through 13 new and interesting RL approaches:

  1. RLCF – Reinforcement Learning from Community Feedback

    A recent paper called “AI Can Learn Scientific Taste” proposes training AI on signals from the scientific community (like citation counts, upvotes, and adoption) to learn what good research ideas look like. It works in two steps: first, a model learns to judge better vs. worse ideas; then another model is trained via rewards to generate high-impact research ideas aligned with that judgment. → Read more

    Interestingly, an approach with this name was proposed about half a year earlier. It shares the same idea but is based on the Community Notes principle: a diverse group of humans rates which notes are helpful, and these ratings are used as signals to train the model. Over time, the AI learns to produce more accurate, unbiased, and useful content aligned with community judgment. → Read more

  2. RLCF – Reinforcement Learning from Checklist Feedback

    Another method with the same abbreviation, but here the “C” stands for “checklist.” AI learns by checking how well a model/agent’s answers meet specific instruction-based criteria. It uses detailed checklists derived from the task. Each item is scored, combined into a reward and used to train the model to better follow complex instructions. → Read more
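
The "score each item, combine into a reward" step can be sketched roughly like this. All names and the weighting scheme are illustrative (real RLCF pipelines use learned or LLM judges per checklist item, not string predicates):

```python
def checklist_reward(answer: str, checklist: list[dict]) -> float:
    """Score an answer against instruction-derived checklist items.

    Each item has a `check` predicate and a `weight`; the reward is
    the weighted fraction of satisfied items.
    """
    total = sum(item["weight"] for item in checklist)
    earned = sum(item["weight"] for item in checklist if item["check"](answer))
    return earned / total if total else 0.0

# Hypothetical checklist for "write a short, bulleted summary":
checklist = [
    {"weight": 2.0, "check": lambda a: a.lstrip().startswith("-")},  # bulleted
    {"weight": 1.0, "check": lambda a: len(a.split()) <= 50},        # short
    {"weight": 1.0, "check": lambda a: "summary" in a.lower()},      # on-topic
]
print(checklist_reward("- summary: RL uses rich feedback", checklist))  # -> 1.0
```

The per-item decomposition is the point: a single scalar "is this good?" judgment is noisy, while many small yes/no checks give a denser, more interpretable training signal.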

  3. CM2

    If RLCF is checklist-based RL for instruction following, CM2 is checklist-based RL for multi-step agent behavior. It uses checklists of small criteria to evaluate each step of an agent’s behavior, turning complex tasks into binary yes/no judgments. This approach emphasizes stable evaluation in complex environments. → Read more

  4. Critique-RL

    Trains a separate critic model in a two-player loop (actor + critic). The critic learns to give detailed feedback, improving both judgment quality (discriminability) and usefulness through a two-stage RL process. → Read more

  5. CRL – Critique Reinforcement Learning

    The name is similar again, but CRL trains a single model to produce binary judgments (True/False) about solutions. The reward is whether the judgment matches the ground truth, so compared to Critique-RL it focuses on correct evaluation rather than rich feedback. → Read more

  6. ICRL – In-Context Reinforcement Learning

    Models learn to use tools directly through RL, without supervised fine-tuning. Instead of labeled data, it teaches tool use via examples inside prompts during training. Over time, these examples are removed, so the model learns to act independently. → Read more

  7. RLBF – Reinforcement Learning with Backtracking Feedback

    Here, the reward/feedback is not based on things like “which answer is better,” but on the model’s ability to backtrack when it detects a safety violation or incorrect generation. A critic provides signals like “backtrack by x tokens,” and the model learns to correct its own output on the fly. This is not preference learning like RLHF, but RL based on real-time corrective signals during generation, particularly for improving safety and robustness. → Read more
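
The mechanics of a "backtrack by x tokens" signal can be sketched as follows. This is only the rewind step, with invented names; the actual RLBF training loop also rewards the policy for backtracking at the right moments:

```python
def apply_backtrack(tokens: list[str], critic_signal: int) -> list[str]:
    """Apply a 'backtrack by x tokens' correction signal.

    When a critic flags a violation, the trajectory is rewound by
    `critic_signal` tokens so the policy can regenerate from a safe
    prefix instead of continuing a bad generation.
    """
    if critic_signal <= 0:
        return tokens  # no violation detected, keep the trajectory
    return tokens[:-critic_signal] if critic_signal < len(tokens) else []

trajectory = ["The", "password", "is", "hunter2"]
print(apply_backtrack(trajectory, 2))  # -> ['The', 'password']
```

The interesting part is that the correction happens during generation, so the model is trained to recover mid-stream rather than only being scored on finished outputs.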

  8. TriPlay-RL – Tri-Role Self-Play Reinforcement Learning

    A framework with three roles – attacker (creates harmful prompts), defender (generates safe responses), and evaluator (judges outputs) – that are trained in a closed loop with almost no manual annotation. Essentially, this is RL through the co-evolution of roles rather than a fixed reward model. It’s particularly interesting as a new direction in safety-oriented post-training. → Read more

  9. SPIRAL

    This is RL through an environment with verifiable rewards, where supervision comes from the game dynamics instead of external feedback. Here, the LLM learns through self-play in multi-turn, zero-sum games, automatically generating a curriculum of increasingly stronger opponents. → Read more

  10. Co-rewarding

    A self-supervised RL method where a model learns from multiple complementary feedback signals instead of just one – for example, checking consistency across similar questions or comparing itself against a slowly updated “teacher” version. This makes reward hacking harder and helps the model learn more stable, reliable reasoning. → Read more

  11. RESTRAIN

    An RL-style approach where a model learns without gold labels by using its own answer distribution as feedback. It penalizes overconfident or inconsistent rollouts while keeping promising reasoning chains. → Read more

  12. PRL – Process Reward Learning

    Breaks a single outcome reward into structured intermediate rewards, guiding the model during the reasoning process. This leads to more efficient learning and better exploration, accuracy and depth of reasoning. → Read more
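
One simple way to break an outcome reward into intermediate rewards is to distribute it across steps in proportion to step-level quality scores. This is our illustrative scheme, not necessarily the paper's exact formulation:

```python
def distribute_outcome_reward(step_scores: list[float],
                              outcome_reward: float) -> list[float]:
    """Turn one outcome reward into per-step intermediate rewards.

    Each reasoning step receives a share of the final reward
    proportional to a step-level quality score (e.g. from a
    process reward model judging each step).
    """
    total = sum(step_scores)
    if total == 0:
        return [0.0] * len(step_scores)
    return [outcome_reward * s / total for s in step_scores]

# Three reasoning steps judged 0.5, 1.0, 0.5; final-answer reward 1.0:
print(distribute_outcome_reward([0.5, 1.0, 0.5], 1.0))  # -> [0.25, 0.5, 0.25]
```

Dense per-step credit assignment is what makes exploration more efficient: the model learns which intermediate steps helped, not just whether the final answer was right.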

  13. RLSF – Reinforcement Learning from Self-Feedback

    Uses feedback from the model itself – its internal confidence signal. The model generates multiple solutions, ranks them based on confidence, and learns from the best ones without human labels. → Read more
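
One cheap proxy for the model's internal confidence is self-consistency: how often the same final answer recurs across samples. The sketch below uses that proxy (our simplification; RLSF works with the model's actual confidence signal) to rank generations before fine-tuning on the top ones:

```python
from collections import Counter

def rank_by_self_consistency(samples: list[str]) -> list[tuple[str, float]]:
    """Rank sampled answers by the model's agreement with itself.

    The score for each distinct answer is the fraction of samples
    that produced it; RLSF-style training would then learn from the
    top-ranked generations, with no human labels involved.
    """
    counts = Counter(samples)
    n = len(samples)
    return sorted(((ans, c / n) for ans, c in counts.items()),
                  key=lambda pair: pair[1], reverse=True)

samples = ["42", "42", "41", "42", "40"]
print(rank_by_self_consistency(samples)[0])  # -> ('42', 0.6)
```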

