Self-distillation isn’t new, but in 2026 it matters in a way it didn’t before. Models have reached a point where they can judge and rationalize solutions sometimes better than they can generate them from scratch. And the training landscape is also changing.

Until now, we have mostly relied on knowledge distillation – transferring knowledge and capabilities from large models to smaller ones – as well as RL-based post-training loops. Both approaches are expensive and compute-heavy. So it makes sense to lean on what current models already do well: they are now powerful enough to improve by comparing their own reasoning against a privileged, better version of themselves, and they are strong in-context learners. At the same time, scaling costs, continual updates, and synthetic data pipelines have made efficient post-training and test-time deployment a central concern.

In light of all this, today we’re talking about self-distillation as a scalable way to refine reasoning trajectories and upgrade a model’s behavior using its own judgments. It offers a middle path between the two main corners, supervised fine-tuning (SFT) and reinforcement learning (RL): on-policy updates, dense feedback, and no explicit reward model.

Fortunately, we have three really interesting works on how to make on-policy distillation shine:

  1. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models, which shows how a model can explicitly self-critique its own full reasoning paths.

  2. Self-Distillation Enables Continual Learning, which explores how continual learning can gain from self-distillation.

  3. Reinforcement Learning via Self-Distillation, which focuses on the power of feedback.

It’s a really interesting journey into what is happening right now in self-distillation, so let’s move on to the main focus areas.

Note: Thanks to Kevin Murphy for coining the phrase “On-Policy Distillation Zeitgeist” in a tweet.

In today’s episode, we will cover:

  • Broadening your knowledge of distillation

  • What is on-policy self-distillation?

  • Self-distillation in the field of continual learning

  • Self-distillation policy optimization, or how to get the maximum out of feedback

  • Conclusion: What it all teaches us

  • Sources and further reading

Broadening your knowledge of distillation

First of all, let’s look at the main options we usually have when it comes to model training.

The classic one is knowledge distillation (which we discussed in detail in an earlier episode) – a way to train a smaller or weaker language model (the student) by imitating a stronger one (the teacher). This works well for the student because it doesn’t learn just what is correct – it also learns from the teacher’s probabilities over words, seeing how confident the teacher is in different options. That extra information helps the student learn better patterns. Overall, training comes down to making the student’s predictions match the teacher’s predictions at every step.
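
As a rough PyTorch sketch (illustrative only, not code from any of the works discussed here), this objective is just a per-token KL divergence between the teacher’s and the student’s next-token distributions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Token-level knowledge distillation.

    Both logit tensors have shape [batch, seq_len, vocab]; the student is
    pushed toward the teacher's full next-token distribution at every step,
    not just toward the single correct token.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), summed over vocabulary and sequence positions,
    # averaged over the batch ("batchmean" divides by the first dimension).
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2  # usual rescaling when soft targets are used
```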

Knowledge distillation is typically implemented as off-policy training. The form we know best is supervised fine-tuning (SFT), where the model learns to imitate expert outputs from a fixed dataset.

But this setup becomes fragile when the model must generate long reasoning chains at inference time. Classic distillation trains on a fixed dataset of idealized reasoning paths, while at deployment the model generates its own imperfect trajectories. Small early mistakes can snowball, creating the well-known gap between training and inference – often referred to as distribution mismatch or exposure bias. That’s why there is another option – on-policy distillation.

On-policy distillation is much closer to a realistic learning process: you try → your teacher corrects you → you improve. It lets the student generate its own answers during training. The teacher then evaluates these sequences, and the student gets token-by-token feedback on what it should have done instead. It combines the realism of reinforcement learning with the clear, dense feedback of supervised learning.
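
Here is a minimal sketch of that loop, assuming a HuggingFace-style causal LM interface (`generate`, `.logits`); the function name and the reverse-KL choice are illustrative, not taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, max_new_tokens=128):
    """One on-policy self-distillation step (illustrative sketch).

    1. The student samples its own continuation of the prompt.
    2. Student and teacher both re-score that trajectory token by token.
    3. The student is pushed toward the teacher's distribution on its own
       samples, which gives dense, per-token feedback.
    """
    prompt_len = prompt_ids.shape[1]

    # 1. Student acts: sample a trajectory from the current (on-policy) student.
    with torch.no_grad():
        rollout = student.generate(prompt_ids, do_sample=True,
                                   max_new_tokens=max_new_tokens)

    # 2. Re-score the whole rollout with both models.
    student_logits = student(rollout).logits        # [batch, seq, vocab]
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits    # teacher stays frozen

    # Logits at position i predict token i+1, so the generated tokens are
    # predicted by positions prompt_len-1 ... seq-2.
    gen = slice(prompt_len - 1, rollout.shape[1] - 1)
    s_logp = F.log_softmax(student_logits[:, gen], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, gen], dim=-1)

    # 3. Reverse KL(student || teacher): "given what I actually generated,
    # what should I have done instead?" -- dense feedback on every token.
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
    return loss
```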

Another approach gaining a lot of traction right now is Reinforcement Learning with Verifiable Rewards (RLVR) – methods like GRPO (Group Relative Policy Optimization) or PPO-style (Proximal Policy Optimization) algorithms, where a model’s answers are automatically checked for correctness and rewarded accordingly. But this approach has its own problems: most methods evaluate only the final answer, ignoring mistakes in the middle steps. Rewards are generally binary (right/wrong), and compute is wasted whenever all sampled answers are correct or all are incorrect, because then there is simply no learning signal.
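
To make the “no learning signal” point concrete, here is a tiny illustrative sketch of how a GRPO-style group-relative advantage behaves (heavily simplified; real implementations add importance ratios, clipping, and a KL penalty):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4):
    """GRPO-style advantage: score each sampled answer relative to the mean
    reward of its group (heavily simplified; no clipping or KL penalty)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Mixed outcomes in a group of 4 sampled answers -> useful learning signal:
print(group_relative_advantages(torch.tensor([1., 0., 0., 1.])))
# All correct (or all incorrect) -> every advantage is ~0, so the compute
# spent on that group produces essentially no gradient signal:
print(group_relative_advantages(torch.tensor([1., 1., 1., 1.])))
```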

Yes, there is also the option of giving step-by-step rewards during reasoning, for example with process reward models (PRMs), but these require manual labeling of reasoning steps, which is expensive and hard to scale.

So that’s why we need to change something in our approach to knowledge distillation, and many of the shifts are now happening in self-distillation in particular – it makes it possible to look inside the reasoning steps and provide much more detailed feedback.

What is On-Policy Self-Distillation?

Basically, the best combination for a model would be to get all the good parts at once:
