Deep research, deep thinking, and deep analysis are what make AI more capable today. Because a lot of compute is wasted on ever-longer reasoning, researchers have started looking at how to allocate compute budgets and memory flexibly across different tasks and tokens.
Here is something quite new from KAIST, Google, Mila, and Université de Montréal that goes truly deeper into this “thinking depth” concept and meets the requirements of novel efficient systems. Their creation is called Mixture-of-Recursions (MoR).
The main feature of MoR lies in the optimal reuse of layers, so each token goes through exactly as much processing as it needs. With a customizable combination of two types of routing mechanisms and two types of KV (key-value) caching, it provides a solid tech stack for our familiar Transformers.
MoR is like giving the model a small, reusable thinking engine that works harder only where needed. This allows it to match the quality of larger models while being cheaper, faster, and highly adaptable for specific needs.
Let’s take a closer look at this technical variation, what MoR can provide, and how effectively it can upgrade Transformer models and replace their old building patterns.
In today’s episode, we will cover:
The key idea behind Mixture-of-Recursions (MoR)
What powers MoR?
Routing mechanism
KV caching strategies
Actual results
Advantages of MoR
Not without limitations
Conclusion
Sources and further reading
What is Mixture-of-Recursions? How It Differs from Mixture of Experts
Scaling up language models makes them smarter – that is already an axiom. However, a constant challenge for developers is managing the huge resource demands, like computing power and memory, that come with this scaling and make models harder to train and run. In the race to create the biggest and smartest AI models, there are two well-established paths that help them stay efficient:
Parameter sharing – reusing the same model weights instead of having separate ones for every layer. A cool approach here is layer tying which involves looping text through the same layers multiple times.
Adaptive computation – deciding on the fly which parts to use for each token. Early exiting is a widespread method here. It lets easy tokens finish sooner, so the model doesn’t waste compute budget on them.
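To make the two ideas concrete, here is a toy Python sketch. Everything in it is hypothetical and for illustration only: `count_unique_params` and `steps_with_early_exit` are made-up names, and a token's "difficulty" is just an integer picked by hand, not something a real model would compute this way.

```python
def count_unique_params(num_steps, params_per_layer, tied):
    """Unique parameters needed to run `num_steps` layer applications.

    With layer tying, one set of weights is looped `num_steps` times;
    without it, every step owns a separate set.
    """
    return params_per_layer if tied else num_steps * params_per_layer


def steps_with_early_exit(difficulty, max_steps):
    """Number of layer passes a token uses before it exits.

    Assumption: a token exits as soon as it has received `difficulty`
    passes, and never gets more than `max_steps`.
    """
    steps = 0
    while steps < max_steps:
        steps += 1  # one more pass through the (shared) layer
        if steps >= difficulty:
            break  # easy tokens leave early
    return steps


untied = count_unique_params(num_steps=12, params_per_layer=1_000_000, tied=False)
tied = count_unique_params(num_steps=12, params_per_layer=1_000_000, tied=True)
print(untied, tied)  # 12000000 1000000

passes = [steps_with_early_exit(d, max_steps=4) for d in (1, 3, 10)]
print(passes)  # [1, 3, 4]
```

Tying shrinks the number of unique weights, while early exit shrinks the number of passes per token. They save different resources, which is why combining them is attractive.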
But why choose when we can use them both?
A big group of researchers from KAIST AI (they often have very interesting studies), Mila, Google Cloud, Google DeepMind, Google Research, and Université de Montréal decided to combine these two ideas in one system. They called it Mixture-of-Recursions, or MoR. It is a next-level version of Recursive Transformer (a model with a smaller set of layers that are reused multiple times in sequence) that learns to give each token its own “thinking depth” and optimizes memory use along the way.

Image Credit: Mixture-of-Recursions original paper
Wait, what is wrong with Recursive Transformer that we need a new approach like MoR? A few things to mention:
While in Recursive Transformer layers are shared, each recursion step usually has its own memory cache for attention, called KV cache, which takes up a lot of space and slows things down.
Most Recursive Transformers give every token the same recursion depth, although some tokens are easy to process and others need more steps. This wastes compute.
Early exit for tokens can solve this problem, but it often hurts performance and requires extra engineering.
MoR builds on earlier approaches to adaptive computation, including Mixture-of-Depths (MoD), which routes tokens through different numbers of transformer layers rather than recursion steps.
So these issues show that we need another system that could solve them. Here is the tech stack that allows MoR to achieve its complex workflow concept and be a breakthrough on the way to smarter systems.
How Mixture of Recursions Architecture Works: Routing & KV Cache
MoR has a small set of layers it reuses over and over. At each step, a “router” decides for every token whether it should keep going through the layers or stop. This shared “recursion step” can be repeated up to N times for each token, depending on what the router decides. In other words, MoR lets each token individually “choose” the number of recursion steps during training and inference. This positively influences Recursive Transformers, turning them into more adaptive systems.
Two main components make this workflow idea real:
Routing mechanism – “decides” how many times each token goes through the shared recursion block. This is about the recursion depth.
KV caching strategy – identifies how and when to store/reuse key–value (KV) pairs for attention at different recursion depths.
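The loop described above can be sketched in a few lines of Python. This is a minimal toy under loud assumptions, not the paper's implementation: each token's hidden state is a single float, `shared_block` stands in for the reused stack of layers, and `router_continue` is a hypothetical threshold rule instead of a learned router.

```python
def shared_block(h):
    # Stand-in for the shared stack of Transformer layers (toy update rule).
    return 0.5 * h + 1.0


def router_continue(h, threshold=1.9):
    # Hypothetical router: a token keeps recursing while its state is below
    # the threshold. A real router would be a small learned module.
    return h < threshold


def mor_forward(hidden, max_recursions):
    """Apply the same block up to `max_recursions` times per token.

    Returns the final hidden states and the recursion depth each token used.
    """
    hidden = list(hidden)
    depth = [0] * len(hidden)
    active = list(range(len(hidden)))  # every token takes the first step
    for _ in range(max_recursions):
        if not active:
            break
        for i in active:
            hidden[i] = shared_block(hidden[i])  # same weights at every step
            depth[i] += 1
        # The router decides, per token, whether to go around again.
        active = [i for i in active if router_continue(hidden[i])]
    return hidden, depth


final_hidden, depths = mor_forward([0.0, 1.7, 1.9], max_recursions=3)
print(depths)  # [3, 2, 1]
```

The three tokens share one block but end up with three different depths, which is the core of the "thinking depth" idea.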
Well, first things first.
Routing mechanism
MoR has two ways to assign recursion depth:
Expert-choice routing
This method is like a dynamic early exit, where hard tokens keep going, and easy ones stop earlier and exit the recursion:

Image Credit: Mixture-of-Recursions original paper
MoR treats each recursion depth as a separate “expert.”
At each step, the router scores every token based on its hidden state and picks only the top-k tokens to continue.
As only the surviving tokens from the previous step are scored at the next step, this naturally creates hierarchical filtering, so the candidate pool gets smaller at each depth.
This method makes the compute budget per step fixed and predictable, but can cause information leakage during training as the router might indirectly “see” future tokens. So it needs extra solutions like auxiliary routers or an auxiliary loss to guide decisions.
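Here is a small sketch of the hierarchical top-k filtering described above. It is an illustration under assumptions: the router scores are passed in as plain numbers (in a real model they would be computed from each token's hidden state at that step), `capacities` is a hypothetical name for the per-step top-k values, and the first step is given enough capacity to let every token through.

```python
def expert_choice_depths(scores_per_step, capacities):
    """Assign recursion depths with expert-choice (top-k) routing.

    scores_per_step[r][i]: router score of token i at recursion step r.
    capacities[r]: how many tokens (top-k) step r accepts.
    Only tokens selected at the previous step are candidates at the next.
    """
    num_tokens = len(scores_per_step[0])
    depths = [0] * num_tokens
    candidates = list(range(num_tokens))
    for step_scores, k in zip(scores_per_step, capacities):
        # Rank only the surviving tokens, highest score first.
        ranked = sorted(candidates, key=lambda i: step_scores[i], reverse=True)
        selected = sorted(ranked[:k])
        for i in selected:
            depths[i] += 1
        candidates = selected  # the pool shrinks at each depth
    return depths


scores = [
    [0.9, 0.2, 0.6, 0.4],  # step 1
    [0.3, 0.8, 0.7, 0.1],  # step 2
    [0.5, 0.9, 0.2, 0.6],  # step 3
]
ec_depths = expert_choice_depths(scores, capacities=[4, 2, 1])
print(ec_depths)  # [1, 3, 2, 1]
```

Note that token 3 has the second-highest score at step 3 but is never considered there, because it was already filtered out at step 2. The number of tokens processed per step is fixed by `capacities`, which is what makes the budget predictable.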
Token-choice routing
This way the router decides the recursion depth once per token at the start. It assigns each token to a depth (like depth 1, 2, or 3) considering its initial hidden state. Tokens then go through that many recursion steps, no more, no less.

Image Credit: Mixture-of-Recursions original paper
Compared to expert-choice routing, token-choice routing has no leakage. On the other hand, it is hard to keep the compute load balanced across depths, which is why it needs balancing loss functions or specialized algorithms.
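A toy sketch of token-choice routing shows both the one-shot decision and the balancing problem. The scores below are invented for illustration, and `token_choice_depths` and `load_per_depth` are hypothetical helper names. A real router would produce the scores from each token's initial hidden state.

```python
from collections import Counter


def token_choice_depths(depth_scores):
    """Pick one recursion depth per token, once, at the start.

    depth_scores[i][d] is the router score for sending token i through
    d + 1 recursion steps; each token takes its highest-scoring depth.
    """
    return [max(range(len(s)), key=lambda d: s[d]) + 1 for s in depth_scores]


def load_per_depth(depths, max_depth):
    """Count how many tokens were assigned to each depth 1..max_depth."""
    counts = Counter(depths)
    return [counts.get(d, 0) for d in range(1, max_depth + 1)]


scores = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.5, 0.2],
]
tc_depths = token_choice_depths(scores)
print(tc_depths)                     # [2, 2, 1, 2]
print(load_per_depth(tc_depths, 3))  # [1, 3, 0]
```

Nothing forces the tokens to spread out: here three of four tokens choose depth 2 and nobody chooses depth 3. This kind of skew is what a balancing loss is meant to discourage.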
KV caching strategies
Why does MoR’s KV caching need some special tricks? Well, when a token exits early, its KV pairs for deeper steps simply don’t exist, and this can break attention for later tokens. That’s why MoR proposes two solutions that also cut memory use and make everything faster.

Image Credit: Mixture-of-Recursions original paper
So here is how MoR stores and retrieves memory (key–value pairs) only for tokens that actually need more steps:
Recursion-wise KV caching: MoR stores KV pairs only for tokens still active at that recursion depth, so attention is limited to those cached tokens at that step. This reduces memory use and the amount of data read from or written to memory (I/O), because you only store what’s needed.
Recursive KV sharing: All tokens go through at least the first recursion block, so KV caching happens only at this first step, and the cached pairs are then reused for all later recursions. This method speeds up the start of generation, avoids recomputation, and saves even more memory.
While doing this, MoR also runs the expensive attention operation only for tokens still being processed at a given step.
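To compare the two strategies, the sketch below only tracks which token positions would have KV pairs stored at each recursion step, not the actual tensors. It is a bookkeeping toy under assumptions: the function names are hypothetical, and "entries" counts one cached position per step for the shared block.

```python
def recursion_wise_cache(depths, max_depth):
    """Recursion-wise caching: at step r, store KV pairs only for tokens
    that are still active at step r (their depth is at least r)."""
    return {
        r: [i for i, d in enumerate(depths) if d >= r]
        for r in range(1, max_depth + 1)
    }


def recursive_sharing_cache(depths):
    """Recursive sharing: store KV pairs once, at the first step, for all
    tokens; later steps reuse these entries instead of storing their own."""
    return {1: list(range(len(depths)))}


def num_entries(cache):
    """Total number of cached (step, position) entries."""
    return sum(len(positions) for positions in cache.values())


depths = [1, 3, 2, 1]  # recursion depth used by each token
rw_cache = recursion_wise_cache(depths, max_depth=3)
rs_cache = recursive_sharing_cache(depths)

print(rw_cache)               # {1: [0, 1, 2, 3], 2: [1, 2], 3: [1]}
print(num_entries(rw_cache))  # 7
print(num_entries(rs_cache))  # 4
print(len(depths) * 3)        # 12 entries if every step cached every token
```

In this toy, recursion-wise caching keeps 7 of the 12 possible entries and recursive sharing keeps 4. The price of sharing is that deeper steps attend to keys and values computed at the first step, not at their own depth.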
In summary,
Routing makes MoR adaptive — each token gets as much computation as it needs.
KV caching makes MoR efficient — the model stores less and reuses more, reducing both memory and compute.
It’s important to notice that MoR doesn’t run all variants of routing and KV caching at once. For a given model configuration it can combine one routing strategy with one KV caching strategy, and that gives four valid combinations:
Expert-choice routing + Recursion-wise caching
Expert-choice routing + Recursive sharing
Token-choice routing + Recursion-wise caching
Token-choice routing + Recursive sharing
The choice between expert-choice vs token-choice routing and recursion-wise vs recursive sharing caching depends on trade-offs between speed, memory, and simplicity.
Now, let’s look at the real effectiveness of MoR.
Mixture of Recursions Benchmarks: Speed, Memory & Accuracy
Tests show that MoR has significant performance gains compared to Transformers that we are used to:
MoR with expert-choice routing and 2 recursions got lower validation loss and higher few-shot accuracy (43.1%) than a standard Transformer (42.3%) while using about 50% fewer parameters.

Image Credit: Mixture-of-Recursions original paper
Increasing recursions to 3 or 4 kept performance competitive with the full-size Transformer.
It cuts training time by 19% and peak memory use by 25%, thanks to hierarchical filtering and recursion-wise attention that shortens sequences for easier tokens.
With 3 recursions, token-choice routing performed worse (40.0% accuracy) than expert-choice (42.6%).
KV cache sharing saved memory but slightly hurt accuracy. However, it is still a good trade-off when memory is critical.
MoR uses continuous depth-wise batching, which keeps GPUs busy by batching tokens at different recursion depths together. At large batches it saw up to 2.06× throughput speedup.
In general, MoR gains more from adding parameters and training for fewer steps than from feeding it more data.
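The intuition behind the continuous depth-wise batching mentioned above can be shown with a simplified scheduling toy. It is only loosely modeled on the idea and is not the paper's inference engine: each work item needs a given number of passes through the shared block, a batch has a fixed number of slots, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(depths, batch_size):
    """Each batch runs until its deepest item finishes; finished slots idle."""
    steps = 0
    for start in range(0, len(depths), batch_size):
        steps += max(depths[start:start + batch_size])
    return steps


def continuous_batching_steps(depths, batch_size):
    """Freed slots are refilled right away. This works because every depth
    uses the same shared block, so items at different depths can share a batch."""
    queue = deque(depths)
    slots = []  # remaining passes for each in-flight item
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())  # fill any free slot
        steps += 1  # one pass of the shared block for the whole batch
        slots = [r - 1 for r in slots if r - 1 > 0]  # drop finished items
    return steps


work = [1, 3, 1, 1, 2, 1]  # passes needed per item (each at least 1)
print(static_batching_steps(work, batch_size=2))      # 6
print(continuous_batching_steps(work, batch_size=2))  # 5
```

The gain in this toy comes purely from not letting slots sit idle while a deep item finishes. The real throughput numbers depend on hardware and batch size.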
Researchers also tested different configurations and suggest that the best design choice is:
middle-cycle parameter sharing (it’s when the first and last layers remain unique and the middle layers are shared),
expert-choice routing with auxiliary loss and linear router, and
recursion-wise caching, unless you use token-choice routing, where KV sharing can help.
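Middle-cycle parameter sharing is easier to see as a mapping from layer positions to weight sets. The sketch below is one possible reading of the scheme, assuming the middle layers cycle through a shared block of equal size at every recursion. The function name and the labels `first`, `shared_k`, and `last` are hypothetical.

```python
def middle_cycle_map(num_layers, num_recursions):
    """Map each layer position of the unrolled model to a weight set.

    The first and last layers keep unique weights; the middle layers are
    split into `num_recursions` passes over one shared block.
    """
    middle = num_layers - 2
    if middle <= 0 or middle % num_recursions != 0:
        raise ValueError("middle layers must split evenly across recursions")
    block_size = middle // num_recursions
    mapping = ["first"]
    for pos in range(middle):
        mapping.append(f"shared_{pos % block_size}")  # cycle through the block
    mapping.append("last")
    return mapping


layout = middle_cycle_map(num_layers=8, num_recursions=3)
print(layout)
# ['first', 'shared_0', 'shared_1', 'shared_0', 'shared_1',
#  'shared_0', 'shared_1', 'last']
print(len(set(layout)))  # 4 unique weight sets for 8 layer positions
```

In this example an 8-layer-deep computation needs only 4 unique sets of layer weights.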
With all these results, let's now summarize everything and analyze the pros and cons of the MoR approach.
Why Recursive Architecture Beats the Standard Transformer in Efficiency
Reusing layers saves parameters, as the model passes text through the same stack of layers multiple times, instead of having separate layers for each step.
MoR’s routing decisions aren’t random – they line up with token meaning. Easier tokens get fewer passes, harder ones get more.
Smarter attention & memory use: At each pass, the model only runs the expensive attention calculation for tokens still “active,” and only stores memory for those.
Efficient “reusing” strategy: A special version of KV caching reuses some stored data (key-value pairs) from the first pass, which speeds things up and uses less memory.
As MoR is more compute-efficient, it processes more tokens within the same FLOP budget.
Enables test-time scaling: You can increase the maximum recursion depth at inference without retraining.
But, as usual ->
Mixture of Recursions Limitations & Open Questions
MoR’s limitations mostly come from the fact that it is still quite unexplored:
Difficulty in adjusting routing capacity after training: In expert-choice routing with auxiliary loss, the router’s outputs for “selected” and “unselected” tokens are almost perfectly separated. So it is hard to change the top-k (capacity) values at inference time, because the routing decisions become very rigid once trained.
The router hasn’t yet been optimized to dynamically adjust recursion depth based on reasoning complexity.
Current experiments only go up to 1.7B parameters. How MoR performs at larger scales (>3B parameters) remains unknown.
The current design doesn’t allow easily changing how much compute is given to different tokens on the fly.
Not yet integrated with sparse computation techniques.
It’s also not yet tested beyond text.
So, in the end, where does that leave us with MoR?
Conclusion
Mixture-of-Recursions supports one of the most important current ideas and suggests another future direction for AI development: instead of endlessly adding more layers, models should “know” how to think deeper only when they need to. By rethinking recursion depth, routing, and memory, MoR becomes a smarter way of using what we already have.
It sets the stage for flexible reasoning depth, giving Transformers an adaptable backbone where each token gets just the right amount of processing. Although it is still early and needs more testing, its customizable design makes it a versatile system, like a real transformer toy. If the open issues are resolved, it could deliver strong, specialized efficiency.
Sources and further reading
Resources from Turing Post


