AI 101: What is Mixture-of-Recursions (MoR)?
Explore a configurable version of the Transformer that delivers big-model quality without big-model cost
Deep research, deep thinking, and deep analysis are what make AI more capable today. But because so much compute is spent on extended reasoning, researchers have begun to explore flexible, customizable use of compute budgets and memory across different tasks and tokens.
Here is something quite new from KAIST, Google, Mila, and Université de Montréal that takes this “thinking depth” concept truly deeper and meets the requirements of novel, efficient systems. Their creation is called Mixture-of-Recursions (MoR).
The main feature of MoR is optimal reuse of layers: each token goes through exactly as much processing as it needs. With a customizable combination of two routing mechanisms and two KV (key-value) caching strategies, it provides a solid tech stack for our familiar Transformers.
MoR is like giving the model a small, reusable thinking engine that works harder only where needed. This allows it to match the quality of larger models while being cheaper, faster, and highly adaptable for specific needs.
Let’s take a closer look at this technical variation: what MoR can provide, and how effectively it can upgrade Transformer models and replace their old building patterns.
In today’s episode, we will cover:
The key idea behind Mixture-of-Recursions (MoR)
What powers MoR?
Routing mechanism
KV caching strategies
Actual results
Advantages of MoR
Not without limitations
Conclusion
Sources and further reading
The key idea behind Mixture-of-Recursions (MoR)
Scaling up language models makes them smarter – that is already an axiom. However, a constant challenge for developers is managing the huge resource demands, like computing power and memory, that come with this scaling, making models harder to train and run. In the race to create the biggest and smartest AI models, there are two well-established paths that help keep them efficient:
Parameter sharing – reusing the same model weights instead of having separate ones for every layer. A cool approach here is layer tying, which loops tokens through the same layers multiple times.
Adaptive computation – deciding on the fly which parts to use for each token. Early exiting is a widespread method here. It lets easy tokens finish sooner, so the model doesn’t waste compute budget on them.
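The two ideas above can be sketched in a few lines of toy code. This is a minimal illustration, not the paper's implementation: a single shared weight matrix stands in for a tied Transformer block, and the recursion depth per token stands in for adaptive computation (all names and numbers here are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Parameter sharing: ONE weight matrix stands in for a tied Transformer
# block (a real block would contain attention + MLP).
W_shared = rng.normal(size=(d, d)) * 0.1

def shared_block(h):
    # The same weights are reused at every recursion step (layer tying).
    return h + np.tanh(h @ W_shared)

def recursive_forward(h, depth):
    # Loop the hidden state through the single shared block `depth` times.
    for _ in range(depth):
        h = shared_block(h)
    return h

# Adaptive computation: an "easy" token exits after 1 step,
# a "hard" one gets 3 passes through the same block.
easy_token = rng.normal(size=d)
hard_token = rng.normal(size=d)
out_easy = recursive_forward(easy_token, depth=1)
out_hard = recursive_forward(hard_token, depth=3)
```

Note that the model's parameter count stays constant no matter how many recursion steps are taken – that is the whole appeal of combining the two tricks.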
But why choose when we can use them both?
A big group of researchers from KAIST AI (they often have very interesting studies), Mila, Google Cloud, Google DeepMind, Google Research, and Université de Montréal decided to combine these two ideas in one system. They called it Mixture-of-Recursions, or MoR. It is a next-level version of Recursive Transformer (a model with a smaller set of layers that are reused multiple times in sequence) that learns to give each token its own “thinking depth” and optimizes memory use along the way.

Image Credit: Mixture-of-Recursions original paper
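To make the "each token gets its own thinking depth" idea concrete, here is a simplified sketch. The linear router, the sigmoid-to-depth mapping, and all dimensions are assumptions for illustration – MoR's actual routers are learned and come in two variants described later in the article.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens, max_depth = 8, 5, 3

W_block = rng.normal(size=(d, d)) * 0.1   # one shared (tied) block
w_router = rng.normal(size=d)             # hypothetical linear router

def block(h):
    return h + np.tanh(h @ W_block)

def mor_forward(tokens):
    # Router scores each token once; the score picks its recursion
    # depth (a simplified stand-in for MoR's learned routing).
    scores = 1.0 / (1.0 + np.exp(-(tokens @ w_router)))   # in (0, 1)
    depths = 1 + np.floor(scores * max_depth).astype(int)
    depths = np.minimum(depths, max_depth)

    out = tokens.copy()
    for step in range(1, max_depth + 1):
        active = depths >= step        # only tokens that still need work
        out[active] = block(out[active])
    return out, depths

tokens = rng.normal(size=(n_tokens, d))
out, depths = mor_forward(tokens)
```

The key property: at later recursion steps only a shrinking subset of tokens is still active, so compute is spent where the router decides it is needed.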
Wait, what is wrong with Recursive Transformer that we need a new approach like MoR? A few things to mention:
While layers in a Recursive Transformer are shared, each recursion step usually has its own memory cache for attention, called a KV cache, which takes up a lot of space and slows things down.
Most Recursive Transformers give every token the same recursion depth, although some tokens might be easy to process while others need more steps. This wastes compute power.
Early exit for tokens can solve this problem but often hurts performance and requires extra engineering.
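The first complaint above – a separate KV cache per recursion step – is easy to quantify with back-of-the-envelope arithmetic. The numbers below (layers, heads, sequence length) are illustrative choices, not figures from the paper:

```python
def kv_cache_bytes(n_layers, n_recursions, seq_len, n_heads, head_dim,
                   bytes_per_value=2, shared_across_recursions=False):
    # Each cached position stores a key AND a value vector
    # per head, per layer (fp16 -> 2 bytes per value).
    per_step = 2 * seq_len * n_heads * head_dim * bytes_per_value * n_layers
    # Naive recursion keeps one full cache per recursion step;
    # sharing the cache across steps keeps just one.
    steps = 1 if shared_across_recursions else n_recursions
    return per_step * steps

# Toy config: 4 shared layers looped 3 times, 2K context, 16 heads of dim 64.
naive = kv_cache_bytes(4, 3, 2048, 16, 64)
shared = kv_cache_bytes(4, 3, 2048, 16, 64, shared_across_recursions=True)
```

With these assumed numbers the naive scheme stores three times the cache of the shared one – exactly the kind of overhead MoR's caching strategies are designed to cut.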
These issues show the need for a system that solves all of them at once. Here is the tech stack that lets MoR realize its complex workflow and become a breakthrough on the way to smarter systems.
What powers MoR?