Quick answer: What are Attention Residuals and Mixture-of-Depths Attention?
Attention Residuals and Mixture-of-Depths Attention (MoDA) are two new Transformer techniques that treat depth as something the model can search over, not just pass through. Both start from the same diagnosis: in very deep Transformers, useful early-layer signals get diluted by repeated residual updates, so the model loses explicit control over which intermediate representations to preserve or reuse. Attention Residuals fixes this in the residual stream by learning how much to mix from earlier layers, while MoDA fixes it inside attention by letting heads retrieve keys and values from preceding layers. Together, they point to a broader shift: Transformer depth is becoming an addressable memory dimension, much like sequence already is.
Subscribe for weekly operator-grade AI systems analysis:
https://www.turingpost.com/subscribe
Key concepts in this article
Signal dilution in deep Transformers: useful shallow-layer features get washed out as depth increases.
Learned depth selection: the model decides which earlier layers matter for the current token and context.
Attention Residuals (Kimi Team): replaces fixed residual accumulation with softmax attention over previous layer outputs.
Mixture-of-Depths Attention / MoDA (ByteDance Seed): extends attention so a token can retrieve information from earlier layers, not only earlier tokens.
Depth as retrieval: the model treats layer history as memory it can query dynamically.
Why this matters: better reasoning, stronger feature reuse, and more efficient scaling for deeper Transformers.
We’ve been working with Transformers for many years, and they’ve gotten deeper, but that depth has mostly remained a passive stack. Residual connections carry history forward by adding each layer’s output to its input, but that history is never explicitly preserved, selected, or revisited. Attention gives us a peek into what the model is focusing on when making decisions, but it mainly looks across tokens, not across layers. Neither mechanism gives the model explicit control over which layer representations to preserve or reuse. As models get deeper, useful information from earlier layers gets diluted, degraded, or washed out by repeated updates.
Recently, we’ve started to see a new shift: depth should be treated as something the model can search over, not just pass through. Fixed depth propagation is finally turning into learned depth selection.
Where is this change coming from? Coincidentally (or not), on March 16 alone, two big Chinese labs released closely related papers:
Kimi Team shared “Attention Residuals,” which argues that fixed unit-weight residual accumulation causes hidden-state growth and “dilutes each layer’s contribution,” so it replaces that with softmax attention over preceding layer outputs.
ByteDance Seed’s “Mixture-of-Depths Attention” paper claims that deep LLMs suffer from “signal degradation,” where informative shallow-layer features are gradually diluted by repeated residual updates, and proposes letting attention heads retrieve keys/values from preceding layers.
Same diagnosis, same philosophical move, but they approach it from slightly different angles.
Never before has depth in Transformers been treated so explicitly as a retrieval problem. This signals a new wave of Transformer evolution: models are starting to treat depth the way they already treat sequence, as an addressable dimension.
Today we’ll take a closer look at how this problem can be addressed through the lens of two ideas from Kimi Team and ByteDance: what they have in common, how they differ, and why they matter for pushing Transformers forward.
In today’s episode:
Why has “searching over depth” come up now?
Attention Residuals: Residuals, but Learned via Attention
How does it work?
Pros and cons of the approach
Mixture-of-Depths Attention (MoDA): Turning Depth into an Attention Axis
How it works
Advantages of MoDA
Trade-offs and limitations
Conclusion
Sources and further reading
Why has “searching over depth” come up now?
For most of their history, transformers treated depth as a fixed pipeline. Each layer updated the representation, passed it forward, and residual connections simply added everything together along the way. It worked, but it also meant that early, useful features were gradually diluted as models got deeper.
The idea that depth should be flexible or adaptive has actually been explored for years. By 2023–2025, many researchers had started to notice this problem. Some tried to stabilize deep networks (DeepNet, residual scaling), others made depth more flexible (LayerDrop, early-exit, Mixture-of-Experts routing). A few explored cross-layer connections or feature reuse. All of this hinted at the same idea: not every layer matters equally, and depth shouldn’t be uniform.
However, all of these approaches stopped short of a bigger shift. Yes, depth could be skipped, scaled, or lightly reused, but it wasn’t something the model could actively and dynamically search through. And the problem starts with the basic ideas transformers are built on.
In general, LLMs rely on residual connections: each layer simply adds its output to everything that came before. This creates a “gradient highway,” so deep networks train stably. But if you unroll the recursion, you’ll see that each layer effectively sees a sum of all previous layer outputs with equal weight. Every past representation contributes with weight 1, regardless of whether it is useful for the current input. As you go deeper, this leads to two problems:
You keep accumulating more and more values, and total magnitude (the size, or norm, of the hidden state vector) grows.
Individual layer contributions get diluted. Earlier signals are washed out by later additions, and later layers can’t selectively retrieve what they need.
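Both effects fall out of the unrolled sum directly. Here is a toy numpy sketch (random unit-scale vectors stand in for layer outputs; nothing here is from either paper): the final hidden state is exactly the equal-weight sum of the embedding and every update, and its norm grows with depth while each individual contribution stays unit-scale, so early signals shrink as a fraction of the stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 24

def layer_update(h, rng):
    # Stand-in for a Transformer block's output: a random,
    # roughly unit-norm update (purely illustrative).
    return rng.standard_normal(d) / np.sqrt(d)

h = rng.standard_normal(d) / np.sqrt(d)  # token embedding
contributions = [h.copy()]
for _ in range(n_layers):
    u = layer_update(h, rng)
    contributions.append(u)
    h = h + u  # standard residual: add with fixed weight 1

# Unrolled view: the final state is the equal-weight sum of the
# embedding and every layer's update.
assert np.allclose(h, np.sum(contributions, axis=0))

# The total norm grows with depth, so any single early update
# becomes a smaller and smaller fraction of the stream.
assert np.linalg.norm(h) > 2 * np.linalg.norm(contributions[1])
```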
In other words, in standard Transformers, each attention head only looks across the sequence dimension: for a token at each layer, it attends to past tokens in the same layer. Depth is handled separately through residual accumulation, which compresses everything into a single state.
So depth is treated very naively. Over time, all earlier information gets blended into a single vector (a big sum of everything rather than a selective combination), and you lose control over what exactly is inside it. This highlights a fundamental issue: the lack of selectivity across depth in Transformers.
Interestingly, Transformers became flexible at the sequence level thanks to attention. Instead of treating all tokens equally, attention adaptively weights token-to-token interactions and combines them based on relevance. Along the depth dimension, however, aggregation remains mostly fixed and uniform.
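For contrast, here is what that sequence-level flexibility looks like as a minimal numpy sketch of scaled dot-product attention (illustrative shapes and random values, not any particular model): the token-to-token weights are relevance-dependent and sum to one, whereas depth aggregation in a standard residual stream assigns every layer a fixed weight of 1.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 6, 16
q = rng.standard_normal((seq_len, d))
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))

# Sequence axis: each token adaptively weights every token
# by relevance before combining their values.
weights = softmax(q @ k.T / np.sqrt(d))   # (seq_len, seq_len)
out = weights @ v

assert np.allclose(weights.sum(axis=-1), 1.0)
# The depth axis in a standard Transformer has no analogue of
# `weights`: every layer output is added with weight exactly 1.
```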
And this is what changes now. Modern research digs into this problem, showing that depth is no longer just a path the model follows. Depth is becoming a space the model can query and search over with attention, the same way it already searches over tokens. This opens up broader possibilities: the model can decide which earlier layers matter for this token, at this moment, in this context.
So what if we could take the same idea that fixed sequence modeling and apply it to depth? This is how Kimi Team started exploring this direction. →
Attention Residuals: Residuals, but Learned via Attention
Kimi Team from Moonshot AI proposed Attention Residuals (AttnRes) to turn the residual stream into an attention problem. It replaces the uniform sum of all previous layer outputs with attention over those outputs, where the weights depend on relevance: h_l = Σ_{i<l} α_{i→l} · v_i, with v_i the output of earlier layer i and α_{i→l} the softmax attention weight that layer l assigns to it.
How does it work?
Technically, this is implemented with a very lightweight mechanism:
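The paper’s exact implementation isn’t reproduced in this excerpt, but based on the formula above, a minimal numpy sketch of the idea might look like the following. The dot-product scoring, the scale factor, and the `query` derived from the current state are all assumptions for illustration; the actual AttnRes parameterization may differ.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_prev = 32, 8

# Hypothetical outputs v_1..v_l of the preceding layers.
layer_outputs = [rng.standard_normal(d) for _ in range(n_prev)]

def attention_residual(query, values, scale):
    # Softmax attention over preceding layer outputs, replacing
    # the fixed unit-weight residual sum (a sketch of the AttnRes
    # idea; the real score parameterization is not shown here).
    scores = np.array([query @ v_i for v_i in values]) / scale
    alphas = softmax(scores)                 # alpha_{i -> l}
    h_l = sum(a * v_i for a, v_i in zip(alphas, values))
    return h_l, alphas

query = rng.standard_normal(d)  # assumed: derived from current state
h_l, alphas = attention_residual(query, layer_outputs, np.sqrt(d))

# Relevance-based mixing weights instead of all-ones accumulation.
assert np.isclose(alphas.sum(), 1.0)
```

The key contrast with a standard residual stream: instead of every past layer contributing with weight 1, the α weights are normalized and context-dependent, so the model can emphasize or suppress individual layers.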