Quick answer: What are Attention Residuals and Mixture-of-Depths Attention?

Attention Residuals and Mixture-of-Depths Attention (MoDA) are two new Transformer techniques that treat depth as something the model can search over, not just pass through. Both start from the same diagnosis: in very deep Transformers, useful early-layer signals get diluted by repeated residual updates, so the model loses explicit control over which intermediate representations to preserve or reuse. Attention Residuals fixes this in the residual stream by learning how much to mix from earlier layers, while MoDA fixes it inside attention by letting heads retrieve keys and values from preceding layers. Together, they point to a broader shift: Transformer depth is becoming an addressable memory dimension, much like sequence already is.

Key concepts in this article

  • Signal dilution in deep Transformers: useful shallow-layer features get washed out as depth increases.

  • Learned depth selection: the model decides which earlier layers matter for the current token and context.

  • Attention Residuals (Kimi Team): replaces fixed residual accumulation with softmax attention over previous layer outputs.

  • Mixture-of-Depths Attention / MoDA (ByteDance Seed): extends attention so a token can retrieve information from earlier layers, not only earlier tokens.

  • Depth as retrieval: the model treats layer history as memory it can query dynamically.

  • Why this matters: better reasoning, stronger feature reuse, and more efficient scaling for deeper Transformers.

For years, Transformers kept getting deeper, but that depth mostly remained a passive stack. Residual connections carried history forward by adding each layer’s output to its input, but that history was not explicitly preserved, selected, or revisited. Attention mechanisms give us a peek into what the model focuses on when making decisions, but they mainly look across tokens, not across layers. These techniques don’t give the model explicit control over which layer representations to preserve or reuse. As models get deeper, useful information from earlier layers gets diluted, degraded, or washed out by repeated updates.

Recently, we’ve started to see a new shift: depth should be treated as something the model can search over, not just pass through. Fixed depth propagation is finally turning into learned depth selection.

Where is this change coming from? Coincidentally (or not), on March 16 alone, two big Chinese labs released closely related papers:

  • Kimi Team shared “Attention Residuals,” which argues that fixed unit-weight residual accumulation causes hidden-state growth and “dilutes each layer’s contribution,” so it replaces that with softmax attention over preceding layer outputs.

  • ByteDance Seed’s “Mixture-of-Depths Attention” paper claims that deep LLMs suffer from “signal degradation,” where informative shallow-layer features are gradually diluted by repeated residual updates, and proposes letting attention heads retrieve keys/values from preceding layers.

Same diagnosis, same philosophical move, but they approach it from slightly different angles.

Never before has depth in Transformers been treated so explicitly as a retrieval problem. This also signals a new wave of evolution for Transformers. They are starting to treat depth the way they already treat sequence: as an addressable dimension.

Today we’ll take a closer look at how this problem in Transformers can be addressed through the lens of two ideas from Kimi Team and ByteDance: what they have in common, how they differ, and why they matter for pushing Transformers forward.

In today’s episode:

  • Why has the “searching over depth” topic come up now?

  • Attention Residuals: Residuals, but Learned via Attention

    • How does it work?

    • Pros and cons of the approach

  • Mixture-of-Depths Attention (MoDA): Turning Depth into an Attention Axis

    • How it works

    • Advantages of MoDA

    • Trade-offs and limitations

  • Conclusion

  • Sources and further reading

Why has the “searching over depth” topic come up now?

For most of their history, transformers treated depth as a fixed pipeline. Each layer updated the representation, passed it forward, and residual connections simply added everything together along the way. It worked, but it also meant that early, useful features were gradually diluted as models got deeper.

The idea that depth should be flexible or adaptive has been explored for years, and by 2023–2025 many researchers had started to notice this problem. Some tried to stabilize deep networks (DeepNet, residual scaling), others made depth more flexible (LayerDrop, early-exit, Mixture-of-Experts routing). A few explored cross-layer connections or feature reuse. All of this hinted at the same idea: not every layer matters equally, and depth shouldn’t be uniform.

However, all of these approaches stopped short of a bigger shift. Yes, depth could be skipped, scaled, or lightly reused, but it wasn’t something the model could actively and dynamically search through. And the problem starts with the basic ideas transformers are built on.

In general, LLMs rely on residual connections, where typically each layer just adds its output to everything that came before. It creates a “gradient highway,” so deep networks train stably. But if you unroll this recursion, you’ll see that each layer is effectively seeing a sum of all previous layer outputs with equal weight. Every past representation contributes with weight = 1, regardless of whether it is useful for the current input. As you go deeper, this leads to two problems:

  • You keep accumulating more and more values, and total magnitude (the size, or norm, of the hidden state vector) grows.

  • Individual layer contributions get diluted. Earlier signals are washed out by later additions, and later layers can’t selectively retrieve what they need.
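
Both effects can be seen in a toy simulation. The sketch below is a minimal illustration, assuming each layer’s output is an independent random vector (a stand-in, not a real Transformer layer); `residual_stream` and its parameters are hypothetical names introduced here.

```python
import numpy as np

def residual_stream(num_layers, dim, seed=0):
    """Toy unit-weight residual stream: h_l = h_{l-1} + f_l, with random f_l."""
    rng = np.random.default_rng(seed)
    # Assumption: layer outputs are i.i.d. unit-variance Gaussian vectors.
    outputs = rng.standard_normal((num_layers, dim))
    hidden = np.cumsum(outputs, axis=0)   # h_l = f_1 + ... + f_l, every weight = 1
    norms = np.linalg.norm(hidden, axis=1)
    # Cosine similarity between each hidden state and the first layer's output.
    first = outputs[0]
    cos_first = (hidden @ first) / (norms * np.linalg.norm(first))
    return norms, cos_first

norms, cos_first = residual_stream(num_layers=64, dim=1024)
print(f"hidden-state norm at layer 1: {norms[0]:.1f}, at layer 64: {norms[-1]:.1f}")
print(f"cosine with layer-1 output at layer 1: {cos_first[0]:.2f}, at layer 64: {cos_first[-1]:.2f}")
```

In this toy setting the norm keeps growing with depth while the first layer’s direction becomes a small fraction of the hidden state, which is the dilution described above.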

In other words, in standard Transformers, each attention head only looks across the sequence dimension: for a token at each layer, it attends to past tokens in the same layer. Depth is handled separately through residual accumulation, which compresses everything into a single state.

So depth is treated very naively. Over time, all earlier information gets blended into a single vector (since it’s just a big sum of everything rather than a selective combination) and you lose control over what exactly is inside it. This really highlights an important fundamental issue: lack of selectivity across depth in Transformers.

Interestingly, Transformers became flexible at the sequence level thanks to attention. Instead of treating all tokens equally, attention adaptively weights token-to-token interactions and combines them based on relevance. Along the depth dimension, however, aggregation remains mostly fixed and uniform.

And this is what changes now. Modern research digs into this problem, showing that depth is no longer just a path the model follows. Depth is becoming a space the model can query and search over with attention, the same way it already searches over tokens. This opens up broader possibilities: the model can decide which earlier layers matter for this token, at this moment, in this context.

So what if we could take the same idea that fixed sequence modeling and apply it to depth? This is how Kimi Team started to explore this field. →

Attention Residuals: Residuals, but Learned via Attention

Kimi Team from Moonshot AI proposed their idea of Attention Residuals (AttnRes) to turn the residual stream into an attention problem. It replaces the uniform sum of all previous layer outputs with attention over all previous layers where weights depend on relevance: $h_l = \sum_{i} \alpha_{i \to l} \cdot v_i$.

How does it work?

Technically, this is implemented with a very lightweight mechanism:

  • Each layer has a learned query vector (used to selectively attend to earlier layer representations), which is compared to previous layer outputs (keys/values).

  • This produces weights $\alpha_{i \to l}$ that depend on the input. The model first computes a score for each earlier layer, then applies softmax to turn those scores into weights that change for each example, sum to 1, and determine how much each earlier layer contributes.

  • A normalization step (RMSNorm) ensures that layers with large activations don’t dominate just because of scale.

Conceptually, if standard residuals behave like fixed linear attention over depth, AttnRes turns this into adaptive softmax attention over depth. Depth is no longer a chain of overwrites. It becomes a set of representations you can query.
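
The three ingredients above (a learned query, softmax weights, and RMSNorm) can be put together in a few lines. This is a minimal single-token sketch, not the paper’s implementation: it assumes RMSNorm is applied to the keys without a learned gain, that the raw layer outputs serve as values, and that `attn_res`, `rms_norm`, and `query` are illustrative names.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each vector to unit root-mean-square (no learned gain in this sketch)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def attn_res(prev_outputs, query):
    """Mix earlier layer outputs for one token with softmax weights over depth.

    prev_outputs: (num_prev_layers, dim) array, the values v_i.
    query: (dim,) per-layer query vector (hypothetical name).
    """
    keys = rms_norm(prev_outputs)        # normalize so scale alone cannot dominate
    scores = keys @ query                # one score per earlier layer
    alpha = softmax(scores)              # weights alpha_{i->l}, summing to 1
    return alpha @ prev_outputs, alpha   # h_l = sum_i alpha_{i->l} * v_i

rng = np.random.default_rng(0)
prev = rng.standard_normal((5, 16))
prev[2] *= 100.0                         # one layer with very large activations
q = rng.standard_normal(16) / 4.0
h, alpha = attn_res(prev, q)
print(alpha.round(3), h.shape)
```

Because the keys are normalized, rescaling one layer’s activations leaves the weights essentially unchanged; only the content of each layer output, compared against the query, decides how much it contributes.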

At the level of the entire model, the full version of AttnRes requires attending over all previous layers, which becomes expensive at scale, not because of computation but because of storing and moving all the intermediate states across devices. Kimi Team found a practical drop-in replacement for attending to every layer individually, with minimal overhead, and called it Block AttnRes. As the name suggests, it groups layers into blocks:

  • Inside a block, information is still summed like standard residuals.

  • Across blocks, attention is applied over block-level summaries.

Image Credit: Attention Residuals original paper

Block AttnRes combines this concept with several important engineering decisions for better scalability:

  • Two-phase computation: Instead of recomputing attention over all previous blocks at every layer, Block AttnRes computes inter-block attention once per block in parallel, then handles intra-block updates sequentially, and merges the two using online softmax.

  • Cross-stage caching in pipeline parallelism: The system caches previously seen blocks locally (each device caches what it already received), and only transmits new (incremental) blocks like [b1, b2] in the image below.

Image Credit: Attention Residuals original paper

  • For long sequences at inference, Block AttnRes uses sequence sharding + chunked prefilling to store block representations efficiently.
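
The “merge using online softmax” step in the two-phase computation relies on a general property: two partial softmax results can be combined exactly if each carries its running maximum and normalizer. The sketch below illustrates that property in isolation, with random scores standing in for the inter-block and intra-block parts; `partial_attention` and `merge` are illustrative names, not functions from the paper’s code.

```python
import numpy as np

def partial_attention(scores, values):
    """Softmax-weighted sum over one group of sources, plus statistics for later merging."""
    m = scores.max()                     # max score, kept for numerical stability
    w = np.exp(scores - m)
    return m, w.sum(), w @ values        # (max, normalizer, unnormalized output)

def merge(part_a, part_b):
    """Combine two partial results as if softmax had been taken over both groups at once."""
    m_a, s_a, o_a = part_a
    m_b, s_b, o_b = part_b
    m = max(m_a, m_b)
    scale_a, scale_b = np.exp(m_a - m), np.exp(m_b - m)
    return m, s_a * scale_a + s_b * scale_b, o_a * scale_a + o_b * scale_b

rng = np.random.default_rng(0)
inter_scores, inter_values = rng.standard_normal(4), rng.standard_normal((4, 8))  # earlier blocks
intra_scores, intra_values = rng.standard_normal(1), rng.standard_normal((1, 8))  # current block

_, s, o = merge(partial_attention(inter_scores, inter_values),
                partial_attention(intra_scores, intra_values))
merged = o / s                           # normalize once, at the end

# Reference: a single softmax over all five sources at once.
all_scores = np.concatenate([inter_scores, intra_scores])
all_values = np.concatenate([inter_values, intra_values])
weights = np.exp(all_scores - all_scores.max())
reference = (weights / weights.sum()) @ all_values
print(np.allclose(merged, reference))
```

This is why the inter-block part can be computed once in a batch and still be combined later with the sequential intra-block part without changing the result.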

Pros and cons of the approach

This leads to many interesting advantages for Transformers:

  • Hidden-state magnitudes stop growing with depth. This means no more uncontrolled accumulation.

  • Gradients become more evenly distributed across layers.

  • Layer contributions become meaningful instead of washed out.

  • Two-phase computation amortizes the cost of attention, turning repeated reads into a single batched operation.

  • Cross-stage caching reduces communication from quadratic-like growth to something much closer to linear in practice.

  • Block AttnRes shows about a 1.25× compute advantage.

Image Credit: Attention Residuals original paper

  • The authors have already tested Block AttnRes in large-scale training (a model with 48B total / 3B activated parameters trained on 1.4T tokens), which showed that it translates into consistent gains across tasks and compute budgets.

  • The biggest gains are on reasoning-heavy tasks: GPQA-Diamond 36.9 → 44.4 (+7.5), Math 53.5 → 57.1 (+3.6), HumanEval 59.1 → 62.2 (+3.1).

  • Efficiency-wise, Block AttnRes cuts residual-mechanism memory I/O to 5.5d per layer vs. 34d for mHC (Manifold-Constrained Hyper-Connections) from DeepSeek, with reported overhead of <4% in training and <2% in inference latency.

Image Credit: Attention Residuals original paper

Overall, AttnRes is essentially applying the same idea that made attention powerful over tokens, but along the depth dimension of the network. Depth becomes an attention-accessible memory. But this approach also has some trade-offs:

  • More engineering complexity: Block AttnRes requires extra components like RMSNorm on keys, block partitioning, cross-stage caching and two-phase computation.

  • Distributed training complexity: Under pipeline parallelism, the stored representations must be passed across stages, so communication becomes a real systems bottleneck.

  • This approach is also more sensitive to design choices. Ablations show performance drops without RMSNorm, with sigmoid instead of softmax, or with input-independent mixing.

  • It tends to favor deeper, narrower models, which can improve learning but may worsen inference latency because depth is still sequential.

Another fascinating approach we’ll look at comes from ByteDance Seed. It takes another angle, tackling the lack of selectivity across depth in Transformers with attention-based depth mixing. →

Mixture-of-Depths Attention (MoDA): Turning Depth into an Attention Axis

Mixture-of-Depths Attention (MoDA), a new development from ByteDance Seed, is built on the same core intuition as AttnRes, but implemented inside attention rather than in the residual path. MoDA extends attention to also operate over depth.

How it works

For a given token, the query now attends not only to sequence key-value (KV) pairs at the current layer, but also to depth KV pairs coming from all previous layers at the same token position.

MoDA merges the depth retrieval with normal sequence attention into one unified attention operator. Technically, each attention head computes a mixture over these two sources, normalized under one shared softmax that allocates probability mass across both sequence and depth memories. This means the model can dynamically decide whether to use contextual information (sequence) or retrieve earlier features from shallow layers (depth). That is why the paper calls it a mixture of sequence and depth.
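
A shared softmax over two sources amounts to concatenating their keys and values before normalizing. The following single-query sketch shows that idea under stated assumptions: standard scaled dot-product logits, no masking beyond what the inputs already encode, and random tensors in place of real activations. `moda_head` and its argument names are hypothetical, not the paper’s API.

```python
import numpy as np

def moda_head(q, seq_k, seq_v, depth_k, depth_v):
    """One query attends to sequence KV and depth KV under a single softmax.

    q: (d,) query for the current token at the current layer.
    seq_k, seq_v: (t, d) keys/values of visible tokens at the current layer.
    depth_k, depth_v: (l, d) keys/values of the same token from preceding layers.
    """
    d = q.shape[0]
    keys = np.concatenate([seq_k, depth_k])      # (t + l, d)
    values = np.concatenate([seq_v, depth_v])
    logits = keys @ q / np.sqrt(d)               # assumption: scaled dot-product logits
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                     # one normalization across both sources
    seq_mass = weights[: len(seq_k)].sum()       # probability spent on the sequence
    return weights @ values, seq_mass, 1.0 - seq_mass

rng = np.random.default_rng(0)
d, t, l = 8, 6, 3
out, seq_mass, depth_mass = moda_head(
    rng.standard_normal(d),
    rng.standard_normal((t, d)), rng.standard_normal((t, d)),
    rng.standard_normal((l, d)), rng.standard_normal((l, d)),
)
print(out.shape, round(seq_mass, 3), round(depth_mass, 3))
```

The two masses always sum to 1, so attention spent on earlier layers is taken directly from attention spent on other tokens, and the split can differ for every query.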

Image Credit: Mixture-of-Depths Attention original paper

One of the main design aspects is that MoDA is implemented in a hardware-aware way. It fuses both sequence and depth into one attention pass, reusing the same online softmax state, so logits from sequence KV and depth KV contribute to one final softmax without materializing intermediate tensors.

The researchers also redesign the depth-KV layout:

  • A flattened layout stores per-token depth states contiguously. Each token has a block of size L, where L is the number of layers. Across a sequence of length T, this forms a T × L cache.

  • Then a chunk-aware layout ensures each query chunk only loads its relevant depth-KV region. So instead of scanning the full T × L, it focuses on a smaller chunk of size C × L.

  • Finally, a group-aware (GQA-based) optimization lets multiple queries share the same depth-KV block, effectively reducing this to (C / G) × L per chunk, and so reducing memory traffic and improving utilization.
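
To get a feel for how much each layout step shrinks the region a query chunk has to read, the sizes above can be plugged into a small helper. The function name and the example sizes are hypothetical and chosen only for illustration; the formulas are the T × L, C × L, and (C / G) × L expressions from the list.

```python
def depth_kv_entries(seq_len, num_layers, chunk_size, group_size):
    """Count depth-KV entries read per query chunk under each layout."""
    flattened = seq_len * num_layers                        # full T x L cache
    chunk_aware = chunk_size * num_layers                   # C x L region for one chunk
    group_aware = (chunk_size // group_size) * num_layers   # (C / G) x L with shared blocks
    return flattened, chunk_aware, group_aware

# Hypothetical sizes, not taken from the paper.
flat, chunk, group = depth_kv_entries(seq_len=4096, num_layers=32, chunk_size=128, group_size=4)
print(flat, chunk, group)  # prints 131072 4096 1024
```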

Image Credit: Mixture-of-Depths Attention original paper

From the algorithm side, the forward pass therefore has three coordinated parts.

  • First, it tiles queries, keys, and values into hardware-friendly blocks and keeps hot data in on-chip SRAM rather than HBM as much as possible.

  • Second, it runs the normal sequence-attention blocks, updating online softmax statistics.

  • Third, it runs the depth-attention blocks over the corresponding depth-KV region, applying a mask that enforces “same token position across layers,” and merges those logits into the same online softmax state. Only at the end is the output normalized once.

Image Credit: Mixture-of-Depths Attention original paper

This fusion is the reason MoDA stays close to FlashAttention efficiency despite having an extra depth path, and the performance results prove its competitiveness. →

Advantages of MoDA

  • Conceptually, MoDA directly addresses information dilution. You don’t need to hope useful early-layer signals survive through residual accumulation because the model can explicitly retrieve them when needed.

  • Unified attention over sequence + depth provides better compositional reasoning, combining context and intermediate states in one softmax.

  • As sequence length grows, the relative overhead of the added depth path shrinks from 25.86% at 4K to 2.73% at 64K, because the sequence-attention workload dominates and the depth cost is amortized.

  • MoDA improves downstream performance by about +1.8–2.1 points on average when scaling from 700M to 1.5B models.

  • It also reduces perplexity, for example from 13.67 to 13.47 at 1.5B scale.
    The gains are especially strong on reasoning tasks, such as WinoGrande (+4.9) and COPA (+5.0).

    Image Credit: Mixture-of-Depths Attention original paper

  • These improvements come at a very low cost (~+3.7% FLOPs).

  • Thanks to the added chunk-aware + GQA-aware kernels, MoDA reaches 97.3% of FlashAttention-2 efficiency, despite the added flexibility. Plus, it achieves up to ~1450× speedup over naive implementations.

In general, compared with residual accumulation, MoDA makes depth retrieval dynamic, which makes depth behave more like an explicit memory that attention can query. Compared with dense cross-layer connections, it keeps the cost manageable. And compared with a naive “attention over layers” formulation, it is also built to be GPU-friendly at long context, with moderate overhead. But as always there are →

Trade-offs and limitations

  • Like the Block AttnRes approach, MoDA introduces additional engineering complexity. Here it comes from the need for custom kernels (chunking, grouping, fused softmax), so it is not plug-and-play in general.

  • Memory overhead grows with depth, and compute still scales with depth in the attention part.

  • Depth-KV cache management becomes a bottleneck, which needs special decisions like slot caching (fixing a number of stored depth KV states).

Conclusion

So what do we have? The approaches discussed today explore transformer depth from two different perspectives:

  • Attention Residuals modifies the residual pathway. Its main idea is: When combining representations across layers, why should every earlier layer count equally? So it learns input-dependent weights over earlier layer outputs.

  • Mixture-of-Depths Attention modifies the attention mechanism itself. Each attention head can attend not only to the current layer’s sequence KV pairs, but also to depth KV pairs from preceding layers.

To put it more simply, Attention Residuals are about making the residual stream depth-aware, while MoDA makes the attention heads depth-aware.

As they target different parts of the computation graph, in principle, they can be combined in one model, where AttnRes is responsible for how layers aggregate representations across depth and MoDA defines how layers retrieve information across layers.

Globally, this shift allows Transformers to fully turn into 2D systems with depth as a first-class controllable axis, like time already is. And it also opens up possibilities that aren’t immediately obvious:

Firstly, selective reuse of intermediate representations makes it possible to recover early features, which is especially useful for multi-step reasoning.

Secondly, long-range computation across layers enables hierarchical reasoning chains. Layers start to behave less like a chain and more like a memory system, which broadens the entire idea of Transformers: they are evolving into architectures that can store, retrieve, and recombine information across layers, not just across sequences of tokens.

Sources and further reading

  • Attention Residuals | paper | GitHub

  • Mixture-of-Depths Attention | paper | GitHub

  • Attention Is All You Need | paper

  • Training Very Deep Networks | paper

  • Deep Residual Learning for Image Recognition | paper
