For most of deep learning’s history, memory was treated as something to be baked into the model. If a system needed to know more, we gave it more parameters. If it needed to remember longer, we extended the context window. If it forgot, we retrained.
That approach worked astonishingly well. It also locked us into a very specific assumption: that intelligence scales by touching everything, every time.
We are now watching that assumption change.
A recent paper on Conditional Memory via Scalable Lookup from DeepSeek and Peking University does not introduce a better model or a new benchmark record (which they are also good at). What they came up with is a different organizing principle. They call it Engram: a conditional memory module that treats memory as something the model chooses to access.
That choice turns out to matter more than it first appears. I’ll call this shift selective intelligence.
In today’s episode, we will cover:
Why memory inside LLMs is starting to crack
Four kinds of “memory” we keep confusing
Why this is about routing, not retrieval
How Engram works: architecture and technical design
The U-Shaped allocation law: a fundamental discovery
Why Engram improves reasoning, not just memorization
Long-context performance: structural advantages
System efficiency: decoupling compute and memory
Large-scale pre-training: empirical validation
Not without limitations
Conclusion: selective intelligence as an architectural principle
Sources and further reading
Why memory inside LLMs is starting to crack
Large language models store knowledge implicitly. Facts, patterns, abstractions, and behaviors are distributed across billions of parameters. When a prompt arrives, attention mechanisms sweep across a dense internal space, activating everything that might be relevant. This design has two structural consequences:
First, recall is expensive. Every token pays the cost of dense computation, even when only a small fraction of the model's knowledge is actually needed. Processing the phrase "Alexander the Great" requires multiple early layers to recognize this as a single composite entity, not three unrelated words. The model essentially reconstructs a static lookup table at runtime, consuming sequential depth that could be allocated to higher-level reasoning.
Second, memory is passive. The model cannot decide not to think about something. It can only down-weight it through attention scores. As models scale, both problems compound. Longer contexts increase cost quadratically in standard attention mechanisms. Larger models increase latency and energy usage. And attempts to fix memory by brute force (more parameters, longer contexts) begin to resemble the same strategy that created the problem.
This is the backdrop against which conditional memory becomes interesting.
Four kinds of “memory” we keep confusing
Before getting into the technique itself, it helps to separate four ideas that are often collapsed into one.

Image Credit: created with ChatGPT
Parametric memory represents the traditional approach. Knowledge is encoded in weight matrices and recalled via dense forward passes through every layer. This is powerful – the model can perform implicit reasoning over its entire knowledge base – but it is expensive to update (requiring retraining) and expensive to access (requiring full forward passes). Crucially, the model has no mechanism to retrieve specific knowledge without activating large portions of its parameter space.
Context window expansion extends attention over more tokens. Everything stays inside the Transformer, and the model attends over increasingly long sequences. This is powerful for incorporating new information at inference time, but complexity scales quadratically and the approach is indiscriminate. The model still processes everything in the context window, whether it matters or not. There is no selectivity, only inclusion.
External retrieval (RAG) externalizes memory entirely. Documents are retrieved from outside the model and injected into the prompt. This adds scale – you can query arbitrary databases – but the interface is brittle. Retrieval happens before reasoning, and the model has limited control over what comes back. The retrieval system does not benefit from the model's internal representations, and the model cannot iterate on what to retrieve based on partial reasoning.
Conditional memory introduces a fourth paradigm: learned sparse lookup. Memory lives in structured slots, the model decides whether to query them during the forward pass, and retrieval is internal to inference. The key innovation is not where the memory lives, but who controls access to it. The model learns when memory is worth consulting. How cool is that?
This creates a new axis alongside parameter sparsity (Mixture-of-Experts), activation sparsity (pruned neurons), and context sparsity (selective attention). And unlike RAG, the lookup mechanism is trained end-to-end. The model is not handed memory. It learns when and how to use it.
Why this is about routing, not retrieval
The most productive way to understand conditional memory is to see it as a routing problem.
Modern AI systems already route:
tokens to parameters in MoE models
queries to tools in agentic systems
tasks to specialized submodels
Conditional memory adds another decision point. The model decides whether a piece of knowledge should be pulled into the computation at all.

Image Credit: A conceptual routing schematic diagram, created with ChatGPT
The storage medium matters less than the decision to use it. This is where selective intelligence comes in: systems are increasingly built to spend compute only where it changes the outcome.
This same pattern shows up across the stack.
How Engram works: architecture and technical design
The Engram module adapts classic N-gram embeddings to modern deep learning. An N-gram is simply a sequence of N consecutive tokens. What DeepSeek discovered is that, with the right architectural adaptations, this decades-old technique becomes an ideal complement to contemporary Mixture-of-Experts models.

Image Credit: The original paper
The architecture operates in two phases: sparse retrieval and context-aware fusion.
Sparse retrieval via hashed N-grams. The system uses 2-grams and 3-grams as keys to address memory. For a given token position, the model extracts the suffix context – the two or three tokens preceding the current position – and uses this as a lookup key. But tokenizers like BPE or SentencePiece assign disjoint IDs to semantically equivalent terms (Apple vs. ␣apple, for instance). To maximize semantic density, Engram implements a vocabulary projection layer that compresses the raw 128,000-token vocabulary by 23 percent using normalized text equivalence: NFKC normalization, lowercasing, and other canonical transformations. This maps raw token IDs to canonical identifiers before hashing.
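To make the vocabulary projection concrete, here is a minimal sketch of how raw token IDs could be collapsed onto canonical IDs. The exact rule set Engram uses is not spelled out above beyond NFKC normalization and lowercasing, so the whitespace handling, the function names `canonical_form` and `build_projection`, and the toy vocabulary are all assumptions made for illustration.

```python
import unicodedata


def canonical_form(token_text: str) -> str:
    """Map a token's surface string to a canonical key (assumed rule set)."""
    text = unicodedata.normalize("NFKC", token_text)  # fold compatibility forms
    text = text.lower()                               # ignore case
    # Assumption: a leading-space marker or plain space does not change identity.
    return text.replace("\u2581", " ").strip()


def build_projection(vocab: list[str]) -> tuple[list[int], int]:
    """Return (raw_id -> canonical_id table, number of canonical ids)."""
    canonical_ids: dict[str, int] = {}
    projection = []
    for token_text in vocab:
        key = canonical_form(token_text)
        projection.append(canonical_ids.setdefault(key, len(canonical_ids)))
    return projection, len(canonical_ids)


# Toy vocabulary: the fourth entry is "APPLE" written in full-width characters.
vocab = ["Apple", " apple", "apple", "\uff21\uff30\uff30\uff2c\uff25",
         "bank", " Bank", "river"]
projection, n_canonical = build_projection(vocab)
print(projection, n_canonical)
```

In this toy case seven raw tokens collapse onto three canonical IDs, which is the same kind of compression the paper reports at vocabulary scale.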
Directly parameterizing all possible N-grams is intractable – the combinatorial space explodes. Instead, Engram employs multi-head hashing. For each N-gram order (2-grams and 3-grams), the system uses K distinct hash functions. Each hash head maps the compressed context to an index within a massive embedding table via a lightweight multiplicative-XOR hash. This deterministic function ensures O(1) lookup time. Using multiple hash heads mitigates collisions: if one hash accidentally maps "neural network" and "network failure" to the same slot, other heads will likely place them in different slots. The final memory vector concatenates all retrieved embeddings from all heads and N-gram orders.
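The multi-head lookup can be sketched in a few lines. This is not the paper's hash function: the multiplier constants, the xor-then-multiply order, the table size, and the names `ngram_hash` and `lookup_slots` are assumptions chosen to show the shape of the idea, namely several cheap deterministic hashes per N-gram order, each producing one slot index.

```python
MASK64 = (1 << 64) - 1

# Hypothetical per-head constants: any distinct odd 64-bit multipliers would do.
HEAD_MULTIPLIERS = (
    0x9E3779B97F4A7C15,
    0xC2B2AE3D27D4EB4F,
    0x165667B19E3779F9,
    0xD6E8FEB86659FD93,
)


def ngram_hash(ngram, multiplier, table_size):
    """Fold canonical token ids into one slot index with xor-then-multiply steps."""
    state = 0
    for token_id in ngram:
        state = ((state ^ token_id) * multiplier) & MASK64
    return state % table_size


def lookup_slots(window, orders=(2, 3), table_size=1 << 20):
    """Return {(order, head): slot} for the trailing N-grams of `window`."""
    slots = {}
    for order in orders:
        if len(window) < order:
            continue  # not enough tokens yet for this N-gram order
        ngram = tuple(window[-order:])
        for head, multiplier in enumerate(HEAD_MULTIPLIERS):
            slots[(order, head)] = ngram_hash(ngram, multiplier, table_size)
    return slots


slots = lookup_slots([17, 42, 99])  # canonical ids of the lookup context
print(len(slots))                   # one slot per (order, head) pair
```

Each slot index would then address one row of an embedding table, and the rows from all heads and orders would be concatenated into the memory vector. Because the function only reads token IDs, the cost per lookup is constant regardless of table size.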
Context-aware gating. The retrieved embeddings are context-independent priors. Being static, they cannot adapt to polysemy or resolve hash collisions based on surrounding context. To add contextual sensitivity, Engram employs a gating mechanism inspired by attention. The current hidden state – which has already aggregated global context via preceding attention layers – acts as a dynamic Query. The retrieved memory serves as the source for both Key and Value projections. The model computes a scalar gate α ∈ (0,1) by measuring the alignment between the normalized Query and Key vectors. If the retrieved memory contradicts the current context – if "bank" was retrieved but the sentence is about rivers, not finance – the gate approaches zero, suppressing the noise. If the memory is relevant, the gate approaches one, and the retrieved information is fused into the residual stream.
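The gate described above can be sketched with plain Python lists. The RMS normalization, the scaling by the square root of the dimension, and the sigmoid squashing are assumptions made to get a scalar in (0, 1); the projection matrices `w_key` and `w_value` are identity placeholders, not learned weights.

```python
import math


def rms_normalize(vec):
    """Scale a vector to unit root-mean-square (small epsilon for stability)."""
    scale = math.sqrt(sum(x * x for x in vec) / len(vec) + 1e-6)
    return [x / scale for x in vec]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def gated_memory(hidden, memory, w_key, w_value):
    """Return (gate, gated value) for one token position."""
    query = rms_normalize(hidden)                 # context-aware hidden state
    key = rms_normalize(matvec(w_key, memory))    # static retrieved memory
    value = matvec(w_value, memory)
    score = sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
    gate = 1.0 / (1.0 + math.exp(-score))         # assumed sigmoid squashing
    return gate, [gate * v for v in value]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
hidden = [1.0, 0.0, 0.0, 0.0]

gate_aligned, _ = gated_memory(hidden, [1.0, 0.0, 0.0, 0.0], identity, identity)
gate_opposed, _ = gated_memory(hidden, [-1.0, 0.0, 0.0, 0.0], identity, identity)
print(gate_aligned > 0.5, gate_opposed < 0.5)
```

When the retrieved memory points the same way as the hidden state, the gate opens; when it points the opposite way, the gate closes and the memory contributes little to the residual stream.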
Finally, a short depthwise causal convolution with kernel size 4 and dilation equal to the maximum N-gram order expands the receptive field and adds non-linearity. This refined representation is added to the hidden state via a residual connection, followed by standard Attention and MoE layers.
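A rough sketch of the convolution step shows why the dilation widens the receptive field. The function name and the all-ones kernel are illustrative assumptions, and the non-linearity and residual addition are left out to keep the causal, per-channel structure visible.

```python
def depthwise_causal_conv(seq, kernels, dilation):
    """Per-channel causal convolution.

    seq:     list of T vectors, each with C channels
    kernels: C lists of K taps; tap k looks back k * dilation positions
    """
    n_steps, n_channels, n_taps = len(seq), len(seq[0]), len(kernels[0])
    out = []
    for t in range(n_steps):
        row = []
        for c in range(n_channels):
            acc = 0.0
            for k in range(n_taps):
                src = t - k * dilation
                if src >= 0:  # causal: never read from future positions
                    acc += kernels[c][k] * seq[src][c]
            row.append(acc)
        out.append(row)
    return out


# Impulse at t = 0, kernel size 4, dilation 3 (the maximum N-gram order).
impulse = [[1.0]] + [[0.0] for _ in range(9)]
response = depthwise_causal_conv(impulse, kernels=[[1.0, 1.0, 1.0, 1.0]], dilation=3)
print([row[0] for row in response])
```

The impulse shows up at offsets 0, 3, 6, and 9, so four taps with dilation 3 cover a span of ten positions while each channel is still processed independently.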
Crucially, Engram is not applied to every layer. The specific placement is governed by both modeling considerations and system-level latency constraints, which we will return to shortly.
Integration with multi-branch architecture. Rather than standard single-stream residual connections, DeepSeek uses an advanced multi-branch architecture (Manifold-Constrained Hyper-Connections, or mHC) that expands the residual stream into four parallel branches with learnable connection weights. Engram adapts to this by sharing a single sparse embedding table and Value projection matrix across all branches, while using four distinct Key projection matrices to enable branch-specific gating behaviors. This allows the linear projections to be fused into a single dense FP8 matrix multiplication, maximizing GPU compute utilization.
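The fusion point is easier to see in code. The sketch below stacks several branch-specific Key matrices into one tall matrix so that a single matrix-vector product yields every branch's key at once. The shapes, integer weights, and the name `fused_branch_keys` are illustrative assumptions, and FP8 arithmetic is not modeled.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def fused_branch_keys(key_matrices, memory):
    """Compute one key per branch with a single stacked projection."""
    stacked = [row for matrix in key_matrices for row in matrix]  # (B * d) rows
    flat = matvec(stacked, memory)                                # one matmul
    d = len(key_matrices[0])
    return [flat[i * d:(i + 1) * d] for i in range(len(key_matrices))]


# Four hypothetical branch-specific Key projections (2 x 3 each) over a shared memory vector.
key_matrices = [
    [[1, 0, 0], [0, 1, 0]],
    [[0, 0, 1], [1, 1, 0]],
    [[2, 0, 1], [0, 3, 0]],
    [[1, 1, 1], [0, 0, 2]],
]
memory = [3, 5, 7]

keys = fused_branch_keys(key_matrices, memory)
print(keys)
```

The result matches what four separate projections would produce. The Value projection needs no such stacking because it is shared across branches.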
The U-Shaped allocation law: a fundamental discovery
DeepSeek's most important contribution is not just Engram itself. It is the empirical discovery of how to optimally allocate sparse capacity between neural computation and static memory.
The team formulated the Sparsity Allocation problem: Given a fixed total parameter budget and a fixed computational budget (activated parameters per token), what fraction of the inactive parameter budget should go to MoE experts versus Engram memory?
They defined an allocation ratio ρ ∈ [0,1] as the fraction of inactive parameters assigned to MoE expert capacity. When ρ = 1, you have a pure MoE model – all inactive parameters are routed experts. When ρ < 1, you reduce the number of routed experts and reallocate the freed parameters to Engram embedding slots.
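As a quick worked example of the definition, the helper below splits an inactive parameter budget according to ρ. The budget figures are hypothetical round numbers picked for easy arithmetic, not configurations from the paper.

```python
def split_sparse_budget(total_params, active_params, rho):
    """Split the inactive parameter budget between MoE experts and Engram slots."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    inactive = total_params - active_params
    moe_expert_params = rho * inactive            # stays as routed experts
    engram_params = inactive - moe_expert_params  # reallocated to memory slots
    return moe_expert_params, engram_params


# Hypothetical budget: 10B total parameters, 1B activated per token.
moe, engram = split_sparse_budget(10e9, 1e9, rho=0.75)
print(f"MoE experts: {moe / 1e9:.2f}B, Engram memory: {engram / 1e9:.2f}B")
```

With ρ = 1 the second number is zero and the model is a pure MoE. Lowering ρ moves capacity into memory without touching the activated parameters per token.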
The experiments revealed a robust U-shaped relationship between validation loss and ρ across two compute regimes. In the 9.9B parameter regime, pure MoE achieved a validation loss of 1.7248. The optimal allocation (ρ ≈ 75-80 percent) improved this to 1.7109 – a gain of 0.0139 points from pure architectural reallocation without any additional compute. The U-shape was stable: the location of the optimum did not drift significantly across scales.
This curve encodes a fundamental architectural principle. MoE-dominated models (ρ → 100 percent) lack dedicated memory for static patterns, forcing them to inefficiently reconstruct these patterns through depth and computation. Engram-dominated models (ρ → 0 percent) lose conditional computation capacity, hurting tasks that require dynamic, context-dependent reasoning. Memory cannot replace computation in this regime. The optimum lies in the middle, where each primitive handles what it does best.
The stability of this optimum across scales suggests we are observing a property of linguistic structure, not an artifact of training. Language genuinely requires both static retrieval (entities, formulaic expressions, common collocations) and dynamic reasoning (compositional semantics, pragmatic inference, world modeling). Models perform best when architecture reflects this duality.
Additionally, when memory capacity is scaled aggressively while holding computation fixed, validation loss improves linearly with the logarithm of memory size. This demonstrates that memory and computation are independent scaling axes, each following different laws – and both are necessary.
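One way to read "linear in the logarithm of memory size" is that every doubling of memory buys the same absolute drop in loss. The sketch below illustrates that property. The intercept and slope are placeholder values for illustration only and are not fitted to, or taken from, the paper's results.

```python
import math


def loss_at(memory_slots, intercept, slope):
    """Log-linear trend: loss = intercept - slope * ln(memory size)."""
    return intercept - slope * math.log(memory_slots)


# Placeholder constants, NOT values reported in the paper.
INTERCEPT, SLOPE = 2.0, 0.01

for slots in (1_000_000, 2_000_000, 4_000_000):
    gain = loss_at(slots, INTERCEPT, SLOPE) - loss_at(2 * slots, INTERCEPT, SLOPE)
    print(slots, round(gain, 6))  # same gain for every doubling: SLOPE * ln(2)
```

Under such a trend, returns never vanish but each fixed improvement costs twice the memory of the previous one, which is why cheap host memory matters for this scaling axis.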
Why Engram improves reasoning, not just memorization
The gains in reasoning and code domains demand explanation. Memory is supposed to help with facts, not math. DeepSeek conducted mechanistic analyses using LogitLens and Centered Kernel Alignment to understand what Engram changes inside the model.
They noticed three effects:
Effective depth increase. By offloading static pattern reconstruction from early layers, Engram essentially deepens the network. Early layers no longer waste capacity on trivial lookup tasks. They can immediately begin compositional reasoning. The model has more sequential depth available for the hard parts of the problem. This is why reasoning benchmarks improve: the model has more layers to think with.
Attention capacity liberation. Local dependencies – 2-gram and 3-gram patterns – are delegated to O(1) lookups. This frees attention mechanisms to focus on global context and long-range dependencies. Attention does not need to spend capacity recognizing that "Alexander the Great" is a single entity; that lookup happened in parallel. Attention can focus on relationships across sentences, coreference resolution, and pragmatic inference.
Zipfian access patterns enable caching. Natural language N-grams follow a Zipfian distribution: a small fraction of patterns accounts for the vast majority of accesses. This enables multi-level cache hierarchies: frequently accessed embeddings stay in fast GPU HBM, moderately common patterns reside in host DRAM, and the long tail of rare patterns can be offloaded to slower storage. Because access is deterministic and predictable, the system can prefetch intelligently.
These mechanisms explain why Engram improves reasoning tasks. Memory doesn’t replace reasoning – memory removes low-level reconstruction work, allowing the model to allocate more depth and attention capacity to reasoning.
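The caching argument from the third effect can be checked with a back-of-the-envelope calculation. The sketch assumes access frequency falls off as 1/rank (a Zipf exponent of 1) over 100,000 slots, and the tier sizes of 1 percent and 9 percent are arbitrary choices. None of these figures come from the paper.

```python
def zipf_weights(n_items, exponent=1.0):
    """Unnormalized access frequency for ranks 1..n_items."""
    return [1.0 / rank ** exponent for rank in range(1, n_items + 1)]


def tier_shares(weights, tier_fractions):
    """Share of accesses landing in each tier, hottest slots first."""
    total = sum(weights)
    shares, start = [], 0
    for fraction in tier_fractions:
        end = start + round(len(weights) * fraction)
        shares.append(sum(weights[start:end]) / total)
        start = end
    shares.append(sum(weights[start:]) / total)  # remaining long tail
    return shares


weights = zipf_weights(100_000)
hbm, dram, storage = tier_shares(weights, (0.01, 0.09))
print(f"HBM: {hbm:.2f}, DRAM: {dram:.2f}, storage: {storage:.2f}")
```

Under these assumptions the hottest 1 percent of slots serves a bit over 60 percent of all accesses, which is the property a multi-level cache hierarchy exploits.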
Long-context performance: structural advantages
By delegating local dependencies to lookups, Engram preserves attention capacity for managing global context. To test this, DeepSeek conducted long-context extension training: after pre-training, they trained both MoE-27B and Engram-27B on 30B tokens of high-quality long-context data with 32,768-token context windows.
They evaluated on LongPPL (perplexity across books, papers, code, and long chain-of-thought) and RULER (comprehensive long-context benchmark including multi-query needle-in-a-haystack and variable tracking).
The results demonstrate architectural superiority. DeepSeek selected an Engram-27B checkpoint that matched the baseline's pre-training loss, controlling for base model capability. After identical long-context training, Engram-27B achieved Multi-Query NIAH accuracy of 97.0 versus 84.2 for the baseline, and Variable Tracking 87.2 versus 77.0. These gains are not marginal. They suggest that attention capacity is genuinely freed for global context when local patterns are offloaded.
System efficiency: decoupling compute and memory
Unlike MoE, which relies on runtime hidden states for dynamic routing, Engram's retrieval indices depend solely on the input token sequence. The system knows in advance which embeddings will be needed. This enables a prefetch-and-overlap strategy: the system asynchronously retrieves embeddings from host memory while the GPU processes preceding layers.
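A toy version of prefetch-and-overlap looks like this. The host table is a small dictionary standing in for embeddings held in host DRAM, the slot function is a stand-in hash, and `run_early_layers` is a placeholder for GPU work. All names here are hypothetical. The point is only the ordering: the fetch is issued before the early layers run, because the slots depend on token IDs alone.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an embedding table living in host memory.
HOST_TABLE = {slot: [float(slot), float(slot) + 0.5] for slot in range(16)}


def slots_from_tokens(token_ids):
    """Slot indices depend only on input tokens, never on hidden states."""
    return [(a * 31 + b) % len(HOST_TABLE) for a, b in zip(token_ids, token_ids[1:])]


def fetch_from_host(slots):
    return [HOST_TABLE[slot] for slot in slots]


def run_early_layers(token_ids):
    return [float(t) for t in token_ids]  # placeholder for the preceding layers


def forward(token_ids):
    slots = slots_from_tokens(token_ids)            # known before any layer runs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_from_host, slots)  # start the fetch early
        hidden = run_early_layers(token_ids)           # overlaps with the fetch
        memory = pending.result()                      # ready when Engram needs it
    return hidden, memory


hidden, memory = forward([3, 5, 8])
print(hidden, memory)
```

An MoE router could not be scheduled this way, since its routing decision needs the hidden state that the preceding layers have yet to produce.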
Empirical results show that offloading a 100B-parameter table to host memory incurs less than 3 percent overhead. This is infrastructure-aware efficiency: the architecture is designed with hardware constraints in mind. The system decouples storage from compute, bypassing GPU memory constraints and enabling aggressive parameter expansion without proportional hardware costs.
The implication is that Engram can scale memory far beyond what fits in GPU HBM, using abundant host DRAM and storage capacity with minimal latency impact.
Large-scale pre-training: empirical validation
Guided by the U-shaped allocation law, DeepSeek trained three models on 262 billion tokens using identical data curriculum and optimization:

The results are striking. Engram-27B consistently improves over the iso-parameter, iso-FLOPs MoE-27B baseline:

Engram improves performance across all domains, with the largest gains in reasoning and long-context tasks rather than pure knowledge retrieval.
Not without limitations
We always keep limits in mind. It’s important to understand that conditional memory does not make models human-like. It does not solve hallucinations – hash collisions and polysemy mean wrong memory can still be retrieved. It does not replace alignment or careful system design.
What it does enable is architectural decoupling: of knowledge scale from dense compute, of recall from full context, and of capacity from latency. Engram opens a path toward systems that can operate over long horizons without dragging their entire history into every step. It also establishes a quantitative framework, the U-shaped allocation law, for deciding how much capacity should be memory versus compute.
The technique requires careful engineering and the gains are task-dependent, but within its scope, Engram is effective. It provides a modeling primitive that has been missing from Transformers since their inception: a native mechanism for static knowledge lookup.
Conclusion: selective intelligence as an architectural principle
Engram fits naturally into a broader shift toward dynamic inference and agentic systems. Models are no longer single-shot predictors. They branch, retry, deliberate, and operate over extended horizons. In that setting, dense recall becomes a liability. Systems that act over hours, days, or weeks cannot afford to carry everything with them, nor can they waste early layers reconstructing static patterns that never change. Architectural decisions that once looked minor during pre-training become decisive during deployment.
Conditional memory addresses this constraint directly. It treats memory as a first-class modeling primitive with its own scaling laws, decoupled from dense computation and context length. It shows that optimal architectures are neither purely dense nor purely conditional, but hybrids that allocate capacity according to empirically grounded principles. Some knowledge is static, local, and stereotyped, and for that knowledge a lookup table is simply better than a forward pass. Other knowledge requires depth, composition, and reasoning, and that capacity should be preserved.
This is the deeper implication of Engram. It moves us away from models that remember by accident and toward systems that remember by choice. Such selective intelligence may define the future of scaling. We need our systems to learn what they can safely ignore. And that is a very different idea of intelligence.
Sources and further reading
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models by Dai, D., et al. (2024)
Enriching Word Vectors with Subword Information by Bojanowski, P., et al. (2017) (original N-gram embeddings)
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer by Shazeer, N., et al. (2017) (original MoE)
Adaptive Computation Time for Recurrent Neural Networks by Graves, A. (2016)
On the Benefits of Learning to Route in Mixture-of-Experts Models by Dikkala, N., et al. (2023)