We want LLMs to be as fast and accurate as they can, so chasing high inference speed is an important and non-trivial job for developers. Since LLMs run on hardware, two things determine processing speed:
Computation – how many math operations the hardware must perform.
Memory – how much data needs to be moved in and out of memory.
The thing is that LLM inference is more memory-bound than compute-bound. Generating a token requires only small matrix-vector multiplications, but it also involves loading large amounts of data from memory, and this process repeats for every token. For LLM inference, memory is a bigger issue than math (hardware compute), which is why so much research tries to overcome these memory limits.
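A rough back-of-the-envelope calculation makes the memory-bound claim concrete. The sketch below counts the math operations and the bytes moved for one matrix-vector product in FP16. The function name and the 4096 dimension are illustrative assumptions, not figures from the paper.

```python
def matvec_arithmetic_intensity(d_in: int, d_out: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for y = W @ x, with W of shape (d_out, d_in)."""
    flops = 2 * d_in * d_out  # one multiply and one add per weight
    # Bytes moved: the weight matrix, the input vector, and the output vector.
    bytes_moved = bytes_per_value * (d_in * d_out + d_in + d_out)
    return flops / bytes_moved


intensity = matvec_arithmetic_intensity(4096, 4096)
print(f"{intensity:.3f} FLOPs per byte")
```

The result is roughly one FLOP per byte. Modern accelerators can typically perform far more arithmetic than that for every byte they read, so during decoding the compute units mostly wait on memory.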
One such effort is a recent study from UC Berkeley, FuriosaAI, ICSI, and LBNL on XQuant – a method that can cut memory use by up to 12.5× at the cost of a little extra compute. XQuant and its variation XQuant-CL sidestep typical techniques like direct KV cache quantization, which usually lead to an accuracy drop.
So today, starting from KV caching, we will follow the journey of the XQuant creators to see how this new technique actually works and what it is really capable of. Join us!
In today’s episode, we will cover:
KV cache and KV caching
KV cache compression and its limitations
What is XQuant?
How does XQuant work?
XQuant-CL: A better version
XQuant working with Grouped-Query Attention (GQA)
Performance gains
The benefits of XQuant and XQuant-CL
Not without limitations
Conclusion
Sources and further reading
What is KV cache? How KV caching works
Today we focus on how to deal with large memory use. For short texts, the model weights are the main memory load, while for longer texts, the bigger problem is the Key-Value (KV) cache. It stores the representation of the whole sequence for self-attention to help the model track the full context, and it grows linearly with input length, so at long contexts it becomes the main memory bottleneck.
One of the most popular methods is to compress the KV cache through quantization. But first of all, let’s look at what the KV cache exactly is.
In transformer models, every time a token is processed, the attention mechanism needs Keys (K) and Values (V) to figure out how much each previous token should influence the current one. Each token is represented by three vectors:
Key (K): Encodes what information a token represents.
Value (V): Encodes the actual content to be passed along if that token is selected as relevant.
Query (Q): Encodes the features used to compute similarity scores with Keys (K).
The model’s attention mechanism compares a token’s Query with all previous Keys. The resulting similarity scores are used to weight the corresponding Values and to produce a context-aware representation of the token.
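The comparison described above can be sketched in a few lines of NumPy. This is a minimal single-head version with the usual scaled dot-product and softmax; the function name `attend` and the toy shapes are assumptions made for illustration.

```python
import numpy as np


def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention output for one Query against all available Keys and Values.

    q: (d,), K: (n, d), V: (n, d_v). Returns a vector of shape (d_v,).
    """
    scores = K @ q / np.sqrt(q.shape[-1])    # similarity of q with each Key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # weighted sum of the Values


rng = np.random.default_rng(0)
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 4))
q = rng.normal(size=8)
out = attend(q, K, V)
print(out.shape)  # (4,)
```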
Without caching, the model would have to recompute all K and V matrices from scratch for every new token during autoregressive text generation. This is redundant and computationally expensive, since the cost grows with the length of the sequence.
Overall, the KV cache is a memory structure that stores the previously computed Keys and Values, so they don’t need to be recalculated at each step of decoding.
And what is KV caching?
KV caching is the process of using the KV cache during inference. Instead of recalculating attention against the entire sequence of tokens every time, the model only does new work for the new tokens:
It computes the Keys and Values for the initial (prompt) tokens and stores them in GPU memory.
Then, at each step, the model retrieves stored Keys and Values, adds the new token’s Keys and Values, and computes attention only for the new token’s Query.

Image Credit: Mastering LLM Techniques: Inference Optimization, NVIDIA blog
In this case, the cache grows with each step and attention is computed much faster.
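Here is a minimal sketch of that loop for a single attention head in a single layer. The `KVCache` class, the random weights, and the fixed hidden states `X` are all assumptions for illustration; a real model has many layers and heads. The final check confirms that decoding with the cache gives the same outputs as recomputing every Key and Value at each step.

```python
import numpy as np


def attend(q, K, V):
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V


class KVCache:
    """Append-only store of Keys and Values for one attention head."""

    def __init__(self, d_k: int, d_v: int):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_v))

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.K = np.vstack([self.K, k[None, :]])
        self.V = np.vstack([self.V, v[None, :]])


rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(6, d))  # hidden states of 6 tokens, one row per token

cache = KVCache(d, d)
cached_outputs = []
for x in X:                       # one decoding step per token
    cache.append(x @ Wk, x @ Wv)  # project only the new token
    cached_outputs.append(attend(x @ Wq, cache.K, cache.V))

# Reference: recompute every Key and Value from scratch at each step.
full_outputs = [
    attend(X[t] @ Wq, X[: t + 1] @ Wk, X[: t + 1] @ Wv) for t in range(len(X))
]
print(np.allclose(cached_outputs, full_outputs))
```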
As we have already mentioned, the KV cache grows linearly with input length, driving up memory usage. There are two common ways to compress the KV cache.
KV cache compression methods and their limits
The first method is quantization. Compressing the KV cache using quantization means representing it with fewer bits. Thanks to this, more tokens can be cached, enabling longer context windows and reducing memory use. However, when the quantization goes too low, like 2–3 bits, model accuracy drops sharply.
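To see why very low bit widths hurt, here is a small sketch of plain uniform quantization applied to random data standing in for a slice of the KV cache. The helper names and the single per-tensor scale are simplifying assumptions; real KV cache quantizers usually work per channel or per token and in groups.

```python
import numpy as np


def quantize_uniform(x: np.ndarray, bits: int):
    """Asymmetric uniform quantization with one scale and offset for the whole tensor."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)  # integers in [0, levels]
    return codes, scale, lo


def dequantize_uniform(codes: np.ndarray, scale: float, lo: float) -> np.ndarray:
    return codes.astype(np.float64) * scale + lo


rng = np.random.default_rng(0)
kv = rng.normal(size=(256, 64))  # stand-in for a slice of the KV cache
errors = {}
for bits in (8, 4, 3, 2):
    codes, scale, lo = quantize_uniform(kv, bits)
    errors[bits] = float(np.abs(dequantize_uniform(codes, scale, lo) - kv).mean())
    print(bits, "bits -> mean abs error", round(errors[bits], 4))
```

Each bit removed roughly doubles the step between representable values, so the reconstruction error climbs quickly as the bit width shrinks.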
Another approach is to shrink the KV cache using low-rank decomposition, which breaks large matrices into smaller, compressed factors. However, this method also comes at an accuracy cost: it risks throwing away important information, and it is mathematically heavy because of the compression and decompression that happen across different ranks.
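A truncated SVD is the simplest way to illustrate the idea. The sketch below is not any specific published method: the function name, the rank of 16, and the random stand-in for cached Keys are assumptions.

```python
import numpy as np


def low_rank_compress(M: np.ndarray, rank: int):
    """Return factors (A, B) such that A @ B is the best rank-`rank` approximation of M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]


rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 64))  # stand-in for cached Keys
A, B = low_rank_compress(keys, rank=16)
stored = A.size + B.size           # values kept after compression
rel_error = np.linalg.norm(keys - A @ B) / np.linalg.norm(keys)
print(stored, "values instead of", keys.size, "| relative error", round(float(rel_error), 3))
```

Random data has no low-rank structure, so the error here is large. Real Keys and Values are more structured, but the same tradeoff applies: whatever falls outside the kept rank is lost.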
Another interesting idea came from OpenMachine and is called Slim Attention. It stores only Keys and uses math tricks to recover Values from them. But this requires unstable matrix inverses and also doesn’t work well with Rotary Position Embeddings (RoPE) or Grouped-Query Attention (GQA).
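The math trick can be sketched as follows, assuming a square and invertible Key projection as in standard multi-head attention. Since K = X·Wk, the input can be recovered as X = K·Wk⁻¹, which gives V = K·(Wk⁻¹·Wv). The variable names and random weights are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk = rng.normal(size=(d, d))  # square projection, as in MHA
Wv = rng.normal(size=(d, d))
X = rng.normal(size=(10, d))  # layer inputs, one row per token

K = X @ Wk
W_k_to_v = np.linalg.solve(Wk, Wv)  # equals inv(Wk) @ Wv, computed once offline
V_from_K = K @ W_k_to_v             # Values recovered from Keys alone

print(np.allclose(V_from_K, X @ Wv))
print("condition number of Wk:", round(float(np.linalg.cond(Wk)), 1))
```

The recovery is only as stable as the inverse of Wk: a high condition number amplifies any error in the stored Keys. In GQA, Wk is not square, so this inverse does not exist in the same form.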
So all these methods point out a big issue:
GPUs will keep getting faster at computation, but memory growth will lag behind;
Plus, accuracy drops with methods like KV cache quantization or low-rank decomposition.
That’s why we need something that saves more memory, stays more accurate, and is easier to compress. One related direction attacking the same memory wall is conditional memory in LLMs, a selective activation principle where models choose what knowledge to access rather than loading everything at once. And here is one of the freshest solutions.
What is XQuant?
XQuant is a new method proposed by a group of researchers from UC Berkeley, FuriosaAI, ICSI, and LBNL that trades some extra computation for much less memory usage when running LLMs, breaking what the researchers call the Memory Wall. They follow this idea: GPUs can theoretically do huge amounts of math, but they can’t feed themselves data fast enough, so the logical solution is to reduce memory operations, even if it costs additional compute.
Instead of storing the usual KV cache, XQuant quantizes and stores only the layer input activations, called X. Then, when needed, it rematerializes (recalculates) Keys and Values from X on the fly during inference. This way, the method does not have to keep everything in memory: it throws away some data and recomputes it later when needed. Let’s look at the inner side of this process. It’s fascinating →
How XQuant works: storing X instead of KV
XQuant stores less and recalculates more, and here is how it does it in several steps:
It stores the input activations X – the representation after layer normalization.
X is then quantized, meaning compressed into fewer bits.
When Keys and Values are needed, they are recomputed on-the-fly from X by multiplying with the Key and Value projection matrices: Wk and Wv.

Image Credit: XQuant original paper
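This is a good tradeoff because storing X requires only about half the memory compared to standard KV caching: X is a single tensor per layer, while the KV cache holds two tensors, Keys and Values, which in standard multi-head attention are each the same size as X. The researchers didn’t stop at this version of their method and also created something more interesting.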
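The three steps can be sketched for one layer as follows. This is a toy version under stated assumptions: random weights, per-token 4-bit uniform quantization, and helper names invented for illustration. It is not the authors' implementation.

```python
import numpy as np


def quantize(x, bits):
    """Per-token (row-wise) asymmetric uniform quantization."""
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    scale = (x.max(axis=1, keepdims=True) - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)
    return np.round((x - lo) / scale).astype(np.uint8), scale, lo


def dequantize(codes, scale, lo):
    return codes * scale + lo


rng = np.random.default_rng(0)
n_tokens, d = 32, 64
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)
X = rng.normal(size=(n_tokens, d))     # post-layer-norm inputs for one layer

x_cache = quantize(X, bits=4)          # the only thing kept in memory
X_hat = dequantize(*x_cache)
K_hat, V_hat = X_hat @ Wk, X_hat @ Wv  # rematerialized on the fly

k_err = np.linalg.norm(K_hat - X @ Wk) / np.linalg.norm(X @ Wk)
kv_values, x_values = 2 * n_tokens * d, n_tokens * d
print("relative Key error:", round(float(k_err), 3))
print("cached values: X =", x_values, "vs K+V =", kv_values)
```

The two extra matrix multiplications at read time are the compute price; the halved number of cached values, before any bit-width savings, is the memory payoff.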
XQuant-CL explained: cross-layer compression
Developers of XQuant noticed that X looks similar across layers. This happens because each layer only slightly refines the input before passing it along through what is called the residual stream.
This hint encouraged researchers to make their approach even smarter, creating XQuant-CL that doesn’t store each full X, but stores only the differences (deltas) between them. This makes extreme compression possible:
With 3-bit quantization → 10× memory savings and almost no accuracy loss.
With 2-bit quantization → 12.5× memory savings and still very close to the original model’s accuracy.
More precisely, XQuant-CL’s step-by-step workflow looks like this:

Image Credit: XQuant-CL, XQuant original paper
First layer (X₀) is stored at higher precision as the baseline.
Next layers (X₁, X₂, …):
The small changes from the previous layer are computed: ΔXᵢ = Xᵢ − Xᵢ₋₁.
As these deltas have a smaller range of values, they can be quantized aggressively, so each delta is quantized and cached.
Reconstruction: To recover X for a layer, XQuant-CL just adds up the deltas:
X̂ᵢ = X₀ + ΔX̂₁ + ΔX̂₂ + … + ΔX̂ᵢ.
This builds back the refined embedding step-by-step.
To avoid constant reloading and make reconstruction cheaper, XQuant-CL uses an accumulator that always holds the running sum of X₀ and all past deltas. As a result, for a new layer, it only needs the accumulator plus the latest delta.
Even though XQuant-CL uses simple uniform quantization, it performs better than more complicated KV compression methods.
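The workflow above can be sketched on a toy residual stream. Everything in this example is an assumption made for illustration: the update scale of 0.05, the layer count, and the `fake_quantize` helper, which quantizes and immediately dequantizes to simulate the lossy round trip. It compares 2-bit deltas against quantizing the last layer's X directly at 2 bits.

```python
import numpy as np


def fake_quantize(x, bits):
    """Quantize per token and immediately dequantize (simulates the lossy round trip)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    scale = (x.max(axis=1, keepdims=True) - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)
    return np.round((x - lo) / scale) * scale + lo


rng = np.random.default_rng(0)
n_layers, n_tokens, d = 6, 16, 64

# Toy residual stream: each layer adds a small update to the previous input.
X = [rng.normal(size=(n_tokens, d))]
for _ in range(1, n_layers):
    X.append(X[-1] + 0.05 * rng.normal(size=(n_tokens, d)))

# Cache: X_0 at full precision, then one 2-bit delta per later layer.
deltas = [fake_quantize(X[i] - X[i - 1], bits=2) for i in range(1, n_layers)]

accumulator = X[0].copy()  # running sum: X_0 plus all deltas so far
recon = [accumulator.copy()]
for delta in deltas:
    accumulator += delta   # only the newest delta has to be loaded
    recon.append(accumulator.copy())

err_cl = np.abs(recon[-1] - X[-1]).mean()
err_direct = np.abs(fake_quantize(X[-1], bits=2) - X[-1]).mean()
print("last layer, 2-bit deltas:", round(float(err_cl), 4))
print("last layer, 2-bit direct:", round(float(err_direct), 4))
```

Because the deltas span a much narrower range than X itself, the same 2 bits give a far finer step size, even though the per-layer errors add up along the way.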
In pursuit of the best efficiency across different model types, researchers pushed XQuant and XQuant-CL’s capabilities further, and their next stop was the support of Grouped-Query Attention (GQA). Why is it so necessary?
XQuant with GQA: how it handles grouped-query attention
The standard attention type in LLMs is Multi-Head Attention (MHA), where every query head has its own Keys and Values. Many efficient models, such as Llama, Mistral, Gemma, and Qwen, use Grouped-Query Attention (GQA), in which several query heads share the same Keys and Values, which already reduces the memory size of the KV cache.
GQA is a great benefit for AI models, but plain XQuant has some issues with it. The input activations X are large, while Keys and Values in GQA are much smaller. Simply caching X, or even deltas in XQuant-CL, instead of KV could actually use more memory than standard KV caching. This issue just cancels the benefit of XQuant for GQA-based models, and it is not what we need.
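A quick count shows the size of the problem. The shape used below (hidden size 4096, head dimension 128, 32 query heads, 8 KV heads under GQA) matches the public Llama-3.1-8B configuration and is not taken from this article; the function name is an assumption.

```python
def values_per_token_per_layer(hidden_size: int, n_kv_heads: int, head_dim: int):
    """Count of cached numbers per token per layer for each caching strategy."""
    kv_cache = 2 * n_kv_heads * head_dim  # one Key and one Value per KV head
    x_cache = hidden_size                 # the full layer input X
    return kv_cache, x_cache


mha_kv, mha_x = values_per_token_per_layer(hidden_size=4096, n_kv_heads=32, head_dim=128)
gqa_kv, gqa_x = values_per_token_per_layer(hidden_size=4096, n_kv_heads=8, head_dim=128)
print("MHA: KV =", mha_kv, "X =", mha_x)  # MHA: KV = 8192 X = 4096
print("GQA: KV =", gqa_kv, "X =", gqa_x)  # GQA: KV = 2048 X = 4096
```

With MHA, caching X halves the footprint. With this GQA shape, X is twice as large as the Keys and Values it would replace.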
To solve this problem, XQuant and XQuant-CL use Singular Value Decomposition (SVD) offline on the Key and Value projection matrices (Wk and Wv). Here is how it works:
XQuant and GQA-based models

Image Credit: XQuant original paper
First, before inference, the Key and Value weight matrices, Wk and Wv, are decomposed offline with SVD, creating low-dimensional subspaces, Uk and Uv.
During inference, the input X is down-projected into these subspaces, quantized, and cached.
When Keys and Values are needed, they’re rebuilt by multiplying with the fused smaller matrices.
An interesting detail: in the down-projected Key space, all outliers cluster in the first channel, which makes them easier to handle during quantization, while Values are best handled per-token.
With this method, the memory footprint now matches GQA’s KV cache size, but the down-projected X is easier to quantize than KV directly, so accuracy stays higher at the same bit width.
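The Key side of this workflow can be sketched as below. The quantization step is left out on purpose, to show that the down-projection alone loses nothing. The sizes, random weights, and the names `Bk_t` and `fused` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_kv, n_tokens = 64, 16, 10  # d_kv < d, as in GQA
Wk = rng.normal(size=(d, d_kv))
X = rng.normal(size=(n_tokens, d))

# Offline: thin SVD, Wk = Uk @ diag(Sk) @ Bk_t.
Uk, Sk, Bk_t = np.linalg.svd(Wk, full_matrices=False)  # Uk has shape (d, d_kv)
fused = Sk[:, None] * Bk_t  # diag(Sk) @ Bk_t, shape (d_kv, d_kv)

# Inference: cache the down-projected input instead of X or K.
latent = X @ Uk             # (n_tokens, d_kv), same width as K
K_rebuilt = latent @ fused  # rematerialized Keys

print(latent.shape, np.allclose(K_rebuilt, X @ Wk))
```

The cached latent has the same width as the Keys, so nothing is lost in memory terms, and the rebuild is a small matrix multiplication. The Value side works the same way with Wv.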
XQuant-CL and GQA-based models
For this variant, the Key and Value matrices are combined into one large matrix Wkv = [Wk | Wv], which is decomposed with SVD offline, creating a shared lower-dimensional subspace.
During inference, ΔX is down-projected into this subspace, quantized, and cached, similar to X in the XQuant version.
When rebuilding X for a layer, the cached delta is up-projected, added to the running accumulator, and then multiplied with the fused Key/Value matrix to recover Keys and Values.
XQuant-CL for GQA-based models has the same benefits as the XQuant version. Overall, since the heavy SVD step is done offline, it adds no runtime cost of its own.
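Here is a single-layer sketch of the shared subspace. It leaves out quantization and the cross-layer accumulator, where each layer would bring its own Wkv, and only checks that down-projecting and up-projecting a delta does not change the Keys and Values computed from it. The sizes, random weights, and the stand-in delta are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_kv, n_tokens = 64, 16, 10
Wk, Wv = rng.normal(size=(d, d_kv)), rng.normal(size=(d, d_kv))
Wkv = np.hstack([Wk, Wv])  # shape (d, 2 * d_kv)

# Offline: one SVD gives a shared subspace for Keys and Values.
Ukv, _, _ = np.linalg.svd(Wkv, full_matrices=False)  # shape (d, 2 * d_kv)

delta = 0.05 * rng.normal(size=(n_tokens, d))  # stand-in for X_i - X_(i-1)
latent = delta @ Ukv       # cached form, shape (n_tokens, 2 * d_kv)
delta_up = latent @ Ukv.T  # up-projected back to width d

# The projection discards only directions that Wkv ignores anyway.
print(latent.shape, np.allclose(delta_up @ Wkv, delta @ Wkv))
```

The cached latent is 2 × d_kv wide, the same as one Key plus one Value, so the footprint before quantization matches the GQA KV cache.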
This is all about the inner side of the XQuant and XQuant-CL workflow, but what do we end up with?
XQuant benchmarks: memory savings and accuracy
From the very beginning of employing XQuant, we witness immediate gains – just by switching to storing X (one tensor per layer), memory use is cut in half compared to the normal KV cache.
Then better gains come with quantization. By compressing X into fewer bits, the memory savings grow up to ~7.7×, while model quality barely drops: less than 0.1 perplexity degradation compared to FP16. The model stays almost as accurate as the original.

Image Credit: XQuant original paper
Going further, the XQuant-CL version takes advantage of the fact that X values look similar across layers. It compresses even more, reaching up to 12.5× memory savings with very little accuracy loss.
As for the GQA models (Llama-3.1-8B, Mistral-7B), results for XQuant are the following:
4-bit: Only 0.01 perplexity degradation with 3.7× memory savings.
3-bit: <0.1 perplexity degradation with 5× memory savings.
2-bit: XQuant is still usable, beating KIVI* by 0.57 perplexity with 7.1× more memory savings.
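A simple upper bound helps put these ratios in context. The sketch below ignores all metadata such as scales and zero points, so it is an idealized estimate rather than the paper's accounting; the function and parameter names are assumptions.

```python
def ideal_compression(bits: int, cached_tensors: int, baseline_tensors: int = 2,
                      baseline_bits: int = 16) -> float:
    """Upper bound on memory savings versus an FP16 KV cache of the same width."""
    return (baseline_tensors * baseline_bits) / (cached_tensors * bits)


# MHA: one tensor X replaces two tensors, K and V, of the same width.
mha = {b: ideal_compression(b, cached_tensors=1) for b in (4, 3, 2)}
# GQA: the down-projected latents have the same total width as K and V together.
gqa = {b: ideal_compression(b, cached_tensors=2) for b in (4, 3, 2)}
print({b: round(r, 2) for b, r in mha.items()})  # {4: 8.0, 3: 10.67, 2: 16.0}
print({b: round(r, 2) for b, r in gqa.items()})  # {4: 4.0, 3: 5.33, 2: 8.0}
```

The 3.7× and 5× GQA savings reported above sit just under the 4-bit and 3-bit bounds. The gap is plausibly due to the stored quantization metadata that this estimate leaves out.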
So, summing up the advantages of all XQuant methods, we get →
The benefits of XQuant and XQuant-CL
Across datasets, XQuant and XQuant-CL consistently achieve huge memory savings (up to 12.5×) with really tiny accuracy tradeoffs.
Even though this adds a bit more computation, GPUs can handle it, and the overall result is faster and more efficient inference. These methods directly target the memory bottleneck.
They maintain near-FP16 accuracy even at very low precision.
XQuant-CL is more effective than basic XQuant: it saves more memory, and with the accumulator trick it can rebuild any layer’s input quickly and cheaply.
SVD makes it possible to use both methods with GQA, making them applicable to a wider range of models.
But after all that, a few problems dare to spoil the picture (a little bit).
XQuant limitations: when it may not help
These methods add extra computation for rematerializing Keys and Values on-the-fly every time attention is computed, which could increase latency on some hardware or platforms.
This also means that XQuant effectiveness generally depends on the hardware.
As for the XQuant-CL variant, it also requires managing an accumulator, adding some overhead. Plus, it is more complicated to implement and maintain than plain XQuant.
For GQA models, both variants need the offline SVD and the down-projection workflow; otherwise they risk increasing memory instead of saving it. Reconstruction is approximate, so quantization errors can build up.
Still, in memory-constrained scenarios where accuracy matters, XQuant offers an excellent balance: near-baseline accuracy with massive memory savings.
XQuant vs classic KV cache: when to make the switch
The big work by researchers from UC Berkeley, FuriosaAI, ICSI, and LBNL that we explored today gives us a new angle on dealing with memory limitations, avoiding the classical direct KV cache compression:
It compresses X (the input activations), which is smaller and easier to work with.
It uses cross-layer similarity in X to squeeze memory use even more.
It may be tempting to call familiar, time-tested approaches like KV caching outdated, but they remain a solid starting point and a baseline for current and future breakthroughs. XQuant and XQuant-CL take a cleaner route to save more memory with fewer downsides. Which method to use depends on your specific needs, but XQuant is there when you are chasing high accuracy and your number one mission is saving memory at large scale.
Sources and further reading
Mastering LLM Techniques: Inference Optimization (NVIDIA blog)