KV Cache Quantization: A Tour of the Trade-offs

Engineering | 2026-04-02 | 4 min read

Angular coding, JL projections, residual coding. Each family of KV quantization codecs shines under different constraints and breaks under others. A survey of the landscape for anyone weighing options for long-context inference.

Why KV cache memory is the bottleneck

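In a transformer, every token in the context window generates key and value tensors at every attention layer. Those tensors are cached so you do not recompute them during autoregressive decoding. The cost grows linearly with context length and batch size: each cached token costs 2 x layers x KV heads x head dimension x bytes per element. For a 7B parameter model with 32 layers and 32 attention heads, assuming a head dimension of 128 and no grouped-query attention, that is 0.5 MiB per token in fp16. At 4K context that is about 2 GiB per sequence, or 16 GiB at batch size 8. At 32K context a single sequence needs about 16 GiB for KV cache alone.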
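The arithmetic is easy to check for your own model. The sketch below is a back-of-the-envelope calculator, not a measurement: the function name is hypothetical, and the head dimension of 128 and the full set of 32 KV heads are assumptions about a typical 7B layout. Models that use grouped-query attention cache fewer KV heads and land proportionally lower.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size

GIB = 1024 ** 3

# Assumed 7B-style shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
for context_len in (4096, 32768):
    for batch_size in (1, 8):
        size = kv_cache_bytes(32, 32, 128, context_len, batch_size)
        print(f"context={context_len:>6} batch={batch_size}: {size / GIB:.1f} GiB")
```

Swapping `kv_heads=8` into the call shows how much of the problem grouped-query attention removes before any quantization is applied.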

Quantizing the KV cache compresses those tensors. The trade-off is always precision versus memory. What varies by approach is which information you give up (magnitude, dimensionality, or fine detail) and what it costs to encode.

Three families of approaches

Angular coding

Angular coding quantizes the direction of key and value vectors while discarding or heavily quantizing the magnitude. The intuition: attention scores are computed via dot products, which depend on both direction and magnitude, but across long contexts the direction tends to carry more of the discriminative signal.

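In practice, angular coding stores a unit-norm direction plus one scalar magnitude per vector. The magnitude can be quantized aggressively (e.g., 4-bit), but it is a single number per vector, so the direction codes dominate storage: a direction kept at 8-bit yields only about 2x over fp16, and higher ratios require coding the direction at a few bits per dimension. This is a good fit for long-context inference where the same key vectors are reused across many decoding steps and the relative attention pattern matters more than the absolute magnitude.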
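A minimal sketch of the direction-plus-magnitude split, assuming plain uniform quantizers for both parts. The function names and bit widths are illustrative assumptions rather than a specific published codec, and the codes are held in int8/uint8 arrays for readability where a real cache would bit-pack them.

```python
import numpy as np

def angular_encode(x, dir_bits=4, mag_bits=4):
    """Quantize each row of x as (direction codes, magnitude code)."""
    x = np.asarray(x, dtype=np.float32)
    mag = np.linalg.norm(x, axis=-1, keepdims=True)
    direction = x / np.maximum(mag, 1e-12)

    # Direction: scale each row so its largest entry hits the top code.
    # The per-row scale need not be stored, because decode re-normalizes.
    d_levels = 2 ** (dir_bits - 1) - 1
    peak = np.maximum(np.abs(direction).max(axis=-1, keepdims=True), 1e-12)
    d_codes = np.round(direction / peak * d_levels).astype(np.int8)

    # Magnitude: one unsigned code per row, with a single shared scale.
    m_levels = 2 ** mag_bits - 1
    m_scale = max(float(mag.max()), 1e-12) / m_levels
    m_codes = np.round(mag / m_scale).astype(np.uint8)
    return d_codes, m_codes, m_scale

def angular_decode(d_codes, m_codes, m_scale):
    """Rebuild vectors: unit direction from the codes, times dequantized magnitude."""
    direction = d_codes.astype(np.float32)
    direction /= np.maximum(np.linalg.norm(direction, axis=-1, keepdims=True), 1e-12)
    return direction * (m_codes.astype(np.float32) * m_scale)

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 128)).astype(np.float32)  # stand-in for one head's keys
keys_hat = angular_decode(*angular_encode(keys))
cosine = np.sum(keys * keys_hat, axis=-1) / (
    np.linalg.norm(keys, axis=-1) * np.linalg.norm(keys_hat, axis=-1)
)
print("mean cosine similarity:", float(cosine.mean()))
```

Raising `dir_bits` trades memory for directional fidelity, while `mag_bits` barely moves the total because there is only one magnitude per vector.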

JL projections

Johnson-Lindenstrauss (JL) projections reduce the dimensionality of key/value tensors via random linear projections. The Johnson-Lindenstrauss lemma guarantees that pairwise distances between vectors are approximately preserved in the projected space with high probability, given a projection dimension that grows logarithmically with the number of vectors and with the inverse square of the tolerated distortion.

The appeal: no training required. Pick a random projection matrix, project once, store the compressed representation. The compressed vectors are smaller, lookups are faster, and the error is bounded probabilistically. The downside is that JL projections are noisy for small sets of vectors. They shine when you have many long contexts and need aggressive memory reduction without task-specific tuning.
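To make the "project once, store the compressed representation" step concrete, here is a small sketch using a Gaussian random matrix. The helper names, the seed handling, and the synthetic keys are assumptions for illustration. Queries would have to be projected with the same matrix at read time.

```python
import numpy as np

def make_jl_matrix(in_dim, proj_dim, seed=0):
    """Gaussian projection scaled so squared norms are preserved in expectation."""
    rng = np.random.default_rng(seed)
    return (rng.standard_normal((in_dim, proj_dim)) / np.sqrt(proj_dim)).astype(np.float32)

def mean_distance_distortion(vectors, proj):
    """Mean relative error of squared distances between consecutive vectors."""
    diffs = vectors[1:] - vectors[:-1]
    exact = np.sum(diffs ** 2, axis=-1)
    approx = np.sum((diffs @ proj) ** 2, axis=-1)
    return float(np.mean(np.abs(approx - exact) / exact))

rng = np.random.default_rng(1)
keys = rng.standard_normal((1024, 128)).astype(np.float32)  # stand-in for one head's keys

for proj_dim in (16, 32, 64):
    proj = make_jl_matrix(128, proj_dim)
    compressed = keys @ proj  # what the cache would store: (1024, proj_dim)
    distortion = mean_distance_distortion(keys, proj)
    print(f"proj_dim={proj_dim}: stored shape={compressed.shape}, distortion={distortion:.3f}")
```

The distortion shrinks as `proj_dim` grows, which is the knob behind the "tunable" compression ratio: smaller projections save more memory and add more noise.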

Residual coding

Residual coding quantizes KV vectors in stages. The first stage produces a coarse quantized approximation. The second stage quantizes the residual between the original vector and the first-stage reconstruction. Repeat until the residual is small enough to discard.

This is the same multi-stage idea used in vector database quantization (residual quantization and additive quantization, often layered on top of product quantization). Applied to KV caches it achieves the best quality-per-bit of the three families, but with higher compute cost during encoding. If you are running inference where latency is tight and the KV cache is written once and read many times, the encoding overhead amortizes. If you are encoding on every forward pass, it may not.
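The staged structure can be sketched in a few lines. This version substitutes a simple per-row uniform scalar quantizer at each stage, which is an assumption made for brevity; production codecs in this family typically use learned codebooks. All names are hypothetical.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Symmetric per-row uniform quantizer: integer codes plus one scale per row."""
    levels = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / levels
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale

def residual_encode(x, bits_per_stage=(4, 4)):
    """Quantize x, then quantize whatever the previous stages failed to capture."""
    stages = []
    residual = np.asarray(x, dtype=np.float32)
    for bits in bits_per_stage:
        codes, scale = quantize_uniform(residual, bits)
        stages.append((codes, scale))
        residual = residual - codes.astype(np.float32) * scale
    return stages

def residual_decode(stages):
    """Sum the dequantized contribution of every stage."""
    return sum(codes.astype(np.float32) * scale for codes, scale in stages)

rng = np.random.default_rng(0)
values = rng.standard_normal((8, 64)).astype(np.float32)
for plan in ((4,), (4, 4), (4, 4, 4)):
    error = np.abs(residual_decode(residual_encode(values, plan)) - values).max()
    print(f"stages={plan}: max abs error={error:.5f}")
```

Each extra stage costs another encoding pass and more bits, which is where the "High" encoding cost comes from. The loop also shows why quality is controllable: you stop adding stages once the residual is small enough.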

KV cache quantization: approach comparison

Approach          Compression ratio (vs fp16)   Quality loss                Encoding cost   Best fit
Angular coding    4-6x                          Low (direction preserved)   Very low        Long-context, repetitive queries
JL projection     3-8x (tunable)                Medium (probabilistic)      Negligible      Large-batch, memory-constrained
Residual coding   6-12x                         Very low (controllable)     High            Quality-critical, read-heavy cache
Standard INT8     2x                            Minimal                     Negligible      General baseline
Standard INT4     4x                            Moderate (task-dependent)   Negligible      General baseline

What actually happens to attention quality

The headline compression ratios are not the relevant metric. What matters is the effect on downstream task performance. The findings from the literature are consistent on a few points:

For tasks that depend on retrieving specific facts from long context (needle-in-a-haystack style), angular coding and JL projections degrade noticeably above 8x compression. The directional signal breaks down when the projected space gets too small.
For tasks with short to medium context (under 4K tokens), INT8 quantization of keys and values is effectively lossless on most benchmarks. There is no reason to use fancier approaches at that scale.
Residual coding holds quality well even at 10-12x compression, but the gains over INT4 plus a residual correction are marginal on most tasks and the extra complexity is usually not worth it outside of specialized deployments.
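Before committing to a codec, it helps to look at what it does to the attention distribution itself, as a cheap proxy ahead of full task evaluation. The sketch below compares exact and quantized keys on synthetic data. The metric choices (mean KL divergence and top-1 agreement), the helper names, and the uniform quantizer are assumptions, and random Gaussian keys are not a substitute for measuring on your own workload.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fake_quantize(x, bits):
    """Quantize then dequantize with a symmetric per-row uniform quantizer."""
    levels = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / levels
    return np.round(x / scale) * scale

def attention_drift(queries, keys, keys_hat):
    """Mean KL(exact || approx) and top-1 agreement between two attention maps."""
    scale = 1.0 / np.sqrt(keys.shape[-1])
    p = softmax(queries @ keys.T * scale)
    q = softmax(queries @ keys_hat.T * scale)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    top1 = np.mean(p.argmax(axis=-1) == q.argmax(axis=-1))
    return float(kl.mean()), float(top1)

rng = np.random.default_rng(0)
queries = rng.standard_normal((16, 64))
keys = rng.standard_normal((512, 64))
for bits in (8, 4, 2):
    kl, top1 = attention_drift(queries, keys, fake_quantize(keys, bits))
    print(f"{bits}-bit keys: mean KL={kl:.5f}, top-1 agreement={top1:.2f}")
```

Top-1 agreement is the more relevant number for needle-in-a-haystack style retrieval, since a single wrong argmax can mean the fact is missed entirely.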

Rough memory envelope at 32K context, 7B model, fp16 baseline (one sequence; assumes 32 layers, 32 KV heads, head dimension 128)

Method               Memory (GiB)   vs baseline
fp16 (baseline)        16.0            --
INT8                    8.0          -50%
Angular coding          3.2          -80%
JL projection (dim/4)   4.0          -75%
Residual coding         1.5          -91%

Practical starting point

If you are not already running INT8 on your KV cache, start there. The implementation is straightforward, the quality loss is negligible for most workloads, and you get a free 2x memory reduction. Only move to fancier approaches if INT8 is not enough and you have measured the quality impact on your specific task distribution.
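For reference, this is roughly what an INT8 baseline amounts to. The sketch assumes symmetric quantization with one scale per token (row), which is one common choice among several; the function names are hypothetical and this is not any particular framework's API.

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-token INT8: int8 codes plus one fp16 scale per row."""
    x = np.asarray(x, dtype=np.float32)
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / 127.0
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale.astype(np.float16)

def int8_dequantize(codes, scale):
    """Recover an fp32 approximation of the original tensor."""
    return codes.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
keys = rng.standard_normal((32, 128)).astype(np.float16)  # (tokens, head_dim)
codes, scale = int8_quantize(keys)
restored = int8_dequantize(codes, scale)

ratio = keys.nbytes / (codes.nbytes + scale.nbytes)
error = np.abs(restored - keys.astype(np.float32)).max()
print(f"compression vs fp16: {ratio:.2f}x, max abs error: {error:.4f}")
```

The ratio comes out slightly under 2x because the per-token scales are stored alongside the codes. Coarser scale granularity narrows that gap at some cost in accuracy.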

The interaction with eviction policy

Quantization and eviction are not independent. Heavy quantization with a bad eviction policy compounds errors: you are both approximating the tensors and throwing away the tokens that matter. The usual recommendation is to get your eviction policy right first (preserve high-attention tokens, see our note on KV eviction), then apply quantization on top. Trying to compensate for bad eviction with higher-fidelity quantization is expensive and ineffective.
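The ordering can be sketched as a two-step pipeline: choose which tokens survive, then quantize only the survivors. The example assumes you already track an accumulated attention mass per cached token. That bookkeeping, the function name, and the simple top-k rule are assumptions for illustration, not a recommended eviction policy.

```python
import numpy as np

def evict_then_quantize(keys, attn_mass, budget):
    """Keep the `budget` highest-attention tokens, then INT8-quantize the survivors."""
    # Step 1 (eviction): top-`budget` tokens by attention mass, in original order.
    keep = np.sort(np.argsort(-np.asarray(attn_mass))[:budget])
    kept = np.asarray(keys, dtype=np.float32)[keep]

    # Step 2 (quantization): symmetric per-token INT8 on what remains.
    scale = np.maximum(np.abs(kept).max(axis=-1, keepdims=True), 1e-12) / 127.0
    codes = np.round(kept / scale).astype(np.int8)
    return keep, codes, scale

rng = np.random.default_rng(0)
keys = rng.standard_normal((5, 8)).astype(np.float32)
attn_mass = np.array([0.10, 0.50, 0.05, 0.30, 0.05])  # hypothetical per-token totals

keep, codes, scale = evict_then_quantize(keys, attn_mass, budget=2)
print("kept token indices:", keep.tolist())
print("stored codes shape:", codes.shape)
```

Because quantization runs only on the tokens that survive, a poor choice in step 1 cannot be repaired by spending more bits in step 2.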
