Angular coding, JL projections, residual coding. Each family of KV quantization codecs shines under different constraints and breaks under others. A survey of the landscape for anyone weighing options for long-context inference.
Why KV cache memory is the bottleneck
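In a transformer, every token in the context window generates key and value tensors at every attention layer. Those tensors are cached so you do not recompute them during autoregressive decoding. The cost grows linearly with context length and batch size, and it adds up fast. Take a 7B-parameter model with 32 layers and 32 attention heads, and assume a head dimension of 128 with no grouped-query attention: in fp16, keys and values cost 2 × 32 × 32 × 128 × 2 bytes = 512 KiB per token. At 4K context that is 2 GiB per sequence, or 16 GiB at batch size 8. At 32K context the same batch needs 128 GiB for KV cache alone.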
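A back-of-the-envelope sizing can be sketched in a few lines. The function name `kv_cache_bytes` is hypothetical, and the head dimension of 128 with full multi-head attention is an assumption; a model with grouped-query attention would pass a smaller `n_kv_heads` and get a proportionally smaller result.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every token at every layer."""
    # Factor 2 covers keys plus values; bytes_per_elem=2 corresponds to fp16.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# Assumed 7B-style configuration: head_dim=128 is not stated in the text.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for seq_len in (4096, 32768):
    size = kv_cache_bytes(seq_len=seq_len, batch_size=8, **cfg)
    print(f"{seq_len:>6} tokens, batch 8: {size / GIB:.1f} GiB")
```

Because every factor is multiplicative, halving the bytes per element (INT8 instead of fp16) or the number of KV heads halves the total.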
Quantizing the KV cache compresses those tensors. The tradeoff is always precision versus memory. What varies by approach is which kind of precision you lose (magnitude, direction, or fine-grained residual detail) and what it costs in compute.
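As a point of reference for the schemes below, here is a minimal sketch of the simplest option: symmetric per-vector INT8 quantization. The helper names are hypothetical, and real implementations typically quantize per channel or per group of tokens on packed tensors rather than on Python lists.

```python
def quantize_int8(vec):
    """Symmetric per-vector quantization: integer codes in [-127, 127] plus one scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


key = [0.12, -1.5, 0.83, 0.004]
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the per-element error by half a step.
max_err = max(abs(a - b) for a, b in zip(key, restored))
print(codes, f"max error {max_err:.5f} (half step = {scale / 2:.5f})")
```

Each element shrinks from 16 bits to 8, plus one scale per vector, which is where the roughly 50% saving for INT8 comes from.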
Three families of approaches
Angular coding
Angular coding quantizes the direction of key and value vectors while discarding or heavily quantizing the magnitude. The intuition: attention scores are computed via dot products, which depend on both direction and magnitude, but across long contexts the direction tends to carry more of the discriminative signal.
In practice, angular coding stores a normalized vector plus a scalar for magnitude, and quantizes the scalar aggressively (e.g., 4-bit) while keeping the direction at 8-bit or fp16. This is a good fit for long-context inference where the same key vectors are reused across many decoding steps and the relative attention pattern matters more than the absolute magnitude.
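The split into direction and magnitude can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the direction is kept at 8 bits per component, the magnitude uses a 4-bit linear code, and `max_norm` is an assumed calibration constant (for example the largest key norm seen for a given head).

```python
import math


def encode_angular(vec, max_norm):
    """Store an 8-bit unit direction and a 4-bit magnitude level."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0] * len(vec), 0
    # Components of a unit vector lie in [-1, 1], so a fixed scale of 1/127 works.
    direction_codes = [round(x / norm * 127) for x in vec]
    # 4-bit magnitude: 16 levels spread linearly over [0, max_norm].
    level = min(15, round(norm / max_norm * 15))
    return direction_codes, level


def decode_angular(direction_codes, level, max_norm):
    """Rebuild an approximate vector from direction codes and magnitude level."""
    norm = level / 15 * max_norm
    return [c / 127 * norm for c in direction_codes]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


key = [0.3, -1.2, 0.7, 2.0]
codes, level = encode_angular(key, max_norm=4.0)
restored = decode_angular(codes, level, max_norm=4.0)
print(f"level={level}, cosine similarity={cosine(key, restored):.5f}")
```

The direction survives almost exactly, while the magnitude snaps to one of 16 levels. Note that a vector whose norm is small relative to `max_norm` can round to level 0 and decode to zero, so a logarithmic magnitude code may be a better choice when norms vary widely.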
JL projections
Johnson-Lindenstrauss (JL) projections reduce the dimensionality of key/value tensors via random linear projections. The Johnson-Lindenstrauss lemma guarantees that a random projection approximately preserves pairwise distances between vectors with high probability, provided the projection dimension is on the order of log(n)/ε² for n vectors and a relative distortion of ε. The required dimension is logarithmic in the number of vectors and independent of the original dimension.
The appeal: no training required. Pick a random projection matrix, project once, store the compressed representation. The compressed vectors are smaller, lookups are faster, and the error is bounded probabilistically. The downside is that JL projections are noisy for small sets of vectors. They shine when you have many long contexts and need aggressive memory reduction without task-specific tuning.
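A minimal sketch of the idea, assuming a dense Gaussian projection matrix with entries drawn from N(0, 1/d_out) so that squared distances are preserved in expectation. The function names and the dimensions (256 projected down to 64) are illustrative choices, and pure-Python lists stand in for real tensors.

```python
import math
import random


def make_projection(d_in, d_out, seed=0):
    """Random Gaussian matrix with variance 1/d_out per entry."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]


def project(matrix, vec):
    """Apply the projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_in, d_out = 256, 64  # a "dim/4" projection
rng = random.Random(1)
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(4)]

P = make_projection(d_in, d_out)  # picked once, reused for every vector
small = [project(P, k) for k in keys]

# Ratio of projected to original distance; values near 1.0 mean little distortion.
ratio = distance(small[0], small[1]) / distance(keys[0], keys[1])
print(f"distance ratio after projection: {ratio:.3f}")
```

The same matrix has to be applied to queries at attention time, since dot products are only meaningful between vectors projected with the same `P`.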
Residual coding
Residual coding quantizes KV vectors in stages. The first stage produces a coarse quantized approximation. The second stage quantizes the residual between the original vector and the first-stage reconstruction. Repeat until the residual is small enough to discard.
This is the approach used in vector database quantization (residual and additive quantization, both close relatives of product quantization). Applied to KV caches it achieves the best quality-per-bit of the three families, but with higher compute cost during encoding. If you are running inference where latency is tight and the KV cache is written once and read many times, the encoding overhead amortizes. If you are encoding on every forward pass, it may not.
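The staged structure can be sketched with two rounds of uniform scalar quantization. This is a simplification made for brevity: production residual coders usually use learned codebooks at each stage, and the function names and the 4+4 bit split here are assumptions.

```python
def quantize_uniform(vec, bits):
    """Symmetric uniform quantizer; returns integer codes and the step size."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def encode_residual(vec, bits_per_stage=(4, 4)):
    """Quantize, subtract the reconstruction, then quantize what is left."""
    stages, residual = [], list(vec)
    for bits in bits_per_stage:
        codes, scale = quantize_uniform(residual, bits)
        stages.append((codes, scale))
        residual = [r - c * scale for r, c in zip(residual, codes)]
    return stages


def decode_residual(stages, upto=None):
    """Sum the reconstructions of the first `upto` stages (all stages if None)."""
    total = [0.0] * len(stages[0][0])
    for codes, scale in stages[:upto]:
        total = [t + c * scale for t, c in zip(total, codes)]
    return total


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


vec = [0.9, -0.35, 0.12, 0.61, -0.77]
stages = encode_residual(vec)
err_coarse = max_error(vec, decode_residual(stages, upto=1))
err_fine = max_error(vec, decode_residual(stages))
print(f"stage 1 only: {err_coarse:.4f}, stages 1+2: {err_fine:.4f}")
```

Each extra stage rescales to the range of the remaining error, which is why the second stage recovers detail the first one could not represent. The encoding cost grows with the number of stages, while decoding is a sum of cheap lookups.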
KV cache quantization: approach comparison
What actually happens to attention quality
The headline compression ratios are not the relevant metric. What matters is the effect on downstream task performance. The findings from the literature are consistent on a few points:
- For tasks that depend on retrieving specific facts from long context (needle-in-a-haystack style), angular coding and JL projections degrade noticeably above 8x compression. The directional signal breaks down when the projected space gets too small.
- For tasks with short to medium context (under 4K tokens), INT8 quantization of keys and values is effectively lossless on most benchmarks. There is no reason to use fancier approaches at that scale.
- Residual coding holds quality well even at 10-12x compression, but the gains over INT4 plus a residual correction are marginal on most tasks and the extra complexity is usually not worth it outside of specialized deployments.
| Method | Memory (GB) | vs. baseline |
|---|---|---|
| fp16 (baseline) | 6.4 | -- |
| INT8 | 3.2 | -50% |
| Angular coding | 1.3 | -80% |
| JL projection (dim/4) | 1.6 | -75% |
| Residual coding | 0.6 | -91% |
The interaction with eviction policy
Quantization and eviction are not independent. Heavy quantization with a bad eviction policy compounds errors: you are both approximating the tensors and throwing away the tokens that matter. The usual recommendation is to get your eviction policy right first (preserve high-attention tokens, see our note on KV eviction), then apply quantization on top. Trying to compensate for bad eviction with higher-fidelity quantization is expensive and ineffective.
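The recommended ordering, evict first and quantize what survives, might look like the sketch below. Everything here is illustrative: `evict_then_quantize` is a hypothetical helper, the eviction rule is a plain top-k on accumulated attention scores, and the quantizer is symmetric INT8 standing in for any of the codecs discussed earlier.

```python
def evict_then_quantize(keys, attn_scores, keep):
    """Keep the `keep` highest-scoring tokens, then quantize only those.

    keys: one vector per token. attn_scores: accumulated attention per token.
    Returns {token_index: (int8_codes, scale)} in original token order.
    """
    ranked = sorted(range(len(keys)), key=lambda i: attn_scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore positional order after ranking

    cache = {}
    for i in kept:
        vec = keys[i]
        max_abs = max(abs(x) for x in vec)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        cache[i] = ([round(x / scale) for x in vec], scale)
    return cache


keys = [[0.1, 0.2], [1.5, -0.3], [0.0, 0.4], [-0.8, 0.9]]
attn_scores = [0.05, 0.60, 0.10, 0.25]

cache = evict_then_quantize(keys, attn_scores, keep=2)
print(list(cache))  # indices of the tokens that survived eviction
```

Running eviction first means quantization effort is spent only on tokens that will be read again, and quantization error never influences which tokens get dropped.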