Rethinking KV Eviction for Attention Caches

Engineering · 4 min read

LRU is the wrong default for attention caches. A read of the research on attention-aware eviction, and why preserving the tokens models actually look at can compound memory savings across decoder layers on long-context workloads.

Why LRU is the wrong default

Least recently used eviction makes sense for a filesystem cache or a CPU cache, where recent access is a good proxy for future access. For attention KV caches, it is the wrong heuristic.

In transformer attention, not all tokens are attended to equally. A few tokens in any context receive disproportionate attention weight: the first token (often a BOS token or system instruction anchor), tokens containing key facts the current generation step depends on, and tokens that many attention heads attend to at once. These "heavy hitter" tokens matter because they are attended to constantly, not because they were touched recently.

LRU has no concept of attention weight. It tracks only the time of last access. A heavy hitter that was last accessed a few steps ago ranks no higher than an incidental token touched at the same moment, and lower than any token touched since. When the cache fills and LRU evicts, it will sometimes evict tokens that every subsequent generation step needs.

Attention-aware eviction

The core idea: instead of evicting the least recently used token, evict the token with the lowest cumulative attention weight across all heads and all decoding steps so far.

LRU vs attention-aware eviction comparison

LRU:
  - Evict the token accessed least recently
  - Ignores attention weight entirely
  - Fast to implement
  - Catastrophic when the "stale" token is a key fact

Attention-aware (H2O / similar):
  - Track per-token accumulated attention weight
  - Evict the token with lowest accumulated weight
  - Preserve "heavy hitter" tokens regardless of recency
  - Cost: O(n) bookkeeping per decoding step

The H2O (Heavy-Hitter Oracle) paper from 2023 formalized this approach and showed that keeping only 5-20% of the KV cache, selected by accumulated attention weight, maintained 95%+ of model quality on most tasks. LRU at the same cache budget showed substantially larger quality drops.
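
To make the policy concrete, here is a minimal sketch of the eviction state in plain Python. It is not the H2O implementation: the class name `HeavyHitterCache`, the `budget` parameter, and the attention numbers are all made up for illustration, and it assumes each step's weights have already been summed over heads. It also leaves out the window of recent tokens that H2O keeps alongside the heavy hitters.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeavyHitterCache:
    """Toy eviction state: tracks accumulated attention per cached token."""

    budget: int  # hypothetical parameter: max tokens kept in the cache
    scores: Dict[int, float] = field(default_factory=dict)  # position -> running sum

    def append(self, position: int) -> None:
        """Register a new token with zero accumulated attention."""
        self.scores[position] = 0.0

    def observe(self, attention: Dict[int, float]) -> None:
        """Add one decoding step's attention weights to the running sums."""
        for position, weight in attention.items():
            if position in self.scores:  # ignore tokens already evicted
                self.scores[position] += weight

    def evict_if_needed(self) -> List[int]:
        """Drop the lowest-scoring tokens until the cache fits the budget."""
        evicted = []
        while len(self.scores) > self.budget:
            victim = min(self.scores, key=self.scores.get)
            del self.scores[victim]
            evicted.append(victim)
        return evicted


# Made-up attention rows for four decoding steps (each row sums to 1).
steps = [
    {0: 1.0},
    {0: 0.9, 1: 0.1},
    {0: 0.5, 1: 0.05, 2: 0.45},
    {0: 0.4, 1: 0.05, 2: 0.25, 3: 0.3},
]

cache = HeavyHitterCache(budget=3)
evicted_log = []
for position, attention in enumerate(steps):
    cache.append(position)
    cache.observe(attention)
    evicted_log.extend(cache.evict_if_needed())

print(evicted_log)           # positions evicted, in order
print(sorted(cache.scores))  # positions still cached
```

In this toy run the cache overflows on the fourth step and position 1 is evicted, because it has the lowest running sum. Position 0 is the oldest entry, yet it survives because it has accumulated the most weight.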

The attention sink phenomenon

One consistent finding: the first 1-4 tokens of any context receive massive accumulated attention weight regardless of their content. This "attention sink" appears to be an artifact of how transformers are trained: the BOS token acts as a garbage collector for attention when no meaningful target is available.

Any attention-aware eviction policy needs to permanently preserve these sink tokens, or quality collapses immediately. The fix is simple: mark the first N tokens as eviction-protected and apply the eviction policy to the rest.
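
The guard can be sketched as a filter in front of whatever scoring rule you use. The function name `pick_victim` and the `num_sinks` parameter are hypothetical. The default of 4 follows the "first 1-4 tokens" observation above, and the sketch assumes cache keys are absolute positions in the original sequence. The scores are contrived so that position 0 happens to have the lowest score, which makes the guard's effect visible.

```python
from typing import Dict, Optional


def pick_victim(scores: Dict[int, float], num_sinks: int = 4) -> Optional[int]:
    """Return the position to evict, never choosing one of the first `num_sinks` positions.

    Returns None when every cached token is protected.
    """
    candidates = {pos: s for pos, s in scores.items() if pos >= num_sinks}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Contrived scores: position 0 has the lowest value.
scores = {0: 0.1, 1: 0.8, 5: 0.9, 6: 0.3, 7: 0.6}

unguarded = pick_victim(scores, num_sinks=0)  # no protection: position 0 is chosen
guarded = pick_victim(scores, num_sinks=2)    # positions 0 and 1 are off limits
print(unguarded, guarded)
```

With the guard enabled the victim is position 6, the lowest-scoring token outside the protected prefix.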

Eviction policy comparison (7B model, 32K context, 10% cache budget)

Policy                       MMLU accuracy     Long-context retrieval   Memory vs full KV
Full KV (no eviction)        100% (baseline)   100%                     100%
LRU                          71%               58%                      10%
Random eviction              68%               62%                      10%
H2O (attention-aware)        94%               89%                      10%
H2O + attention sink guard   97%               93%                      10%
ScissorHands                 96%               91%                      10%

At a 10% cache budget (keeping only 1 in 10 KV entries), the attention-aware policies retain 94-97% of baseline MMLU accuracy and 89-93% of long-context retrieval. LRU retains 71% and 58%. For most production workloads, that gap is the difference between acceptable degradation and user-visible quality loss.

ScissorHands and the persistence observation

ScissorHands (Liu et al., 2023) makes an additional observation, which the paper calls the persistence of importance hypothesis: a token that drew significant attention at earlier decoding steps tends to stay significant at later ones. Because importance persists, eviction decisions based on the attention seen so far tend to remain valid for future steps, so the cost of ranking tokens can be amortized rather than paid in full at every step.

The practical effect: ScissorHands requires less bookkeeping than exhaustive per-head, per-step attention tracking while achieving similar quality, which reduces the per-step overhead of maintaining the eviction state.

When does it actually matter

Attention-aware eviction is relevant when:

  • You are serving long-context requests (8K+ tokens) and memory is a constraint.
  • You are batching multiple requests and need to fit more sequences per GPU.
  • You are implementing a sliding window over a very long document and cannot hold the full KV cache.

For short-context requests (under 4K tokens), the KV cache is small enough that eviction is rarely necessary. The complexity is not worth it.
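
A rough size estimate shows where that threshold comes from. The sketch below uses the 32 layers and 32 heads of the 7B configuration mentioned later in this post. The head dimension of 128, fp16 storage (2 bytes per value), and the absence of grouped-query attention are assumptions, so treat the numbers as order-of-magnitude only. The function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 32,
    num_kv_heads: int = 32,
    head_dim: int = 128,       # assumed, not stated in the post
    bytes_per_value: int = 2,  # assumed fp16 storage
) -> int:
    """Bytes needed to hold keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
short_ctx = kv_cache_bytes(4 * 1024)   # 4K-token request
long_ctx = kv_cache_bytes(32 * 1024)   # 32K-token request

print(short_ctx / GIB)       # full cache at 4K tokens, in GiB
print(long_ctx / GIB)        # full cache at 32K tokens, in GiB
print(0.1 * long_ctx / GIB)  # 32K tokens at a 10% cache budget, in GiB
```

Under these assumptions a single 4K sequence needs 2 GiB of KV cache and a 32K sequence needs 16 GiB, so a 10% budget at 32K brings one long request down to about 1.6 GiB, less than a full short-context cache.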

The interaction with quantization

Eviction and quantization compound: apply attention-aware eviction first to decide which tokens to keep, then quantize only the retained KV entries. Quantizing first and evicting afterwards spends quantization work on entries you will discard anyway.
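
As a sketch of that ordering, the snippet below keeps the highest-scoring tokens and only then quantizes what is left. Symmetric per-vector int8 quantization is used purely as a stand-in; the post does not prescribe a scheme. The helper names `quantize_int8` and `evict_then_quantize` are hypothetical, and the KV vectors are tiny made-up lists rather than real tensors.

```python
from typing import Dict, List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization of one vector: returns (codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def evict_then_quantize(
    kv: Dict[int, List[float]],
    scores: Dict[int, float],
    budget: int,
) -> Dict[int, Tuple[List[int], float]]:
    """Keep the `budget` highest-scoring positions, then quantize only those."""
    keep = sorted(scores, key=scores.get, reverse=True)[:budget]
    return {pos: quantize_int8(kv[pos]) for pos in sorted(keep)}


# Made-up data: position -> KV vector, and position -> accumulated attention.
kv = {0: [0.5, -1.0], 1: [0.1, 0.2], 2: [2.0, 1.0]}
scores = {0: 3.0, 1: 0.1, 2: 1.5}

compressed = evict_then_quantize(kv, scores, budget=2)
print(sorted(compressed))  # positions that were kept and quantized
```

Position 1 has the lowest score, so it is dropped before any quantization work is done on it.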

Implementation overhead

The main cost of attention-aware eviction is maintaining the cumulative attention score per token. At each decoding step, you read the current attention weights (already computed as part of the forward pass), add them to a running sum per token, and update the eviction candidate ranking.
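
The bookkeeping for one layer can be sketched as follows, with plain Python lists standing in for the attention tensors a real implementation would read from the forward pass. All function names are hypothetical, and the full re-sort in `eviction_order` is the simplest possible way to update the ranking, not an optimized one.

```python
from typing import List


def accumulate_attention(running_sum: List[float], head_weights: List[List[float]]) -> List[float]:
    """Add one decoding step's per-head attention rows into the per-token running sum.

    running_sum[t]     : accumulated weight for cached token t (mutated in place)
    head_weights[h][t] : weight head h places on cached token t at this step
    """
    for row in head_weights:
        if len(row) != len(running_sum):
            raise ValueError("each head must score every cached token")
        for t, weight in enumerate(row):
            running_sum[t] += weight
    return running_sum


def eviction_order(running_sum: List[float]) -> List[int]:
    """Token indices sorted from lowest to highest accumulated weight."""
    return sorted(range(len(running_sum)), key=running_sum.__getitem__)


def additions_per_step(num_layers: int, num_heads: int, cached_tokens: int) -> int:
    """Scalar additions needed to update every running sum once."""
    return num_layers * num_heads * cached_tokens


running = accumulate_attention([0.0, 0.0, 0.0], [[0.7, 0.1, 0.2], [0.6, 0.3, 0.1]])
order = eviction_order(running)                      # first entry is the next eviction candidate
full_cost = additions_per_step(32, 32, 32 * 1024)    # unevicted 32K-token cache
print(order, full_cost)
```

For 32 layers, 32 heads, and a full 32K-token cache, that comes to 33,554,432 additions per decoding step. The work shrinks in proportion to the cache budget once eviction is active.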

On a 7B model with 32 layers and 32 heads, maintaining this bookkeeping adds approximately 3-5% overhead to the decoding step. This is usually acceptable given the memory savings, but measure it on your hardware before deploying, because the tradeoff changes at different batch sizes and context lengths.