LRU is the wrong default for attention caches. A read of the research on attention-aware eviction, and why preserving the tokens models actually look at can compound memory savings across decoder layers on long-context workloads.
Why LRU is the wrong default
Least recently used (LRU) eviction makes sense for a filesystem cache or a CPU cache, where recent access is a good proxy for future access. For attention KV caches, it is the wrong heuristic.
In transformer attention, not all tokens are attended to equally. A few tokens in any context receive disproportionate attention weight: the first token (often a BOS token or system instruction anchor), tokens containing key facts the current generation step depends on, and tokens that receive high weight in many attention heads simultaneously. These "heavy hitter" tokens are attended to at almost every step, so their importance has little to do with how recently they were last touched.
LRU has no concept of attention weight; it tracks only the time of last access. A heavy hitter that was last accessed two steps ago is ranked purely by that timestamp, below any token touched since, even one that has received negligible attention overall. When the cache fills and LRU evicts, it will sometimes evict tokens that every subsequent generation step needs.
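A toy replay makes the failure mode concrete. This is a minimal sketch, not a real KV cache: the function name `lru_evict_order` is hypothetical, and an "access" is assumed to mean a step at which the token received non-trivial attention.

```python
from collections import OrderedDict

def lru_evict_order(access_log, capacity):
    """Replay accesses through an LRU cache and return the evicted tokens, in order."""
    cache = OrderedDict()
    evicted = []
    for token in access_log:
        if token in cache:
            cache.move_to_end(token)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                victim, _ = cache.popitem(last=False)  # pop the least recently used
                evicted.append(victim)
            cache[token] = True
    return evicted

# "fact" is accessed three times, more than any other token,
# but a short run of new tokens is enough to push it out.
log = ["fact", "a", "fact", "b", "fact", "c", "d", "e"]
print(lru_evict_order(log, capacity=3))  # ['a', 'b', 'fact']
```

The access count never enters the decision; only the position of the last access does.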
Attention-aware eviction
The core idea: instead of evicting the least recently used token, evict the token with the lowest cumulative attention weight across all heads and all decoding steps so far.
LRU: evict the token accessed least recently
- Ignores attention weight entirely
- Fast to implement
- Catastrophic when the "stale" token is a key fact

Attention-aware (H2O / similar):
- Track per-token accumulated attention weight
- Evict the token with lowest accumulated weight
- Preserve "heavy hitter" tokens regardless of recency
- Cost: O(n) bookkeeping per decoding step
The H2O (Heavy Hitter Oracle) paper from 2023 formalized this approach and showed that preserving only 5-20% of the KV cache using attention weight guidance maintained 95%+ of model quality on most tasks. LRU at the same cache budget showed substantially larger quality drops.
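The accumulate-then-evict idea can be sketched in a few lines. This is an illustration under stated assumptions, not the H2O implementation: `accumulate` and `evict_lowest` are hypothetical names, and each step's attention weights are assumed to arrive already summed over heads as a `{token_id: weight}` dict.

```python
def accumulate(scores, attn_weights):
    """Add one decoding step's attention weights to the per-token running totals."""
    for token_id, weight in attn_weights.items():
        scores[token_id] = scores.get(token_id, 0.0) + weight
    return scores

def evict_lowest(scores, budget):
    """Drop the lowest-scoring tokens until at most `budget` remain; return evicted ids."""
    evicted = []
    while len(scores) > budget:
        victim = min(scores, key=scores.get)  # O(n) scan for the lowest total
        del scores[victim]
        evicted.append(victim)
    return evicted

scores = {}
accumulate(scores, {0: 0.6, 1: 0.3, 2: 0.1})
accumulate(scores, {0: 0.5, 1: 0.1, 2: 0.1, 3: 0.3})
print(evict_lowest(scores, budget=2))  # [2, 3]
```

Note that a newly added token has had few steps to accumulate weight (token 3 above), so H2O also keeps a window of recent tokens alongside the heavy hitters; the sketch omits that part.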
The attention sink phenomenon
One consistent finding: the first 1-4 tokens of any context receive massive accumulated attention weight regardless of their content. This "attention sink" appears to be an artifact of how transformers are trained: the BOS token acts as a garbage collector for attention when no meaningful target is available.
Any attention-aware eviction policy needs to permanently preserve these sink tokens, or quality collapses immediately. The fix is simple: mark the first N tokens as eviction-protected and apply the eviction policy to the rest.
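A minimal sketch of that protection rule, assuming scores are keyed by token position and that `pick_victim` and `num_sink_tokens` are illustrative names rather than any library's API:

```python
def pick_victim(scores, num_sink_tokens=4):
    """Return the position with the lowest score, never choosing a sink position.

    scores: dict mapping token position -> accumulated score
    Returns None when every cached position is protected.
    """
    candidates = {pos: s for pos, s in scores.items() if pos >= num_sink_tokens}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)

scores = {0: 0.05, 1: 0.9, 2: 0.4, 3: 0.2, 4: 0.3, 5: 0.7}
print(pick_victim(scores, num_sink_tokens=2))  # 3
```

Without protection (`num_sink_tokens=0`) the same call would pick position 0, which is exactly the eviction the rule is meant to prevent. Filtering by position keeps the protection independent of whatever scoring scheme is applied to the remaining tokens.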
Eviction policy comparison (7B model, 32K context, 10% cache budget)
At 10% cache budget (keeping only 1 in 10 KV entries), attention-aware eviction retains 94-97% of baseline quality. LRU retains 71%. For most production workloads, that gap is the difference between acceptable degradation and user-visible quality loss.
ScissorHands and the locality observation
ScissorHands (Liu et al., 2023) makes an additional observation: if a token was important for a previous attention head, it tends to be important for subsequent heads. This locality property means you can amortize the eviction decision across heads rather than maintaining separate rankings per head.
The practical effect: ScissorHands requires less bookkeeping than per-head attention tracking while achieving similar quality, which reduces the per-step overhead of maintaining the eviction state.
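To put a rough number on the bookkeeping difference, the count below compares one running score per token per head with one shared score per token per layer. It only illustrates the storage argument and is not the ScissorHands algorithm; `score_entries` is a hypothetical helper.

```python
def score_entries(num_layers, num_heads, num_tokens, per_head):
    """Count the running-score entries needed to rank tokens for one sequence."""
    heads_tracked = num_heads if per_head else 1
    return num_layers * heads_tracked * num_tokens

per_head = score_entries(32, 32, 32 * 1024, per_head=True)
shared = score_entries(32, 32, 32 * 1024, per_head=False)
print(per_head, shared, per_head // shared)  # 33554432 1048576 32
```

For a 32-layer, 32-head model at 32K tokens, sharing one ranking across heads cuts the number of tracked scores by a factor equal to the head count.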
When does it actually matter
Attention-aware eviction is relevant when:
- You are serving long-context requests (8K+ tokens) and memory is a constraint.
- You are batching multiple requests and need to fit more sequences per GPU.
- You are implementing a sliding window over a very long document and cannot hold the full KV cache.
For short-context requests (under 4K tokens), the KV cache is small enough that eviction is rarely necessary. The complexity is not worth it.
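A back-of-the-envelope size estimate shows where the threshold comes from. The defaults below are assumptions for a typical 7B configuration (32 layers, 32 heads, head dimension 128, 16-bit values, standard multi-head attention with no grouped-query sharing); substitute your model's values.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_heads=32, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for one sequence; the factor of 2 covers keys and values."""
    return 2 * num_layers * num_heads * head_dim * bytes_per_value * seq_len

GIB = 2 ** 30
print(kv_cache_bytes(4 * 1024) / GIB)   # 2.0
print(kv_cache_bytes(32 * 1024) / GIB)  # 16.0
```

Under these assumptions a 4K-token sequence needs about 2 GiB of cache, while a 32K-token sequence needs about 16 GiB per sequence, which is where eviction starts to decide how many requests fit on one GPU.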
Implementation overhead
The main cost of attention-aware eviction is maintaining the cumulative attention score per token. At each decoding step, you read the current attention weights (already computed as part of the forward pass), add them to a running sum per token, and update the eviction candidate ranking.
On a 7B model with 32 layers and 32 heads, maintaining this bookkeeping adds approximately 3-5% overhead to the decoding step. This is usually acceptable given the memory savings, but measure it on your hardware before deploying, because the tradeoff changes at different batch sizes and context lengths.
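The per-step update and a simple way to time it might look like the sketch below. It is pure Python for readability, so absolute timings will not resemble a GPU kernel; `update_scores` and `timed` are hypothetical helpers, and the attention weights for one layer are assumed to arrive as one list per head.

```python
import time

def update_scores(running, head_weights):
    """Add one layer's attention weights for the current step to per-token running sums.

    running: list of floats, one per cached token (grown in place for new tokens)
    head_weights: list of per-head lists, each holding one weight per cached token
    """
    num_tokens = len(head_weights[0])
    running.extend([0.0] * (num_tokens - len(running)))  # make room for new tokens
    for weights in head_weights:
        for i, w in enumerate(weights):
            running[i] += w
    return running

def timed(fn, *args, repeats=100):
    """Return the average wall-clock seconds per call of fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

running = []
update_scores(running, [[0.5, 0.5], [0.25, 0.75]])
print(running)  # [0.75, 1.25]
```

To estimate overhead on your own stack, time a decoding step with and without the bookkeeping call at the batch sizes and context lengths you actually serve, and compare the two averages.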