Modular attention reuse is fast, until a module shifts by one token and silently poisons downstream completions. A look at the boundary-stability problem and what the research community has proposed for versioning reusable prefixes.
How modular prefix caching works
Modular prompt caching splits a prompt into reusable segments: a system prompt, a static context chunk, a few-shot example block, and a query. Each segment is cached independently by hash. If the system prompt has not changed since the last request, you skip re-encoding it and load the cached key-value tensors directly.
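A minimal sketch of that lookup, assuming an in-memory dictionary keyed by content hash. The names `kv_cache`, `encode_segment`, `load_or_encode`, and `run_request` are hypothetical, and a short string stands in for the real key-value tensors:

```python
import hashlib

# Hypothetical in-memory store: content hash -> cached KV payload.
# A real system would hold key-value tensors; a string stands in for them here.
kv_cache = {}
encode_calls = []  # every segment that actually went through the "model"


def segment_key(text):
    """Content hash used as the cache key for one segment."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def encode_segment(text):
    """Stand-in for the expensive forward pass that produces KV tensors."""
    encode_calls.append(text)
    return f"kv:{segment_key(text)[:8]}"


def load_or_encode(text):
    """Return the cached payload for a segment, encoding it only on a miss."""
    key = segment_key(text)
    if key not in kv_cache:
        kv_cache[key] = encode_segment(text)
    return kv_cache[key]


def run_request(modules, query):
    """Reuse cached modules; the query is per-request and always encoded."""
    payloads = [load_or_encode(m) for m in modules]
    payloads.append(encode_segment(query))
    return payloads


modules = ["<system prompt>", "<static context chunk>", "<few-shot block>"]
run_request(modules, "first question")   # 3 module misses + 1 query
run_request(modules, "second question")  # 3 module hits + 1 query
```

On the second request only the query is encoded; the three modules are served from the dictionary.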
This is legitimately useful. System prompts for production deployments are often 500-2000 tokens and change infrequently. Caching them across requests cuts the effective input token count and speeds up time-to-first-token substantially.
The drift problem
Here is the failure mode: a module that looks stable is not stable. Consider a system prompt template like this one:
You are a helpful assistant for Acme Corp.
Today is {current_date}.
You must respond in formal English.
Customer plan: {customer_plan_tier}
...
[additional 800 tokens of instructions]
The current_date field updates daily. The customer_plan_tier field updates whenever the customer upgrades. Either change invalidates the entire module hash, so the 800 tokens of static instructions get re-encoded on every cache miss.
The problem is not just efficiency loss. When the hash changes, any downstream modules that were cached conditioned on the previous version of this module are also stale. Their cached key-value tensors were computed with a different context prefix, so loading them is incorrect. The cascade looks like this:
Module A: system prompt [INVALIDATED - date changed]
|
+-- Module B: context chunk [STALE - depends on A]
|
+-- Module C: few-shot [STALE - depends on B]
|
+-- Query [must recompute everything]
If Module A changes once per day, the cache hit rate on a 4-module prompt collapses from a theoretical 75% (modules B, C, query still match) to near zero because the prefix hash that modules B and C were cached under no longer exists in the cache.
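The cascade can be reproduced with a few lines of hashing. This is a sketch that assumes each module's cache key covers everything up to and including that module (a chained prefix hash); the function name `chained_keys` and the example module strings are hypothetical:

```python
import hashlib


def chained_keys(modules):
    """Cache key for module i covers modules 0..i, so a change in an
    earlier module changes the key of every later module as well."""
    keys = []
    running = hashlib.sha256()
    for text in modules:
        running.update(text.encode("utf-8"))
        running.update(b"\x00")  # separator keeps module boundaries unambiguous
        keys.append(running.copy().hexdigest())
    return keys


day_one = ["Today is 2026-03-27. [static instructions]", "context chunk", "few-shot block"]
day_two = ["Today is 2026-03-28. [static instructions]", "context chunk", "few-shot block"]

# Only the date in Module A differs, yet no key survives.
date_change = [a == b for a, b in zip(chained_keys(day_one), chained_keys(day_two))]

# Changing only the last module leaves the earlier keys intact.
new_examples = day_one[:2] + ["revised few-shot block"]
tail_change = [a == b for a, b in zip(chained_keys(day_one), chained_keys(new_examples))]
```

Here `date_change` is `[False, False, False]` and `tail_change` is `[True, True, False]`: the earlier a change sits in the prompt, the more cached entries it orphans.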
Measuring the actual cost
[Figure: cache hit rate vs. volatile element position (simulated, 10K requests/day)]
A volatile element in the system prompt makes the cache nearly useless even though 95% of the prompt content is stable.
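To see why position matters so much, here is a back-of-the-envelope sketch. It assumes prefix-chained cache keys, so a change in one module invalidates that module and everything after it. The token counts are hypothetical and the code does not reproduce the simulation behind the figure:

```python
def surviving_fraction(token_counts, volatile_index):
    """Fraction of prompt tokens whose cached KV is still valid after the
    module at volatile_index changes, assuming prefix-chained cache keys."""
    if not 0 <= volatile_index < len(token_counts):
        raise IndexError("volatile_index out of range")
    return sum(token_counts[:volatile_index]) / sum(token_counts)


# Hypothetical token counts: system prompt, context chunk, few-shot block, query.
token_counts = [800, 1200, 400, 50]
fractions = [surviving_fraction(token_counts, i) for i in range(len(token_counts))]
```

With the volatile element in the first module nothing survives a change (`fractions[0]` is `0.0`), while a volatile element confined to the query leaves every token before it reusable.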
Three fixes, in order of preference
1. Separate stable from volatile content
The obvious fix: move volatile elements out of reusable modules. Instead of embedding the current date in the system prompt, pass it as part of the per-request query context. The 800 tokens of stable instructions stay in a module with a stable hash. The volatile date lives in the query, which is never cached anyway.
Module A (stable, cached):
  You are a helpful assistant for Acme Corp.
  You must respond in formal English.
  [800 tokens of static instructions]
Per-request context (not cached):
  Today is 2026-03-28.
  Customer plan: enterprise
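In code, the split might look like the sketch below. The helper names `module_key` and `build_request`, the example queries, and the `starter` plan tier are hypothetical; the point is only that the cache key is computed from the stable text and never sees the date or the plan:

```python
import hashlib
from datetime import date

# Stable module: contains no per-request or per-customer values.
STABLE_SYSTEM_PROMPT = (
    "You are a helpful assistant for Acme Corp.\n"
    "You must respond in formal English.\n"
    "[800 tokens of static instructions]"
)


def module_key(text):
    """Content hash used as the cache key for a reusable module."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_request(current_date, plan_tier, user_query):
    """Keep volatile values out of the cached module and in the query context."""
    per_request = (
        f"Today is {current_date.isoformat()}.\n"
        f"Customer plan: {plan_tier}\n"
        f"{user_query}"
    )
    return {
        "cached_module": STABLE_SYSTEM_PROMPT,
        "cache_key": module_key(STABLE_SYSTEM_PROMPT),
        "per_request": per_request,
    }


a = build_request(date(2026, 3, 28), "enterprise", "How do I reset my password?")
b = build_request(date(2026, 3, 29), "starter", "Where is my invoice?")
```

Requests `a` and `b` differ in date, plan, and query, yet share the same `cache_key`, so the static instructions are encoded once and reused.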
2. Content-addressed versioning
For modules that genuinely change but change rarely, version them explicitly. A module gets a version identifier (e.g., a semantic version or a content hash). Cached entries include the version. When a module updates, the old version stays in cache until it expires, so in-flight requests against the old version still hit. This trades cache storage for hit rate during transition periods.
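A rough sketch of the idea, assuming a content hash as the version identifier and a fixed TTL. The class `VersionedModuleCache` and its methods are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow:

```python
import hashlib


class VersionedModuleCache:
    """Entries are keyed by (module name, version), so an update adds a new
    entry instead of overwriting the old one. Old versions age out by TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # (name, version) -> (payload, expires_at)

    @staticmethod
    def version_of(text):
        """Content hash used as the version identifier."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

    def put(self, name, text, payload, now):
        version = self.version_of(text)
        self._entries[(name, version)] = (payload, now + self.ttl)
        return version

    def get(self, name, version, now):
        entry = self._entries.get((name, version))
        if entry is None:
            return None
        payload, expires_at = entry
        if now >= expires_at:
            del self._entries[(name, version)]  # lazily evict expired entries
            return None
        return payload


cache = VersionedModuleCache(ttl_seconds=3600)
v1 = cache.put("system", "instructions v1", "kv-v1", now=0)
v2 = cache.put("system", "instructions v2", "kv-v2", now=100)

old_hit = cache.get("system", v1, now=200)   # in-flight request on the old version
new_hit = cache.get("system", v2, now=200)   # new requests use the new version
expired = cache.get("system", v1, now=3600)  # old version has aged out
```

Both versions are served during the transition window, and the old one disappears once its TTL passes. The cost is holding two copies of the module's cached state for that period.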
3. Partial prefix reuse
Some inference frameworks support partial prefix reuse: if the first N tokens of a module match the cached version and only the last K tokens changed, reuse the first N positions and recompute only the changed suffix. This requires token-level cache granularity rather than module-level granularity, which is architecturally heavier but eliminates most of the cascade problem.
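The bookkeeping for this is a longest-common-prefix comparison over token IDs. The sketch below uses made-up token IDs and hypothetical function names. It only plans the reuse and does not touch real KV tensors:

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading token IDs shared by the cached and new sequences."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        n += 1
    return n


def plan_reuse(cached_tokens, new_tokens):
    """Positions before the first mismatch can be reused; every position
    from the mismatch onward must be recomputed."""
    n = common_prefix_len(cached_tokens, new_tokens)
    return {"reuse_positions": n, "recompute_positions": len(new_tokens) - n}


# Made-up token IDs: the two sequences agree on the first five tokens.
cached = [101, 7, 7, 42, 9, 13]
updated = [101, 7, 7, 42, 9, 99, 5]
plan = plan_reuse(cached, updated)
```

Here `plan` is `{"reuse_positions": 5, "recompute_positions": 2}`. Because each position's key-value state depends on all tokens before it, a change near the start of a module still forces almost everything to be recomputed, so this approach pays off most when volatile content sits at the end.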
One thing the research mostly skips
Most papers on prefix caching measure hit rate on static, pre-defined prompt sets. Production prompts are not static. They drift: instructions get updated, context schemas change, new few-shot examples replace old ones. The cache management policy needs to account for drift, not just cold-start behavior.
The useful metric is not static hit rate but hit rate under realistic update frequency. If your system prompt updates three times a week, a cache with a 24-hour TTL and version isolation performs very differently from a cache that invalidates on any change.