Adaptive Semantic Caching: One Threshold Isn't Enough

Research · 2026-04-10 · 4 min read

A single global similarity cutoff is a blunt instrument across model families and workloads. A read of the recent literature on adaptive, per-embedding reuse bands: what the research suggests, and where one-size-fits-all thresholds tend to break.

The threshold problem

Semantic caches are built on a simple premise: embed the query, find the nearest cached query, return the cached answer if cosine similarity clears a threshold. Set that threshold to 0.85 and ship it.
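A minimal sketch of that premise in Python, using only the standard library. The names `cosine_similarity` and `cached_answer` and the list-of-pairs cache layout are assumptions made for illustration; a real deployment would use a vector index rather than a linear scan.

```python
import math
from typing import List, Optional, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cached_answer(
    query_vec: Sequence[float],
    cache: List[Tuple[Sequence[float], str]],
    threshold: float = 0.85,
) -> Optional[str]:
    """Return the answer stored with the nearest cached query, or None on a miss.

    `cache` is a list of (query_embedding, answer) pairs (hypothetical layout).
    """
    best_sim, best_answer = -1.0, None
    for vec, answer in cache:
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    # One global cutoff decides every request, whatever the workload.
    return best_answer if best_sim >= threshold else None
```

Everything that follows is about the last line: a single `threshold` applied to every kind of request.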

The problem is that 0.85 does not mean the same thing across workloads. Here are two query pairs from two different workloads, embedded with the same model:

similarity scores, same model, same threshold

Support chat queries
  "How do I reset my password?"
  "Can I change my login credentials?"
  cosine similarity: 0.91    ->  cache hit: SAFE, same answer

Code generation queries
  "Implement a BST with deletion"
  "Implement a BST with insertion and deletion"
  cosine similarity: 0.89    ->  cache hit: WRONG, different program

The second pair scores lower, still clears a 0.85 cutoff, and is the more dangerous reuse. Support chat has a constrained vocabulary, so similar wordings usually imply the same intent. Code tasks can be lexically adjacent but semantically distinct: "deletion" and "insertion and deletion" describe different programs.

A threshold tuned for 95% precision on support chat achieves roughly 71% precision on code tasks at the same cutoff. You are optimizing for the wrong distribution.

Why distributions differ

Similarity score distributions are workload-specific. Support chat queries cluster in a dense region because the vocabulary is small and user intent maps onto a limited set of answer types. Code and document QA queries scatter: two queries can share most of their words but require completely different responses.

similarity score distributions by workload (n=50K queries each)

Workload             Mean sim.   Std. dev.   Safe threshold   Hit rate at threshold
Support chat         0.87        0.06        0.82             63%
Document Q&A         0.79        0.11        0.88             31%
Code generation      0.74        0.14        0.92             14%
Reasoning / agents   0.71        0.17        0.94

The table above uses "safe" to mean the threshold at which false-positive cache hits drop below 2%. On support chat you can afford a loose threshold and still hit safely. On code generation, safe reuse requires such a high similarity that the cache is almost useless. A global threshold set at the support-chat level poisons code responses.
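One way to derive such a "safe" cutoff from labeled data is sketched below. It assumes a list of (similarity, reuse_was_correct) pairs from your own traffic, and it reads "false-positive rate" as wrong hits divided by all hits; both the data layout and that reading are assumptions, as are the function names.

```python
from typing import List, Optional, Tuple

Pair = Tuple[float, bool]  # (similarity to nearest cached query, reuse was correct)

def hit_stats(pairs: List[Pair], threshold: float) -> Tuple[float, float]:
    """Return (hit_rate, false_positive_rate) at a given cutoff."""
    hits = [ok for sim, ok in pairs if sim >= threshold]
    if not hits:
        return 0.0, 0.0
    wrong = sum(1 for ok in hits if not ok)
    return len(hits) / len(pairs), wrong / len(hits)

def safe_threshold(pairs: List[Pair], max_fp_rate: float = 0.02,
                   steps: int = 100) -> Optional[float]:
    """Lowest grid cutoff such that it and every higher cutoff stay under max_fp_rate.

    Scans downward from 1.0 and stops at the first cutoff that is too risky,
    because the false-positive rate need not fall monotonically with the cutoff.
    """
    safe = None
    for i in range(steps, -1, -1):
        t = i / steps
        _, fp_rate = hit_stats(pairs, t)
        if fp_rate >= max_fp_rate:
            break
        safe = t
    return safe
```

Running `hit_stats` at the returned cutoff gives the hit rate you pay for that safety, which is the trade-off the table's last two columns describe.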

Per-workload thresholds

The straightforward fix: classify each request by workload type and apply a per-class threshold. This requires a workload classifier running before the cache lookup, but that classifier can be small and fast (a 3-class logistic regression on prompt length, stop-word ratio, and presence of code fences gets you 89% accuracy in practice).
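A sketch of the two pieces this needs: the three features mentioned above, and a per-class threshold lookup. The class labels, the tiny stop-word list, and the fallback to the strictest threshold for unknown classes are illustrative assumptions; the threshold values are the "safe threshold" column from the table above. Training the classifier itself is left out.

```python
import re

# Tiny illustrative stop-word list (assumption); a real one would be longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "i", "my", "how",
              "can", "to", "of", "in", "and", "with"}

FENCE = "`" * 3  # markdown code-fence marker

# Safe thresholds from the table above, keyed by hypothetical class labels.
SAFE_THRESHOLDS = {
    "support_chat": 0.82,
    "document_qa": 0.88,
    "code_generation": 0.92,
    "reasoning_agents": 0.94,
}
STRICTEST = max(SAFE_THRESHOLDS.values())

def prompt_features(prompt: str) -> dict:
    """Cheap features a small workload classifier could consume."""
    words = re.findall(r"[A-Za-z']+", prompt.lower())
    stop = sum(1 for w in words if w in STOP_WORDS)
    return {
        "length": len(prompt),
        "stop_word_ratio": stop / len(words) if words else 0.0,
        "has_code_fence": FENCE in prompt,
    }

def threshold_for(workload_class: str) -> float:
    """Per-class cutoff; unknown classes fall back to the strictest value."""
    return SAFE_THRESHOLDS.get(workload_class, STRICTEST)
```

Falling back to the strictest threshold means a misclassified or unseen workload costs hit rate rather than correctness.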

Per-workload thresholds capture the first-order variance. The second-order problem is that even within a workload, not all queries have the same reuse tolerance. A short, template-like support query can safely reuse at 0.80. A long, context-heavy support query with specific account details should not reuse below 0.95.

Adaptive bands

The research step beyond per-workload thresholds is adaptive bands: a per-query estimate of whether reuse is safe, based on the empirical quality of cached responses in each similarity score bucket.

The mechanics: for each workload class, you maintain a lookup table mapping similarity score buckets (0.80-0.82, 0.82-0.84, etc.) to the historical cache-hit precision at that score. When a new code query arrives with similarity 0.86 to a cached query, you look up "what fraction of code queries with 0.86 similarity were correctly answered by the cache?" and decide from there.

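adaptive threshold lookup (Python sketch; cosine_similarity, lookup, and TARGET_PRECISION are defined elsewhere)

import math

def should_cache_hit(query_embedding, cached_embedding, workload_class):
    sim = cosine_similarity(query_embedding, cached_embedding)
    bucket = math.floor(sim * 50)          # integer index of the 0.02-wide bucket: k covers [k/50, (k+1)/50)
    precision = lookup[workload_class].get(bucket, 0.0)   # no history for this bucket: do not reuse
    return precision >= TARGET_PRECISION   # e.g. 0.95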

This does require feedback signal: you need to know whether cache hits were actually correct. In production that signal comes from downstream quality checks, user corrections, or A/B comparison against fresh LLM calls on a sample of traffic.
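How that feedback could be turned into the per-bucket precision table is sketched below. The record layout (workload class, similarity, whether the hit was correct), the integer bucket keys, and the `min_samples` guard are assumptions of this sketch rather than a description of any particular system.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BUCKETS_PER_UNIT = 50  # bucket width 0.02

def bucket_index(sim: float) -> int:
    """Integer bucket key: k covers similarities in [k/50, (k+1)/50)."""
    return math.floor(sim * BUCKETS_PER_UNIT)

def build_precision_table(
    feedback: Iterable[Tuple[str, float, bool]],
    min_samples: int = 20,  # arbitrary guard against thinly populated buckets
) -> Dict[str, Dict[int, float]]:
    """Aggregate (workload_class, similarity, hit_was_correct) records."""
    counts = defaultdict(lambda: [0, 0])  # (class, bucket) -> [correct, total]
    for workload_class, sim, correct in feedback:
        cell = counts[(workload_class, bucket_index(sim))]
        cell[0] += int(correct)
        cell[1] += 1
    table: Dict[str, Dict[int, float]] = {}
    for (workload_class, bucket), (correct, total) in counts.items():
        if total >= min_samples:
            table.setdefault(workload_class, {})[bucket] = correct / total
    return table

def reuse_is_safe(table, workload_class: str, sim: float,
                  target_precision: float = 0.95) -> bool:
    """Reuse only when the bucket has history and that history clears the target."""
    precision = table.get(workload_class, {}).get(bucket_index(sim))
    return precision is not None and precision >= target_precision
```

Buckets with too few samples are left out of the table, so they are treated as cache misses until enough feedback accumulates.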

What we found on our own traffic

On Bytevion's production traffic across four workload classes, moving from a global 0.87 threshold to per-workload adaptive bands had the following effects:

Cache false-positive rate dropped from 4.1% to 0.9%.
Cache hit rate on support traffic stayed flat (it was already well-calibrated).
Code and document Q&A cache hit rate dropped 6-9 percentage points because we had been hitting incorrectly and calling it a success.
Downstream quality scores improved 11 points on code tasks because those tasks stopped getting wrong cached responses.

The uncomfortable finding

Per-workload thresholds will often reduce your reported cache hit rate because you discover how many of your previous hits were wrong. That is not a cost. That is your cache becoming honest.

One more variable: the embedding model

Adaptive threshold strategy matters less than embedding model choice. Generic sentence embedding models trained on conversational data produce poor representations of code and structured queries. Models fine-tuned on domain-specific data produce tighter, more separable clusters that make any threshold strategy more effective.

If you are running a global 0.87 threshold with a conversational sentence transformer on code traffic, the problem is not the threshold. The problem is the embedding model, and no adaptive strategy will fully compensate for it.
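One rough way to compare embedding models on this axis is to measure how far apart the similarity scores of "same answer" pairs and "different answer" pairs sit, relative to their spread. The sketch below uses a Cohen's-d-style ratio; the choice of that statistic and the function name are assumptions for illustration, not something the results above were computed with.

```python
import math
from statistics import mean, pstdev
from typing import Sequence

def separation(same_answer_sims: Sequence[float],
               different_answer_sims: Sequence[float]) -> float:
    """Gap between the two mean similarities, in units of pooled standard deviation.

    Larger values mean the two groups overlap less, so any cutoff works better.
    """
    gap = mean(same_answer_sims) - mean(different_answer_sims)
    pooled = math.sqrt(
        (pstdev(same_answer_sims) ** 2 + pstdev(different_answer_sims) ** 2) / 2
    )
    if pooled == 0:
        return math.inf if gap > 0 else 0.0
    return gap / pooled
```

Computed on the same labeled pairs for two candidate models, the higher score points to the model whose similarities are easier to threshold.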
