Selective Augmentation: Shipping RAG Without Silent Regressions

Product, 4 min read

Retrieval rankings are noisy, and any filter that drops weak evidence can quietly drop the one critical document on a bad ranking day. A look at the research on selective compression for RAG and why evaluation harnesses matter as much as the filters.

The filter that seems safe

You have a RAG pipeline. Your retriever returns 10 documents ranked by relevance. The top 3 are reliably good. Documents 4-10 are noisier: sometimes useful, sometimes irrelevant. The obvious optimization is to keep only the top 3 and cut context window usage by roughly 70%, assuming the documents are of similar length.

The problem is that retrieval ranking is noisy, and the noise has a specific failure pattern: the document containing the one critical fact the query needs is often not the document most semantically similar to the query. It is the document the user does not know they need, so the wording of the query may point to it only weakly.

Why relevance scoring fails edge cases

Semantic similarity is good at finding documents that discuss the same topic as the query. It is not good at finding documents that contain the specific exception, edge case, or policy provision that makes the answer to this particular query different from the general answer.

a retrieval failure example
Query: "Can I return a product after 30 days if it arrived damaged?"

Top 3 retrieved documents:
  1. Return policy overview (similarity: 0.91)
     -> Discusses 30-day standard return window
  2. How to initiate a return (similarity: 0.87)
     -> Step-by-step return process
  3. Refund timeline FAQ (similarity: 0.84)
     -> Explains how long refunds take

Document 7 (similarity: 0.61):
  Damaged goods policy
  -> "Damaged-on-arrival items are eligible for return at any time
     within 90 days, regardless of the standard return window."

Without document 7, the model answers: "Returns are accepted within 30 days."
With document 7, the model answers: "For damaged-on-arrival items, the
return window is 90 days."

The correct document scored 0.61 because it is about damaged goods, not about returns generally. The query mentioned "damaged" but the retriever weighted "return" and "30 days" more heavily. This is normal retrieval behavior, not a bug.
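The example above can be replayed as a small sketch to show how a fixed cutoff drops the critical document. The scores for documents 1-3 and 7 come from the example; the other six documents and their scores are made-up placeholders, as are the document identifiers.

```python
# Scores for documents 1-3 and 7 are taken from the example above.
# The remaining entries are hypothetical placeholders that fill out a list of 10.
ranked = [
    ("return-policy-overview", 0.91),
    ("how-to-initiate-return", 0.87),
    ("refund-timeline-faq", 0.84),
    ("placeholder-doc-4", 0.78),
    ("placeholder-doc-5", 0.72),
    ("placeholder-doc-6", 0.66),
    ("damaged-goods-policy", 0.61),
    ("placeholder-doc-8", 0.55),
    ("placeholder-doc-9", 0.49),
    ("placeholder-doc-10", 0.43),
]


def top_k(docs, k):
    """Keep the k highest-scoring (doc_id, score) pairs."""
    return sorted(docs, key=lambda d: d[1], reverse=True)[:k]


def contains(docs, doc_id):
    """True if doc_id survived the cutoff."""
    return any(name == doc_id for name, _ in docs)


critical = "damaged-goods-policy"
for k in (3, 5, 10):
    kept = top_k(ranked, k)
    print(f"k={k}: critical doc kept = {contains(kept, critical)}")
```

With these placeholder scores the critical document is lost at k=3 and k=5 and survives only at k=10, which is the failure the example describes.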

Measuring the actual impact

Accuracy by cutoff: customer support Q&A, 500 held-out queries with known ground truth

Documents included      Queries answered correctly   Critical doc in top-N   False confidence rate
Top 3 only              71%                          68%                     29%
Top 5 only              79%                          77%                     21%
All 10                  88%                          88%                     12%
Top 3 + re-ranked 10    86%                          85%                     14%
Selective (weighted)    87%                          86%                     13%

False confidence rate: the model gave a confident, wrong answer rather than saying it did not know. This is the dangerous outcome: a user trusts a confident-sounding incorrect answer.

Compared with the top-3 cutoff, including all 10 documents raised accuracy from 71% to 88% and cut false confidence from 29% to 12%. The cost was higher context usage and marginally longer latency. For most customer-facing applications, that tradeoff favors the deeper context.
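As a rough sketch of how these three metrics could be computed from per-query results: the `EvalResult` record and its field names are assumptions for illustration, and how you decide that an answer is "correct" or an "abstention" (exact match, a grader model, human labels) is left open here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalResult:
    correct: bool              # answer matched the ground truth
    abstained: bool            # model said it did not know
    critical_in_context: bool  # the document holding the key fact reached the model


def summarize(results: List[EvalResult]) -> Dict[str, float]:
    """Aggregate per-query results into the three rates used in the table."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    # False confidence: wrong answer that was NOT an abstention.
    false_confident = sum(1 for r in results if not r.correct and not r.abstained)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "critical_doc_rate": sum(r.critical_in_context for r in results) / n,
        "false_confidence_rate": false_confident / n,
    }


# Toy data: 7 correct, 2 confidently wrong, 1 abstention.
results = (
    [EvalResult(True, False, True)] * 7
    + [EvalResult(False, False, False)] * 2
    + [EvalResult(False, True, False)]
)
print(summarize(results))
```

On the toy data this gives an accuracy of 0.7 and a false confidence rate of 0.2: the abstention counts against accuracy but not against false confidence, which is why the two are tracked separately.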

What "selective augmentation" actually means

Selective augmentation is not about filtering out low-ranked documents. It is about deciding, per-query, whether additional context is likely to change the answer.

The query "What is the capital of France?" does not need 10 retrieved documents. The retriever confidence is high, the answer is unambiguous, and including 10 documents adds noise.

The query "Am I eligible for early termination under my contract given that I relocated internationally?" almost certainly needs every relevant document available, even poorly-ranked ones, because the answer depends on a specific provision the model cannot guess.

selective augmentation decision (simplified)
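# score_query_answerability, top_k_docs, and THRESHOLD are placeholders for your own components
query_confidence = score_query_answerability(query)
# high confidence: factual, common, low-ambiguity query
# low confidence: policy-specific, exception-heavy, personal context

if query_confidence >= THRESHOLD:
    # simple query: a shallow context is enough
    context = top_k_docs(query, k=3)
else:
    # exception-prone query: keep every retrieved document
    context = top_k_docs(query, k=10)
    # or: context = all docs + a re-ranker pass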
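To make the decision concrete, here is a runnable toy version. The keyword lists, the penalty weights, and the `THRESHOLD` value are all assumptions chosen for illustration; a production scorer would more plausibly be a trained classifier evaluated on your own query distribution.

```python
import re

# Hypothetical marker lists: words hinting at personal context or conditions.
PERSONAL = {"i", "my", "me", "we", "our"}
CONDITIONAL = {"if", "unless", "after", "given", "eligible", "except", "under"}

THRESHOLD = 0.5  # assumed value; tune on held-out queries


def score_query_answerability(query: str) -> float:
    """Toy heuristic: start at 1.0 and subtract for each kind of marker found."""
    tokens = set(re.findall(r"[a-z']+", query.lower()))
    penalty = 0.0
    if tokens & PERSONAL:
        penalty += 0.4
    if tokens & CONDITIONAL:
        penalty += 0.4
    return 1.0 - penalty


def choose_depth(query: str, shallow: int = 3, deep: int = 10) -> int:
    """Return how many retrieved documents to pass to the model."""
    if score_query_answerability(query) >= THRESHOLD:
        return shallow
    return deep


for q in (
    "What is the capital of France?",
    "Can I return a product after 30 days if it arrived damaged?",
):
    print(choose_depth(q), q)
```

The heuristic sends the capital-of-France query to a depth of 3 and the damaged-return query to a depth of 10. It is deliberately crude; the point is only that the depth decision is made per query, before any document is dropped.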

The re-ranker partial solution

Re-ranking with a cross-encoder model (which scores a (query, document) pair jointly, rather than embedding them separately) substantially improves retrieval precision at small k. Because a cross-encoder attends over the query and document tokens together, it can pick up interactions between them, such as "damaged" qualifying "return", that the separately computed embeddings of a bi-encoder miss.

The cost: cross-encoders are roughly 10-100x slower than bi-encoder retrieval, because document embeddings cannot be precomputed and every (query, document) pair needs its own forward pass. Re-ranking 10 documents with a cross-encoder adds 20-50ms of latency depending on model size and hardware. For latency-sensitive applications, this may be unacceptable. For accuracy-critical applications, it is usually worth it.
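The shape of a re-ranking pass can be sketched with a pluggable pair scorer. Note that `overlap_scorer` below is a word-overlap stand-in so the example runs without a model; it is not a cross-encoder. In a real pipeline you would pass in a function that wraps your cross-encoder's pair-scoring call. All names and the sample documents are illustrative assumptions.

```python
from typing import Callable, List, Tuple

# Any function that scores a (query, document) pair jointly.
PairScorer = Callable[[str, str], float]


def rerank(
    query: str, docs: List[str], score_pair: PairScorer, k: int
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair, then keep the k best."""
    scored = [(doc, score_pair(query, doc)) for doc in docs]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def overlap_scorer(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words present in the document.

    This is NOT a cross-encoder; it only makes the sketch runnable.
    """
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


query = "return damaged product after 30 days"
docs = [
    "standard return window is 30 days",
    "refunds take five business days",
    "damaged product can return after 30 days within 90 days",
]
for doc, score in rerank(query, docs, overlap_scorer, k=2):
    print(f"{score:.2f}  {doc}")
```

The scorer is called once per candidate document, which is where the added latency comes from: re-ranking cost grows linearly with the number of candidates you pass in.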

The practical guideline
Do not filter retrieved documents based on a retrieval score cutoff without first measuring false confidence rate on your specific query distribution. Support chat, policy Q&A, and compliance workloads are the most sensitive. For these, include more documents and let the model decide what is relevant, rather than deciding for it at the retrieval layer.

Evaluation harnesses are not optional

The silent regression risk with RAG filters is real: you deploy a cutoff optimization, your p50 latency improves, and you feel good about it. Meanwhile, your false confidence rate on edge-case queries has increased significantly and you have no metric tracking it.

An evaluation harness that checks model outputs against ground truth for a representative set of edge cases is the only way to catch this. Run it before and after any change that touches retrieval depth or filtering logic.
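A before/after check can be as small as the sketch below. The metric names, the `check_regression` helper, and the 0.02 tolerances are assumptions for illustration; the sample numbers are the "All 10" and "Top 3 only" rows from the table above.

```python
from typing import Dict, List


def check_regression(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_fcr_increase: float = 0.02,   # assumed tolerance
    max_accuracy_drop: float = 0.02,  # assumed tolerance
) -> List[str]:
    """Return a list of failures; an empty list means the change may ship."""
    failures = []
    fcr_delta = candidate["false_confidence_rate"] - baseline["false_confidence_rate"]
    if fcr_delta > max_fcr_increase:
        failures.append(f"false confidence rate rose by {fcr_delta:.2f}")
    acc_delta = baseline["accuracy"] - candidate["accuracy"]
    if acc_delta > max_accuracy_drop:
        failures.append(f"accuracy dropped by {acc_delta:.2f}")
    return failures


# Rows from the table: all 10 documents (baseline) vs. top 3 only (candidate).
all_10 = {"accuracy": 0.88, "false_confidence_rate": 0.12}
top_3 = {"accuracy": 0.71, "false_confidence_rate": 0.29}

problems = check_regression(all_10, top_3)
if problems:
    print("BLOCK:", "; ".join(problems))
else:
    print("OK to ship")
```

Run as a gate in CI, a check like this would block the top-3 cutoff even though its latency numbers look better, because both the accuracy drop and the rise in false confidence exceed the tolerance.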