Retrieval rankings are noisy, and any filter that drops weak evidence can quietly drop the one critical document on a bad ranking day. A look at the research on selective compression for RAG and why evaluation harnesses matter as much as the filters.
The filter that seems safe
You have a RAG pipeline. Your retriever returns 10 documents ranked by relevance. The top 3 are reliably good. Documents 4-10 are noisier: sometimes useful, sometimes irrelevant. The obvious optimization is to filter to the top 3 and cut your context window usage by roughly 70%, assuming the documents are of similar length.
The problem is that retrieval ranking is noisy, and "noise" has a specific failure pattern: the document that contains the one critical fact the query needs is often not the most semantically similar document to the query. It is the document that the user does not know they need.
Why relevance scoring fails edge cases
Semantic similarity is good at finding documents that discuss the same topic as the query. It is not good at finding documents that contain the specific exception, edge case, or policy provision that makes the answer to this particular query different from the general answer.
Query: "Can I return a product after 30 days if it arrived damaged?"

Top 3 retrieved documents:
  1. Return policy overview (similarity: 0.91)
     -> Discusses 30-day standard return window
  2. How to initiate a return (similarity: 0.87)
     -> Step-by-step return process
  3. Refund timeline FAQ (similarity: 0.84)
     -> Explains how long refunds take

Document 7 (similarity: 0.61):
  Damaged goods policy
     -> "Damaged-on-arrival items are eligible for return within 90 days, regardless of the standard return window."

Without document 7, the model answers: "Returns are accepted within 30 days."
With document 7, the model answers: "For damaged-on-arrival items, the return window is 90 days."

The correct document scored 0.61 because it is about damaged goods, not about returns generally. The query mentioned "damaged" but the retriever weighted "return" and "30 days" more heavily. This is normal retrieval behavior, not a bug.
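To make the failure concrete, here is a minimal sketch that applies a rank cutoff to the four documents the example gives scores for. The `Retrieved` class and the helper names are hypothetical, introduced only for illustration; the documents whose scores the example does not list are simply left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    rank: int          # position in the retriever's ranking
    title: str
    similarity: float  # score quoted in the example above


# Only the documents the example gives scores for; the others are omitted.
RESULTS = [
    Retrieved(1, "Return policy overview", 0.91),
    Retrieved(2, "How to initiate a return", 0.87),
    Retrieved(3, "Refund timeline FAQ", 0.84),
    Retrieved(7, "Damaged goods policy", 0.61),
]


def apply_cutoff(results, k):
    """Keep only documents ranked in the top k."""
    return [doc for doc in results if doc.rank <= k]


def contains(results, title):
    """Report whether a document with this title survived the cutoff."""
    return any(doc.title == title for doc in results)


top3 = apply_cutoff(RESULTS, k=3)
top10 = apply_cutoff(RESULTS, k=10)

print(contains(top3, "Damaged goods policy"))   # False
print(contains(top10, "Damaged goods policy"))  # True
```

Nothing in the cutoff logic is wrong as code. The filter does exactly what it was asked to do, which is why the loss goes unnoticed.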
Measuring the actual impact
Accuracy by cutoff (customer support Q&A, 500 held-out queries with known ground truth)
False confidence rate is the share of queries where the model gave a confident, wrong answer rather than saying it did not know. This is the dangerous outcome: a user trusts a confident-sounding incorrect answer.
Including all 10 documents cut false confidence from 29% to 12%. The cost was higher context usage and marginally longer latency. For most customer-facing applications, that tradeoff is straightforward.
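As a rough sketch of how that metric could be computed, assume each evaluated query has already been labeled with whether the answer was correct and whether the model abstained. The `Outcome` record and its field names are assumptions for illustration, and the toy data below is not the article's result set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    correct: bool    # answer matched the ground truth
    abstained: bool  # model said it did not know


def false_confidence_rate(outcomes):
    """Share of queries answered confidently but incorrectly."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    confident_errors = sum(
        1 for o in outcomes if not o.correct and not o.abstained
    )
    return confident_errors / len(outcomes)


# Toy data: 10 queries, 2 confident errors, 1 abstention.
toy = (
    [Outcome(correct=True, abstained=False)] * 7
    + [Outcome(correct=False, abstained=False)] * 2
    + [Outcome(correct=False, abstained=True)]
)

print(false_confidence_rate(toy))  # 0.2
```

Counting abstentions separately matters here: an "I don't know" is an unhelpful answer, but it is not the failure this metric is meant to catch.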
What "selective augmentation" actually means
Selective augmentation is not about filtering out low-ranked documents. It is about deciding, per-query, whether additional context is likely to change the answer.
The query "What is the capital of France?" does not need 10 retrieved documents. The retriever confidence is high, the answer is unambiguous, and including 10 documents adds noise.
The query "Am I eligible for early termination under my contract given that I relocated internationally?" almost certainly needs every relevant document available, even poorly ranked ones, because the answer depends on a specific provision the model cannot guess.
query_confidence = score_query_answerability(query)
# high confidence: factual, common, low-ambiguity query
# low confidence: policy-specific, exception-heavy, personal context
if query_confidence >= THRESHOLD:
    context = top_k_docs(k=3)
else:
    context = top_k_docs(k=10)
    # or: context = all docs + a re-ranker pass

The re-ranker partial solution
Re-ranking with a cross-encoder model (which scores a (query, document) pair jointly, rather than embedding them separately) substantially improves retrieval precision at small k. Because a cross-encoder reads the query and the document together, it can pick up on interactions between the two that the bi-encoders used in initial retrieval miss when they embed each text on its own.
The cost: cross-encoders are 10-100x slower than bi-encoder retrieval. Re-ranking 10 documents with a cross-encoder adds 20-50ms of latency depending on model size. For latency-sensitive applications, this may be unacceptable. For accuracy-critical applications, it is usually worth it.
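The shape of a re-ranking pass can be sketched without committing to a particular model. In the sketch below, `rerank` takes any function that scores a (query, document) pair; `token_overlap` is a deliberately crude toy stand-in used so the example runs on its own, not a cross-encoder. All names here are hypothetical.

```python
from typing import Callable, List, Tuple

PairScorer = Callable[[str, str], float]


def rerank(
    query: str, docs: List[str], score_pair: PairScorer, k: int
) -> List[Tuple[float, str]]:
    """Score every (query, document) pair jointly and keep the k best."""
    scored = [(score_pair(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer: fraction of query tokens that also appear in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


candidates = [
    "standard return window is 30 days",
    "how long refunds take",
    "damaged item return allowed within 90 days",
]

best = rerank("damaged item return after 30 days", candidates, token_overlap, k=1)
print(best[0][1])  # damaged item return allowed within 90 days
```

In a real pipeline the scorer would be the cross-encoder's prediction for the pair. One scoring call per candidate is where the added latency comes from, so the candidate count is the main knob.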
Evaluation harnesses are not optional
The silent regression risk with RAG filters is real: you deploy a cutoff optimization, your p50 latency improves, and you feel good about it. Meanwhile, your false confidence rate on edge-case queries has increased significantly and you have no metric tracking it.
An evaluation harness that checks model outputs against ground truth for a representative set of edge cases is the only way to catch this. Run it before and after any change that touches retrieval depth or filtering logic.
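A before-and-after check can be very small. The sketch below assumes each edge case is a query paired with a substring the correct answer must contain, which is a crude matching rule chosen to keep the example short. The two `answer_*` functions are hypothetical stand-ins for two pipeline configurations and return the answers from the damaged-goods example.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (query, substring the correct answer must contain)
AnswerFn = Callable[[str], str]


def accuracy(answer_fn: AnswerFn, cases: List[Case]) -> float:
    """Fraction of cases whose answer contains the expected substring."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        1 for query, expected in cases
        if expected.lower() in answer_fn(query).lower()
    )
    return hits / len(cases)


def regressed(
    before: AnswerFn, after: AnswerFn, cases: List[Case], tolerance: float = 0.0
) -> bool:
    """True if the changed pipeline scores worse than the baseline."""
    return accuracy(after, cases) < accuracy(before, cases) - tolerance


# Hypothetical stand-ins for the pipeline before and after a cutoff change.
def answer_all_docs(query: str) -> str:
    return "For damaged-on-arrival items, the return window is 90 days."


def answer_top3(query: str) -> str:
    return "Returns are accepted within 30 days."


EDGE_CASES = [
    ("Can I return a product after 30 days if it arrived damaged?", "90 days"),
]

print(regressed(answer_all_docs, answer_top3, EDGE_CASES))  # True
```

Wired into CI, a `True` here would block the change even though latency improved, which is the signal the latency dashboard cannot give you.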