Small prompt sets mislead, replicates are usually missing, and dataset contamination is easy to miss. A methodology-focused write-up on how to compare direct API calls, native caches, and compiled pipelines in a way that survives scrutiny.
Three ways a benchmark lies to you
Benchmarking an LLM pipeline against a direct API call should be straightforward. It rarely is. The three most common failure modes are small prompt sets, missing replicates, and dataset contamination.
1. Small prompt sets
A benchmark with 50 prompts feels thorough if you wrote all 50 carefully. It is not. LLM outputs are stochastic and quality metrics are noisy. With 50 prompts and no replicates, a difference of 3 percentage points in quality score is statistically indistinguishable from random variation.
[Figure: minimum sample size for reliable quality comparisons]
Detecting a 2% quality difference, which is a real and meaningful improvement in production, requires at least 2,000 single-pass prompts. Most teams benchmark with 50-100 and confidently report results.
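To get a feel for how sample size scales with the difference you want to detect, here is a minimal sketch using the textbook normal-approximation formula for comparing two proportions. It assumes a binary pass/fail quality metric, two independent (unpaired) arms, a two-sided test at alpha = 0.05, and 80% power. None of those assumptions come from the benchmark data discussed here, and `prompts_per_arm` is a name made up for this example. The answer is very sensitive to the baseline pass rate, and a paired design that runs the same prompts on both paths generally needs fewer, so treat the output as an order-of-magnitude check rather than a reproduction of the number quoted above.

```python
import math
from statistics import NormalDist


def prompts_per_arm(p_baseline: float, p_candidate: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate prompts needed per arm for a two-sided two-proportion z-test."""
    if p_baseline == p_candidate:
        raise ValueError("the two pass rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    # Sum of the Bernoulli variances of the two arms.
    variance = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    effect = p_candidate - p_baseline
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Assumed baseline pass rates, each compared against a 2-point improvement.
    for baseline in (0.50, 0.80, 0.95):
        n = prompts_per_arm(baseline, baseline + 0.02)
        print(f"baseline {baseline:.2f}: about {n} prompts per arm")
```

Whatever the baseline, the required count lands in the thousands, far above the 50-100 prompts most teams use.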
2. Missing replicates
Running each prompt once and comparing average scores hides temperature variance. At temperature 0.7, the same prompt can produce a correct answer on one run and an incorrect answer on another. A pipeline that looks 5% better might just have gotten lucky on this particular sample.
Replicates (running each prompt N times at the same nonzero temperature) let you estimate the variance per prompt. With 3 replicates per prompt, you can start to separate genuine quality improvement from sampling variance. With 1, you cannot.
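A minimal sketch of what replicates buy you, assuming scores are stored as a mapping from a prompt ID to the list of scores from its replicate runs (the data layout and the function name `replicate_summary` are hypothetical). It separates run-to-run noise on a fixed prompt from the spread between prompts.

```python
from statistics import mean, variance


def replicate_summary(scores_by_prompt: dict[str, list[float]]) -> dict[str, float]:
    """Summarise replicate scores: overall mean plus within- and between-prompt variance."""
    prompt_means = []
    within_variances = []
    for prompt_id, scores in scores_by_prompt.items():
        if len(scores) < 2:
            raise ValueError(f"{prompt_id} needs at least 2 replicates")
        prompt_means.append(mean(scores))
        within_variances.append(variance(scores))  # sample variance across replicates
    return {
        "mean_score": mean(prompt_means),
        # Average run-to-run noise for a fixed prompt (sampling variance).
        "within_prompt_variance": mean(within_variances),
        # Spread of per-prompt means (prompt difficulty plus leftover noise).
        "between_prompt_variance": variance(prompt_means) if len(prompt_means) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Made-up pass/fail scores for three prompts, three replicates each.
    example = {"p1": [1.0, 1.0, 0.0], "p2": [0.0, 0.0, 0.0], "p3": [1.0, 1.0, 1.0]}
    print(replicate_summary(example))
```

A large within-prompt variance is the signal that a single run per prompt would have been close to a coin flip for many prompts.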
3. Dataset contamination
If your benchmark prompts are drawn from public datasets, and your LLM was trained on those datasets, the model's answers reflect training data memorization, not pipeline effectiveness. This is particularly problematic when benchmarking retrieval-augmented pipelines: a contaminated benchmark makes RAG look better than it is, because the model can answer without reading the documents.
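    # Sketch only: model, quality, flag_for_review, benchmark_prompts and
    # retrieved_docs are placeholders for your own harness.
    for prompt in benchmark_prompts:
        # Run WITHOUT retrieval context
        answer_no_context = model(prompt)
        # Run WITH retrieval context
        answer_with_context = model(prompt + retrieved_docs)
        # Use >= rather than ==: the no-context answer being at least as good
        # is the warning sign, and exact score equality is fragile.
        if quality(answer_no_context) >= quality(answer_with_context):
            # Model may not be using the retrieved context.
            # Either contaminated or context is not needed.
            flag_for_review(prompt)

Contaminated examples should be removed from your benchmark or replaced with proprietary examples drawn from your production traffic.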
The comparison problem
Comparing "direct API path" to "optimized pipeline" is only valid if both paths receive identical inputs. This sounds obvious. It breaks in subtle ways.
When comparing a cache-enabled pipeline to a direct call, both paths may receive the exact same prompt while still differing in temperature or system prompt. The pipeline may add context that the direct call does not receive. The model version on the direct path may differ from the model version the pipeline routes to.
[Figure: common confounders in A/B benchmarks]
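One cheap guard against these confounders is to record the settings of each path and refuse to compare runs whose settings differ. The sketch below is one way to do that. `RunConfig` and `config_mismatches` are hypothetical names, and the three fields are just the settings mentioned above, so extend them to whatever your own stack varies.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Settings that must match across the two paths (field names are illustrative)."""
    model_version: str
    temperature: float
    system_prompt: str


def config_mismatches(baseline: RunConfig, candidate: RunConfig) -> dict[str, tuple]:
    """Return every field whose value differs, as {field: (baseline, candidate)}."""
    return {
        f.name: (getattr(baseline, f.name), getattr(candidate, f.name))
        for f in fields(baseline)
        if getattr(baseline, f.name) != getattr(candidate, f.name)
    }


if __name__ == "__main__":
    # Placeholder values; the model names are not real identifiers.
    direct = RunConfig("model-v1", 0.7, "You are a helpful assistant.")
    pipeline = RunConfig("model-v2", 0.7, "You are a helpful assistant.")
    mismatches = config_mismatches(direct, pipeline)
    if mismatches:
        print("Not comparable:", mismatches)
```

Running this check before every benchmark turns a silent confounder into a loud failure.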
A workable methodology
A benchmark setup that produces reliable, replicable results:
- Draw prompts from production logs, not public datasets. Minimum 1,000 prompts, stratified by workload type.
- Run each prompt 3 times on both the baseline and optimized paths with identical settings (model version, temperature, system prompt).
- Measure quality via a rubric-based LLM judge running at temperature 0, not via human evaluation on a small sample.
- Report confidence intervals, not point estimates. A quality improvement reported without error bars is not a result.
- Run the benchmark on a separate held-out set each time. Optimizing against a fixed benchmark inflates reported gains.
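The confidence-interval step in the list above can be sketched with a paired bootstrap over prompts. This is one common choice, not the only valid one. It assumes you already have one score per prompt for each path (for example, the mean over its 3 replicates), in the same prompt order. The function name and defaults are illustrative.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline: list[float], candidate: list[float],
                        n_resamples: int = 2000, confidence: float = 0.95,
                        seed: int = 0) -> tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for candidate minus baseline.

    Each list holds one score per prompt, in the same prompt order.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample prompts with replacement and record the mean difference each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    tail = (1 - confidence) / 2
    lower = resampled[int(tail * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return mean(diffs), lower, upper


if __name__ == "__main__":
    # Made-up per-prompt scores: the candidate wins on half the prompts.
    base = [0.5] * 100
    cand = [0.5, 0.7] * 50
    point, lower, upper = paired_bootstrap_ci(base, cand)
    print(f"improvement {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

If the interval includes zero, the honest conclusion is that the benchmark did not detect a difference.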
Cost benchmarking is simpler but also gets done wrong
Cost comparisons are more reliable than quality comparisons because token counts are measured exactly rather than scored by a judge, and input token counts are deterministic for a fixed prompt and tokenizer. But cost benchmarks break when the prompt distribution used for benchmarking differs from production. A benchmark that over-represents long-form document queries and under-represents short conversational queries will show different cost savings than production.
Use your production traffic distribution, not a curated set. Sample 10,000 requests from the last 30 days of production logs, stratified by workload type, and run cost calculations on that. Anything less specific is a guess.
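A minimal sketch of that sampling step, assuming each logged request is a dict with `workload_type`, `input_tokens`, and `output_tokens` keys. Those field names, and the per-1,000-token prices in the usage example, are placeholders rather than any real provider's schema or pricing.

```python
import random
from collections import defaultdict


def stratified_sample(requests: list[dict], sample_size: int, seed: int = 0) -> list[dict]:
    """Sample requests so each workload type keeps roughly its production share."""
    by_type: dict[str, list[dict]] = defaultdict(list)
    for request in requests:
        by_type[request["workload_type"]].append(request)
    rng = random.Random(seed)
    sample = []
    for group in by_type.values():
        # Proportional allocation, with at least one request per workload type.
        k = max(1, round(sample_size * len(group) / len(requests)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample


def total_cost(requests: list[dict], input_price_per_1k: float,
               output_price_per_1k: float) -> float:
    """Sum token cost over requests, using per-1,000-token prices."""
    return sum(
        r["input_tokens"] / 1000 * input_price_per_1k
        + r["output_tokens"] / 1000 * output_price_per_1k
        for r in requests
    )


if __name__ == "__main__":
    # Synthetic logs: 80% short chat requests, 20% long document requests.
    logs = (
        [{"workload_type": "chat", "input_tokens": 100, "output_tokens": 50}] * 800
        + [{"workload_type": "document", "input_tokens": 4000, "output_tokens": 500}] * 200
    )
    sample = stratified_sample(logs, sample_size=100)
    print(len(sample), total_cost(sample, input_price_per_1k=1.0, output_price_per_1k=2.0))
```

Run the same cost function over the baseline and the optimized path on the identical sample, so the only thing that differs is the token usage.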