
Context Compilation: Framing the Problem

Product | 4 min read

Trimming tokens without losing meaning is the core problem behind nearly every cost-reduction story in production LLMs. This note gives a high-level framing of why it is harder than it looks and which research directions we find most promising.

The core problem

Most of the tokens in a production LLM prompt are not doing work. They provide context that the model can already infer from other tokens, repeat information from earlier in the conversation, or phrase instructions in three ways when one would do.

Context compilation is the process of reducing a prompt to its semantically necessary minimum before sending it to the model. The goal is not to summarize or paraphrase the original prompt. It is to remove the tokens that are not contributing to the answer.

Token categories in a production prompt

Prompt tokens can be usefully decomposed into six categories:

prompt token budget breakdown (production average, n=250K requests)
Category               Share    Compressible?
----------------------------------------------
Query / task intent     12%      No
Schema / constraints    18%      Partially
Retrieved context       31%      Yes (dedup/trim)
Conversation history    22%      Yes (high)
Repeated instructions   11%      Yes (high)
Formatting / padding     6%      Yes (high)
----------------------------------------------
Total compressible      ~52%     At full compression

The query and task intent are irreducible. Schema and constraints are partially compressible (structural elements must be preserved; verbose explanation can be trimmed). The remaining categories are where the wins come from.

Conversation history is the biggest target

In multi-turn conversations, history accumulates fast. By turn 10 of a support conversation, the prompt may include 8-12K tokens of history for a 200-token question. Most of that history is resolved turns: the user asked a question, the assistant answered, the issue is closed.

Resolved turns can be compressed aggressively. If the user asked "What is your return policy?" and the assistant answered with the full policy, the compiled version might be: "User asked about return policy; assistant explained 30-day policy. Issue resolved."

history compression rates by turn type
Turn type                Original tokens   Compiled tokens   Semantic loss
--------------------------------------------------------------------------
Resolved Q&A             ~400              ~30               < 1%
Unresolved issue         ~400              ~200              < 2%
Clarification exchange   ~300              ~80               < 1%
Active task context      ~600              ~500              < 0.5%
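
The resolved-turn case can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `Turn` type, its `resolved` flag, and the pre-written `summary` field are all assumptions introduced here, and deciding when a turn counts as resolved is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user: str
    assistant: str
    resolved: bool
    summary: str = ""  # hypothetical: one-line digest stored when the turn closes


def compile_history(turns: list[Turn]) -> list[str]:
    """Replace resolved turns with their digest; keep open turns verbatim."""
    compiled = []
    for turn in turns:
        if turn.resolved and turn.summary:
            compiled.append(turn.summary)
        else:
            # Unresolved or undigested turns are passed through untouched.
            compiled.append(f"User: {turn.user}\nAssistant: {turn.assistant}")
    return compiled


history = [
    Turn(
        user="What is your return policy?",
        assistant="You can return any item within 30 days of delivery for a full refund.",
        resolved=True,
        summary="User asked about return policy; assistant explained 30-day policy. Issue resolved.",
    ),
    Turn(
        user="My order still hasn't shipped.",
        assistant="I'm checking with the warehouse now.",
        resolved=False,
    ),
]

compiled = compile_history(history)
```

Only the closed turn shrinks; the open shipping issue keeps its full text, which mirrors the much lower compression rate the table shows for unresolved issues.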

Retrieved context: the deduplication case

RAG pipelines retrieve multiple documents, and those documents often share content. A knowledge base about a product may have a support article, a FAQ entry, and a product description that all cover the same feature from slightly different angles.

Context compilation deduplicates across retrieved chunks. The first document that describes a concept is kept. Subsequent documents are scanned for information not present in the first; only that new information is included.

deduplication pass (conceptual)
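# retrieve, ranked_by_relevance, extract_facts, and summarize are
# placeholders for pipeline-specific implementations.
chunks = retrieve(query, k=5)
compiled = []
seen_facts = set()  # facts already covered by a higher-ranked chunk

for chunk in ranked_by_relevance(chunks):
    # extract_facts must return a set for the difference to work
    new_facts = extract_facts(chunk) - seen_facts
    if new_facts:
        # keep only the part of the chunk that adds new information
        compiled.append(summarize(chunk, keep=new_facts))
        seen_facts |= new_facts

# compiled is typically 40-60% of total retrieved tokens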
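
For readers who want something they can run, here is a self-contained variant of the pass above. It swaps real fact extraction for a much cruder stand-in: each sentence counts as one "fact", and two sentences match only if they are identical after lowercasing and stripping punctuation. The function names and sample chunks are illustrative assumptions, and a real pipeline would need semantic matching rather than string matching.

```python
import re


def extract_facts(chunk: str) -> list[str]:
    """Stand-in for fact extraction: treat each sentence as one fact."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return [s for s in sentences if s]


def normalize(fact: str) -> str:
    """Case- and punctuation-insensitive key used to detect repeats."""
    return re.sub(r"[^a-z0-9 ]", "", fact.lower()).strip()


def dedup_chunks(chunks: list[str]) -> list[str]:
    """Expects chunks sorted by relevance, most relevant first."""
    seen: set[str] = set()
    compiled = []
    for chunk in chunks:
        fresh = []
        for fact in extract_facts(chunk):
            key = normalize(fact)
            if key and key not in seen:
                seen.add(key)
                fresh.append(fact)
        if fresh:
            # Keep only the sentences not already covered by an earlier chunk.
            compiled.append(" ".join(fresh))
    return compiled


chunks = [
    "Exports run nightly. Exports are available in CSV format.",
    "Exports are available in CSV format. Admins can trigger an export manually.",
    "Exports run nightly.",
]
compiled = dedup_chunks(chunks)
```

The second chunk contributes only its new sentence and the third is dropped entirely, since everything it says was already covered.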

Measuring semantic preservation

The practical measurement challenge: how do you know if the compiled prompt produces equivalent answers to the original?

Two approaches are commonly used:

  • Embedding distance: compute the embedding of the original prompt and the compiled prompt and measure cosine distance. This captures semantic drift at the prompt level but does not directly measure downstream quality.
  • Answer comparison: run both prompts through the same model on a sample of requests and compare outputs using an LLM judge or automated metric. This directly measures quality impact but requires compute.
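
The answer-comparison approach can be sketched as a small harness. Both `run_model` and `judge` are hypothetical callables supplied by the caller (a model client and an LLM judge or automated metric, respectively); nothing here assumes a particular provider or API.

```python
from typing import Callable


def agreement_rate(
    pairs: list[tuple[str, str]],
    run_model: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of sampled requests whose compiled-prompt answer is judged equivalent.

    pairs holds (original_prompt, compiled_prompt) tuples.
    """
    if not pairs:
        raise ValueError("need at least one (original, compiled) pair")
    matches = 0
    for original, compiled in pairs:
        # Same model, both prompt versions, one verdict per pair.
        if judge(run_model(original), run_model(compiled)):
            matches += 1
    return matches / len(pairs)
```

Because each pair costs two model calls plus a judge call, this is better suited to an offline sample than to every live request.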

The embedding distance method is fast and can run inline during compilation as a quality gate. Set a threshold (for example, a maximum cosine distance of 0.05 from the original); if the compiled prompt exceeds it, fall back to the original. This prevents badly compiled prompts from reaching the model without adding significant latency.
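
A minimal sketch of such a gate, assuming the caller provides an `embed` function that maps text to a vector (hypothetical here; any embedding model could sit behind it). The 0.05 default is the example threshold from the text, not a recommended value.

```python
import math
from typing import Callable, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def gate(
    original: str,
    compiled: str,
    embed: Callable[[str], Sequence[float]],
    max_distance: float = 0.05,
) -> str:
    """Return the compiled prompt only if it stays close to the original."""
    distance = cosine_distance(embed(original), embed(compiled))
    if distance > max_distance:
        return original  # drifted too far: fall back to the uncompiled prompt
    return compiled
```

The gate fails safe: any prompt that drifts past the threshold is sent uncompiled, so the worst case is a missed saving rather than a degraded answer.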

The hardest part
Context compilation fails silently when it drops a critical detail that appears in only one place. A user who mentioned their account number three turns ago, once, may have that information stripped during history compression. The compiled prompt looks reasonable, the model answers confidently, and the answer is wrong because it operates on an assumed account. Systematic testing with edge cases is not optional.
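
One cheap edge-case check is to confirm that identifier-like strings survive compilation. The sketch below is an assumption-heavy illustration: the regex only recognizes digit runs of six or more characters and simple email addresses, and real systems would need patterns matched to their own account, order, and ticket formats.

```python
import re

# Assumed identifier shapes: long digit runs and basic email addresses.
IDENTIFIER = re.compile(r"\b\d{6,}\b|\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def dropped_identifiers(original: str, compiled: str) -> set[str]:
    """Identifiers present in the original text but missing after compilation."""
    before = set(IDENTIFIER.findall(original))
    after = set(IDENTIFIER.findall(compiled))
    return before - after


original = "My account number is 48213377. Also, what is your return policy?"
compiled = "User asked about return policy; assistant explained 30-day policy. Issue resolved."

missing = dropped_identifiers(original, compiled)
```

A non-empty result is a signal to fall back to the original history or to re-insert the missing values before the prompt is sent.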

What "compilation" gets you that summarization does not

Summarization rewrites the prompt. Compilation removes tokens while keeping the structure. The distinction matters because models are sensitive to prompt structure: instructions in a certain position, formatted in a certain way, get attended to differently than the same instructions reworded.

A compiled prompt is the original prompt with subtracted tokens. The remaining tokens are in their original positions, with their original formatting. This preserves attention patterns in a way that a summarized rewrite does not.
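
The subtraction property can be stated as a checkable invariant: the compiled token sequence must be obtainable from the original by deletions alone. The sketch below uses whitespace-separated words as a stand-in for model tokens, which is a simplification, and the function names are illustrative.

```python
def subtract(tokens: list[str], drop: set[int]) -> list[str]:
    """Remove tokens at the given indices; survivors keep their relative order."""
    return [tok for i, tok in enumerate(tokens) if i not in drop]


def is_subsequence(compiled: list[str], original: list[str]) -> bool:
    """True if compiled can be produced from original by deletions only."""
    remaining = iter(original)
    # `tok in remaining` advances the iterator, so order is enforced.
    return all(tok in remaining for tok in compiled)


original = "Please make sure to always reply in JSON".split()
compiled = subtract(original, drop={0, 1, 2, 3})
rewritten = "respond with JSON".split()
```

Here `compiled` passes the check while `rewritten` does not, even though both convey the same instruction; that is the difference between subtracting tokens and summarizing them.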