Trimming tokens without losing meaning is the core problem behind nearly every cost-reduction story in production LLMs. This post gives a high-level framing of why it is harder than it looks and which research directions we find most promising.
The core problem
Most of the tokens in a production LLM prompt are not doing work. They are providing context that the model already inferred from other tokens, repeating information from earlier in the conversation, or phrasing instructions in three ways when one would do.
Context compilation is the process of reducing a prompt to its semantically necessary minimum before sending it to the model. The goal is not to summarize or paraphrase the original prompt. It is to remove the tokens that are not contributing to the answer.
Token categories in a production prompt
A useful decomposition of prompt tokens into six categories:
Category                Share   Compressible?
----------------------------------------------
Query / task intent     12%     No
Schema / constraints    18%     Partially
Retrieved context       31%     Yes (dedup/trim)
Conversation history    22%     Yes (high)
Repeated instructions   11%     Yes (high)
Formatting / padding     6%     Yes (high)
----------------------------------------------
Total compressible      ~52%    At full compression
The query and task intent are irreducible. Schema and constraints are partially compressible (structural elements must be preserved; verbose explanation can be trimmed). The remaining categories are where the wins come from.
Conversation history is the biggest target
In multi-turn conversations, history accumulates fast. By turn 10 of a support conversation, the prompt may include 8-12K tokens of history for a 200-token question. Most of that history is resolved turns: the user asked a question, the assistant answered, the issue is closed.
Resolved turns can be compressed aggressively. If the user asked "What is your return policy?" and the assistant answered with the full policy, the compiled version might be: "User asked about return policy; assistant explained 30-day policy. Issue resolved."
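A minimal sketch of how resolved turns might be collapsed, under stated assumptions: each exchange carries a hypothetical `resolved` flag and a one-line `summary` prepared upstream. The `Exchange` type, its field names, and `compile_history` are illustrative and not part of any particular library; how resolution is detected and how the summary is written are left out.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    """One user question plus the assistant's reply (hypothetical structure)."""
    user_text: str
    assistant_text: str
    resolved: bool = False   # assumption: set upstream once the issue is closed
    summary: str = ""        # assumption: one-line summary written upstream


def compile_history(exchanges: list[Exchange]) -> list[str]:
    """Collapse resolved exchanges to their summary; keep open ones verbatim."""
    lines: list[str] = []
    for ex in exchanges:
        if ex.resolved and ex.summary:
            lines.append(ex.summary)
        else:
            lines.append(f"User: {ex.user_text}")
            lines.append(f"Assistant: {ex.assistant_text}")
    return lines


history = [
    Exchange(
        user_text="What is your return policy?",
        assistant_text="<full policy text>",
        resolved=True,
        summary="User asked about return policy; assistant explained "
                "30-day policy. Issue resolved.",
    ),
    Exchange(
        user_text="Can I return an opened item?",
        assistant_text="Let me check that for you.",
    ),
]
compiled_history = compile_history(history)
```

Unresolved exchanges pass through untouched, so the aggressive compression applies only to turns that are already closed.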
[Figure: history compression rates by turn type]
Retrieved context: the deduplication case
RAG pipelines retrieve multiple documents, and those documents often share content. A knowledge base about a product may have a support article, a FAQ entry, and a product description that all cover the same feature from slightly different angles.
Context compilation deduplicates across retrieved chunks. The first document that describes a concept is kept. Subsequent documents are scanned for information not present in the first; only that new information is included.
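```python
# retrieve, ranked_by_relevance, extract_facts, and summarize are placeholders
# for pipeline-specific helpers; extract_facts is assumed to return a set.
chunks = retrieve(query, k=5)
compiled = []
seen_facts = set()
for chunk in ranked_by_relevance(chunks):
    # Keep only facts that no higher-ranked chunk has already contributed.
    new_facts = extract_facts(chunk) - seen_facts
    if new_facts:
        compiled.append(summarize(chunk, keep=new_facts))
        seen_facts |= new_facts
# compiled is typically 40-60% of total retrieved tokens
```

Measuring semantic preservation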
The practical measurement challenge: how do you know if the compiled prompt produces equivalent answers to the original?
Two approaches are commonly used:
- Embedding distance: compute the embedding of the original prompt and the compiled prompt and measure cosine distance. This captures semantic drift at the prompt level but does not directly measure downstream quality.
- Answer comparison: run both prompts through the same model on a sample of requests and compare outputs using an LLM judge or automated metric. This directly measures quality impact but requires compute.
The embedding distance method is fast and can run in-line during compilation as a quality gate. Set a threshold (e.g., max cosine distance 0.05 from original), and if the compiled prompt exceeds it, fall back to the original. This prevents badly compiled prompts from reaching the model without adding significant latency.
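The gate described above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `embed` stands for whatever embedding function the pipeline already uses, and `toy_embed` with its fixed `VOCAB` is a hypothetical stand-in so the example runs without a model. The 0.05 threshold is the example value from the text.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; zero-length vectors count as distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def gate(
    original: str,
    compiled: str,
    embed: Callable[[str], Sequence[float]],
    max_distance: float = 0.05,
) -> str:
    """Use the compiled prompt only if it stays close to the original."""
    distance = cosine_distance(embed(original), embed(compiled))
    return compiled if distance <= max_distance else original


# Hypothetical stand-in for a real embedding model: word counts over a
# tiny fixed vocabulary. Replace with the pipeline's actual embedder.
VOCAB = ["return", "policy", "30", "days", "refund", "shipping"]


def toy_embed(text: str) -> list[float]:
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


prompt_to_send = gate(
    original="please return policy 30 days",
    compiled="return policy 30 days",
    embed=toy_embed,
)
```

Falling back to the original on failure means the gate can only cost tokens, never answer quality, at the price of one extra embedding call per prompt.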
What "compilation" gets you that summarization does not
Summarization rewrites the prompt. Compilation removes tokens while keeping the structure. The distinction matters because models are sensitive to prompt structure: instructions in a certain position, formatted in a certain way, get attended to differently than the same instructions reworded.
A compiled prompt is the original prompt with tokens subtracted. The remaining tokens keep their original order and their original formatting. This preserves attention patterns in a way that a summarized rewrite does not.
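One way to picture the subtraction property is as a keep-mask over the original token sequence. The sketch below is illustrative: `compile_by_subtraction` and `is_subsequence` are hypothetical names, whitespace-split words stand in for real model tokens, and how the mask is chosen is out of scope.

```python
def compile_by_subtraction(tokens: list[str], keep: list[bool]) -> list[str]:
    """Drop tokens whose mask entry is False; never reorder or rewrite."""
    if len(tokens) != len(keep):
        raise ValueError("tokens and keep mask must have the same length")
    return [tok for tok, kept in zip(tokens, keep) if kept]


def is_subsequence(candidate: list[str], original: list[str]) -> bool:
    """True if candidate can be obtained from original by deletions only."""
    remaining = iter(original)
    # `tok in remaining` advances the iterator, so order is enforced.
    return all(tok in remaining for tok in candidate)


tokens = ["Answer", "in", "JSON", ".", "Please", "respond", "in", "JSON", "."]
keep = [True, True, True, True, False, False, False, False, False]

compiled_tokens = compile_by_subtraction(tokens, keep)
summarized_tokens = ["JSON", "Answer"]  # a reworded rewrite, for contrast
```

A compiled prompt always passes the `is_subsequence` check against its source, while a summarized rewrite generally does not, which gives a cheap structural test for telling the two apart.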