Research and engineering notes
Surveys, deep-dives, and rough drafts. Ask if you want a long-form version early.
- Research · 1 min read
Adaptive Semantic Caching: One Threshold Isn't Enough
A single global similarity cutoff is a blunt instrument across model families and workloads. A read of the recent literature on adaptive, per-embedding reuse bands: what the research suggests, and where one-size-fits-all thresholds tend to break.
- Engineering · 1 min read
KV Cache Quantization: A Tour of the Trade-offs
Angular coding, JL projections, residual coding. Each family of KV quantization codecs shines under different constraints and breaks under others. A survey of the landscape for anyone weighing options for long-context inference.
- Engineering · 1 min read
Prompt Module Drift: The Hidden Cost of Prefix Caches
Modular attention reuse is fast, until a module shifts by one token and silently poisons downstream completions. A look at the boundary-stability problem and what the research community has proposed for versioning reusable prefixes.
- Engineering · 1 min read
Smart Routing as a Classification Problem
Picking the right model per request is fundamentally a classification task, and the training signal is already sitting in your logs. A survey of approaches to learned routing and why response feedback tends to beat hand-tuned rules.
- Research · 1 min read
Schema-Safe Prompt Compression
Query-aware pruning works well on free text and degrades predictably on structured prompts. Notes on why entity and schema preservation often matter more than raw reduction ratios, and what the research suggests about measuring both.
- Product · 1 min read
Context Compilation: Framing the Problem
Trimming tokens without losing meaning is the core problem behind nearly every cost-reduction story in production LLMs. A high-level framing of why it's harder than it looks and which research directions we find most promising.
- Product · 1 min read
Selective Augmentation: Shipping RAG Without Silent Regressions
Retrieval rankings are noisy, and any filter that drops weak evidence can quietly drop the one critical document on a bad ranking day. A look at the research on selective compression for RAG and why evaluation harnesses matter as much as the filters.
- Research · 1 min read
Benchmarking LLM Pipelines Without Fooling Yourself
Small prompt sets mislead, replicates are usually missing, and dataset contamination is easy to miss. A methodology-focused write-up on how to compare direct API calls, native caches, and compiled pipelines in a way that survives scrutiny.
- Engineering · 1 min read
Rethinking KV Eviction for Attention Caches
LRU is the wrong default for attention caches. A read of the research on attention-aware eviction, and why preserving the tokens that models actually attend to can compound memory savings across decoder layers on long-context workloads.