Smart Routing as a Classification Problem

Engineering · 2026-03-22 · 4 min read

Picking the right model per request is fundamentally a classification task, and the training signal is already sitting in your logs. A survey of approaches to learned routing and why response feedback tends to beat hand-tuned rules.

Routing is already a classification problem

Every LLM request can be answered by one of several models in your pool. Some requests need the strongest model available. Most do not. The question is which requests fall into which bucket, and how to decide at inference time without running the strong model first.

This is a classification problem with a known feature space (prompt characteristics) and a recoverable training signal (response quality feedback from your logs). The signal is not perfect, but it is already there, and recovering it costs far less than building a labeled dataset from scratch.

What makes a good routing feature

The features you want are computable from the prompt itself in a few milliseconds at most, ideally under 1ms, without calling any model in the pool. Anything that requires a forward pass of a large model to compute is self-defeating.

routing features: compute cost vs. signal quality

| Feature | Compute cost | Routing signal | Notes |
|---|---|---|---|
| Prompt token count | Negligible | Moderate | Longer prompts tend to need stronger models |
| Task type (regex) | Negligible | High for clear categories | Code fences, SQL keywords, etc. |
| Stop-word ratio | Negligible | Moderate | Low ratio signals technical/structured content |
| Perplexity (small LM) | ~2ms on CPU | High | Low perplexity = familiar template = cheap model ok |
| Small classifier (128-dim) | ~1ms on GPU | Very high | Best signal, requires labeled data |
| Embedding similarity to pool | ~5ms on GPU | High | No labels needed, uses cached examples |
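The three negligible-cost features in the table can be computed with string operations alone. The sketch below is one possible version, not a reference implementation: the stop-word list, the regex patterns, and the name `extract_features` are all illustrative assumptions, and whitespace splitting only approximates a real tokenizer's count.

```python
import re

# Hypothetical, deliberately short stop-word list; a real one would be longer.
STOP_WORDS = frozenset(
    "a an the of to in on for and or but is are was were be it this that with as at by".split()
)

# Illustrative task-type patterns; the first match wins.
TASK_PATTERNS = [
    ("code", re.compile(r"`{3}|\bdef \w+\(|\bclass \w+")),
    ("sql", re.compile(r"\bSELECT\b.+\bFROM\b|\bINSERT\s+INTO\b|\bCREATE\s+TABLE\b",
                       re.IGNORECASE | re.DOTALL)),
]


def extract_features(prompt: str) -> dict:
    """Compute routing features from the prompt text alone, with no model call."""
    words = re.findall(r"[A-Za-z']+", prompt.lower())
    stop_ratio = sum(w in STOP_WORDS for w in words) / len(words) if words else 0.0
    task_type = next(
        (name for name, pattern in TASK_PATTERNS if pattern.search(prompt)), "general"
    )
    return {
        # Whitespace split is a rough proxy for the tokenizer's count.
        "approx_tokens": len(prompt.split()),
        "task_type": task_type,
        "stop_word_ratio": round(stop_ratio, 3),
    }
```

The regexes here will misfire on some prose (for example, a sentence that happens to contain "select ... from"), which is the kind of edge case that motivates a learned router.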

In practice, a lightweight classifier trained on a few thousand labeled examples substantially outperforms rule-based routing. The rules approach captures obvious cases (code fences mean route to a code-specialized model) but misses the long tail.

Where the training signal comes from

You do not need a labeled dataset to start. You have a production log.

Every request that went to your strongest model has a response. Some of those responses would have been identical or equivalent if a cheaper model had handled them. You can identify which ones by running a sample of historical traffic through a cheaper model and comparing outputs.

mining training signal from logs (simplified)

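def label_requests(historical_requests, log, run_cheap_model, evaluate, threshold):
    """Label each logged request by whether the cheap model's answer is good enough."""
    labeled = []
    for request in historical_requests:
        strong_response = log[request]["response"]  # already paid for: read it from the log
        cheap_response = run_cheap_model(request)   # replay the request on the cheaper model
        quality_score = evaluate(strong_response, cheap_response)
        label = "cheap model ok" if quality_score >= threshold else "strong model required"
        labeled.append((request, label))
    return labeled

# Now you have a labeled dataset. Train a classifier on prompt features.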

The evaluation step is the bottleneck. For tasks with a ground truth (code that compiles, SQL that runs, factual questions with verifiable answers), you can automate quality scoring. For open-ended generation, you need an LLM judge or human evaluation on a sample.
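For the ground-truth case, SQL is a convenient example because equivalence can be checked by running both queries. The sketch below is one way an `evaluate` step for SQL might look, assuming you can supply a small schema-plus-fixtures script (`setup_sql`, a hypothetical parameter) and that SQLite is close enough to your production dialect. It compares result sets while ignoring row order, which is wrong for queries where `ORDER BY` matters.

```python
import sqlite3


def sql_results_match(strong_sql: str, cheap_sql: str, setup_sql: str) -> float:
    """Return 1.0 if both queries yield the same rows on a scratch database, else 0.0."""

    def run(query):
        conn = sqlite3.connect(":memory:")  # fresh scratch database per query
        try:
            conn.executescript(setup_sql)
            rows = conn.execute(query).fetchall()
            return sorted(rows, key=repr)  # order-insensitive comparison
        except (sqlite3.Error, sqlite3.Warning):
            return None  # query did not run
        finally:
            conn.close()

    strong_rows, cheap_rows = run(strong_sql), run(cheap_sql)
    if strong_rows is None or cheap_rows is None:
        return 0.0  # conservative: treat any failure as "not equivalent"
    return 1.0 if strong_rows == cheap_rows else 0.0
```

A binary score like this plugs straight into a threshold comparison; graded scores only become necessary for open-ended outputs.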

Three routing strategies compared

Hand-tuned rules

Fast to implement, fragile at the edges. Rules cover 60-70% of traffic well and mishandle the rest. They do not improve over time. Use this if you are starting from scratch and need something running today.
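In code, a hand-tuned router is usually an ordered list of checks. A minimal sketch follows; the model names (`code-model`, `strong-model`, `cheap-model`) and the 400-word cutoff are placeholders chosen for illustration, not values from any measurement.

```python
import re

CODE_FENCE = re.compile(r"`{3}")
SQL_HINT = re.compile(r"\bSELECT\b.+\bFROM\b", re.IGNORECASE | re.DOTALL)


def route_by_rules(prompt: str, long_prompt_words: int = 400) -> str:
    """Return a (hypothetical) model name using ordered, hand-written rules."""
    if CODE_FENCE.search(prompt) or SQL_HINT.search(prompt):
        return "code-model"      # obvious case: code or SQL present
    if len(prompt.split()) > long_prompt_words:
        return "strong-model"    # long prompts go to the strongest model
    return "cheap-model"         # everything else, including the long tail
```

Everything that falls through to the last line is where the fragility shows: a short prompt that needs careful reasoning is sent to the cheap model.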

Trained classifier

A 3-layer MLP or logistic regression on engineered features, trained on a few thousand labeled examples, typically reaches 88-92% routing accuracy on held-out traffic. This is the practical optimum for most deployments. Training takes minutes, inference adds under 1ms, and accuracy improves as you add more labeled data.
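To make the logistic regression option concrete, here is a dependency-free sketch trained by batch gradient descent. In practice you would more likely reach for an ML library; the two features, the toy rows, and the labels below are invented for illustration and say nothing about real accuracy.

```python
import math


def sigmoid(z: float) -> float:
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit weights and bias by batch gradient descent on log loss.

    rows: equal-length feature vectors, roughly unit scale.
    labels: 1 = strong model required, 0 = cheap model ok.
    """
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def needs_strong_model(features, weights, bias, threshold=0.5) -> bool:
    score = sum(w * xi for w, xi in zip(weights, features)) + bias
    return sigmoid(score) >= threshold


# Toy data: [prompt tokens / 1000, stop-word ratio]; values and labels are made up.
rows = [[0.05, 0.5], [0.1, 0.45], [0.08, 0.55], [0.9, 0.2], [1.2, 0.15], [0.8, 0.25]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(rows, labels)
```

The `threshold` argument is the practical tuning knob: lowering it sends more borderline traffic to the strong model, trading cost for fewer quality misses.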

Contextual bandit

A bandit continuously updates routing decisions based on reward signal (quality feedback). It handles distribution shift automatically and does not require an offline labeling step. The cost: it needs live feedback in the loop, adds implementation complexity, and takes several thousand requests to warm up.
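The simplest version of this idea is an epsilon-greedy bandit that keeps a running mean reward per (context, model) pair. The sketch below uses that simplification, with discrete context buckets such as a task type; it is not a full contextual bandit algorithm like LinUCB. The class name and the reward definition (for example, quality score minus a cost penalty) are assumptions.

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Per-context epsilon-greedy bandit over a fixed pool of models."""

    def __init__(self, models, epsilon=0.1, seed=None):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # (context, model) -> [pull count, running mean reward]
        self.stats = defaultdict(lambda: [0, 0.0])

    def choose(self, context):
        untried = [m for m in self.models if self.stats[(context, m)][0] == 0]
        if untried:
            return untried[0]  # try every model once per context first
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)  # explore
        # exploit: pick the model with the best observed mean reward
        return max(self.models, key=lambda m: self.stats[(context, m)][1])

    def update(self, context, model, reward):
        entry = self.stats[(context, model)]
        entry[0] += 1
        entry[1] += (reward - entry[1]) / entry[0]  # incremental mean
```

Each request calls `choose`, serves the response, then calls `update` once feedback arrives. The exploration share (`epsilon`) is the price paid during warm-up.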

routing strategy performance at different traffic levels

| Strategy | Accuracy (1K req) | Accuracy (10K req) | Accuracy (100K req) | Maintenance |
|---|---|---|---|---|
| Hand-tuned rules | 72% | 72% | 72% | Manual |
| Trained classifier | 83% | 91% | 93% | Retrain monthly |
| Contextual bandit | 61% | 87% | 94% | Automatic |

The bandit starts slower because it needs to explore. By 100K requests it has caught up with the trained classifier (94% vs. 93%) and handles distribution shifts that the classifier would miss until its next retrain. For high-volume, long-running deployments, it is the right choice. For lower-volume or shorter-lived deployments, a trained classifier is better.

The compounding effect

Routing accuracy compounds with caching. When you route a request to the right model, you also route it to that model's semantic cache. A misrouted request is looked up against the wrong model's history, which lowers the cache hit rate for both models.

In our measurements, improving routing accuracy from 72% (rules) to 91% (classifier) improved effective cache hit rate by 14 percentage points on top of the routing improvement itself. The two effects stack.
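The mechanism is easy to see with one cache per model. The sketch below is a toy stand-in: it uses normalized exact-match keys instead of semantic similarity, and the class and method names are hypothetical. It only illustrates why a lookup in the wrong model's cache misses even when the right model has the answer cached.

```python
from collections import defaultdict


class PerModelCache:
    """One response cache per model; exact-match keys stand in for semantic lookup."""

    def __init__(self):
        self._caches = defaultdict(dict)  # model name -> {normalized prompt: response}
        self.hits = 0
        self.lookups = 0

    @staticmethod
    def _key(prompt: str) -> str:
        return " ".join(prompt.lower().split())  # collapse case and whitespace

    def get(self, model: str, prompt: str):
        self.lookups += 1
        response = self._caches[model].get(self._key(prompt))
        if response is not None:
            self.hits += 1
        return response

    def put(self, model: str, prompt: str, response: str) -> None:
        self._caches[model][self._key(prompt)] = response

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0
```

If a prompt cached under the cheap model is later routed to the strong model, `get` returns `None` and the request pays full price, while the cheap model's entry goes unused.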

Getting started

Run your top 5 most expensive request types through a cheap model and compare the outputs. That experiment takes an afternoon and will tell you whether learned routing is worth building for your workload. If even 20% of your expensive requests could be handled by a cheaper model without quality loss, the ROI is immediate.
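The ROI check is simple arithmetic once you have an offload fraction from that experiment. A small sketch, where the request volume and per-request costs in the usage line are invented placeholders rather than real prices:

```python
def monthly_savings(expensive_requests: int, offload_fraction: float,
                    strong_cost: float, cheap_cost: float) -> float:
    """Estimate the monthly saving from moving a share of strong-model traffic
    to a cheaper model. Costs are per request, in the same currency."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    return expensive_requests * offload_fraction * (strong_cost - cheap_cost)


# Hypothetical numbers: 1M strong-model requests a month, 20% offloadable,
# $0.010 vs. $0.001 per request. The result is roughly 1800 (dollars per month).
estimate = monthly_savings(1_000_000, 0.20, 0.010, 0.001)
```

This ignores the cost of running the router itself and of any quality regressions, so treat it as an upper bound on the saving.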
