Picking the right model per request is fundamentally a classification task, and the training signal is already sitting in your logs. A survey of approaches to learned routing and why response feedback tends to beat hand-tuned rules.
Routing is already a classification problem
Every LLM request can be answered by one of several models in your pool. Some requests need the strongest model available. Most do not. The question is which requests fall into which bucket, and how to decide at inference time without running the strong model first.
This is a classification problem with a known feature space (prompt characteristics) and a recoverable training signal (response quality feedback from your logs). The signal is not perfect, but it already exists, and recovering it costs little compared with building a labeled dataset from scratch.
What makes a good routing feature
The features you want are computable from the prompt itself in under 1 ms, without running any model. Anything that requires a forward pass of a large model is self-defeating: you pay the inference cost you were trying to avoid before the routing decision is even made.
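As a rough illustration of what "computable from the prompt itself" can look like, the sketch below extracts a handful of features using only string scans. The feature names and the `REASONING_HINTS` keyword list are assumptions made for this example, not a recommended feature set.

```python
# Hypothetical keyword list; tune it against your own traffic.
REASONING_HINTS = ("prove", "derive", "step by step", "why", "compare")

# Three backticks, built this way so the example does not nest code fences.
FENCE = "`" * 3


def extract_features(prompt: str) -> dict[str, float]:
    """Cheap prompt features: no model calls, only string operations."""
    lowered = prompt.lower()
    return {
        "char_len": float(len(prompt)),
        "word_count": float(len(prompt.split())),
        "line_count": float(prompt.count("\n") + 1),
        "has_code_fence": float(FENCE in prompt),
        "question_marks": float(prompt.count("?")),
        # Share of characters that are digits (0.0 for an empty prompt).
        "digit_ratio": sum(c.isdigit() for c in prompt) / max(len(prompt), 1),
        # Number of distinct hint phrases that appear in the prompt.
        "reasoning_hints": float(sum(h in lowered for h in REASONING_HINTS)),
    }
```

Each feature is a single pass over the string, so the cost scales with prompt length rather than with any model size.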
[Figure: routing features, compute cost vs. signal quality]
In practice, a lightweight classifier trained on a few thousand labeled examples substantially outperforms rule-based routing (91% versus 72% routing accuracy in our measurements). The rules approach captures obvious cases (code fences mean route to a code-specialized model) but misses the long tail.
Where the training signal comes from
You do not need a labeled dataset to start. You have a production log.
Every request that went to your strongest model has a response. Some of those responses would have been identical or equivalent if a cheaper model had handled them. You can identify which ones by running a sample of historical traffic through a cheaper model and comparing outputs.
    # Sketch: log, run_cheap_model, evaluate, and threshold are placeholders for your own stack.
    labeled = []
    for request in historical_requests:
        strong_response = log[request]["response"]
        cheap_response = run_cheap_model(request)
        quality_score = evaluate(strong_response, cheap_response)
        # quality_score >= threshold -> label as "cheap model ok"
        # quality_score < threshold -> label as "strong model required"
        label = "cheap model ok" if quality_score >= threshold else "strong model required"
        labeled.append((request, label))
    # Now you have a labeled dataset. Train a classifier on prompt features.

The evaluation step is the bottleneck. For tasks with a ground truth (code that compiles, SQL that runs, factual questions with verifiable answers), you can automate quality scoring. For open-ended generation, you need an LLM judge or human evaluation on a sample.
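For the ground-truth case, one way to automate the scoring is to execute both outputs and compare the results. The sketch below does this for SQL with Python's built-in `sqlite3` module. The function name `sql_outputs_match` and the idea of a per-task `setup_sql` fixture are assumptions for illustration, and the comparison assumes read-only `SELECT` queries where row order does not matter.

```python
import sqlite3


def sql_outputs_match(strong_sql: str, cheap_sql: str, setup_sql: str) -> float:
    """Return 1.0 if both queries yield the same rows on a scratch database, else 0.0."""
    conn = sqlite3.connect(":memory:")
    try:
        # Build the fixture schema and rows in a throwaway in-memory database.
        conn.executescript(setup_sql)
        try:
            strong_rows = conn.execute(strong_sql).fetchall()
            cheap_rows = conn.execute(cheap_sql).fetchall()
        except sqlite3.Error:
            # If either query fails to run, score the pair as a mismatch.
            return 0.0
        # Sort by repr so row order is ignored and NULLs do not break comparison.
        same = sorted(strong_rows, key=repr) == sorted(cheap_rows, key=repr)
        return 1.0 if same else 0.0
    finally:
        conn.close()
```

A score like this can stand in for `evaluate` on SQL-generation traffic. Other ground-truth tasks would need their own checker, such as compiling code or matching a verified answer.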
Three routing strategies compared
Hand-tuned rules
Fast to implement, fragile at the edges. Rules cover 60-70% of traffic well and mishandle the rest. They do not improve over time. Use this if you are starting from scratch and need something running today.
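A minimal sketch of what such a rule set might look like follows. The tier names (`"code"`, `"strong"`, `"cheap"`), the 400-word cutoff, and the keyword list are all made up for illustration.

```python
# Three backticks, built this way so the example does not nest code fences.
FENCE = "`" * 3


def route_by_rules(prompt: str) -> str:
    """Return a hypothetical model tier using fixed, hand-written rules."""
    lowered = prompt.lower()
    if FENCE in prompt:
        return "code"    # code fences -> code-specialized model
    if len(prompt.split()) > 400:
        return "strong"  # assumed cutoff: long prompts go to the strong model
    if any(hint in lowered for hint in ("prove", "step by step")):
        return "strong"  # naive substring match on reasoning keywords
    return "cheap"
```

The substring check shows the fragility directly: "Please improve this sentence." contains "prove" and gets sent to the strong model. Each such edge case needs another hand-written exception.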
Trained classifier
A 3-layer MLP or a logistic regression model on engineered features, trained on a few thousand labeled examples, typically reaches 88-92% routing accuracy on held-out traffic. This is the practical optimum for most deployments. Training takes minutes, inference adds under 1 ms, and accuracy improves as you add more labeled data.
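To make the logistic regression option concrete, here is a small sketch written in plain Python so it runs without dependencies. In practice you would more likely reach for an ML library. The two features and the six training rows are toy, made-up values, and the helper names `train_logreg` and `needs_strong` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    # Clamp the input so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on log loss. y[i] == 1 means 'strong model required'."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def needs_strong(x, w, b, threshold=0.5) -> bool:
    """Route to the strong model when the predicted probability crosses the threshold."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold


# Toy data (made up): [scaled prompt length, has reasoning hint]
X = [[0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [0.7, 1.0], [0.8, 1.0], [0.9, 1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
```

The `threshold` argument is where you would trade cost against quality: raising it sends fewer requests to the strong model, at the risk of more under-served prompts.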
Contextual bandit
A bandit continuously updates routing decisions based on reward signal (quality feedback). It handles distribution shift automatically and does not require an offline labeling step. The cost: it needs live feedback in the loop, adds implementation complexity, and takes several thousand requests to warm up.
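The loop can be sketched with epsilon-greedy, one of the simplest bandit policies, applied per discrete context bucket. This is an illustrative stand-in rather than a production design: the class name `EpsilonGreedyRouter`, the context labels, and the reward values (meant as quality minus cost) are all made up.

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Per-context epsilon-greedy bandit over a fixed set of models (arms)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> number of pulls
        self.values = defaultdict(float)  # (context, arm) -> mean observed reward

    def best_arm(self, context):
        return max(self.arms, key=lambda a: self.values[(context, a)])

    def choose(self, context):
        # Explore with probability epsilon, otherwise exploit the current best arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return self.best_arm(context)

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # Incremental mean: no need to store the full reward history.
        self.values[key] += (reward - self.values[key]) / self.counts[key]


# Made-up rewards (quality minus cost), purely for illustration.
REWARDS = {("simple", "cheap"): 0.9, ("simple", "strong"): 0.5,
           ("hard", "cheap"): 0.1, ("hard", "strong"): 0.5}

router = EpsilonGreedyRouter(arms=["cheap", "strong"])
for i in range(2000):
    context = "simple" if i % 2 == 0 else "hard"
    arm = router.choose(context)
    router.update(context, arm, REWARDS[(context, arm)])
```

In this simulation the router keeps sending "hard" prompts to the cheap model until exploration happens to try the strong one, which is the warm-up cost in miniature. A real deployment would replace the lookup table with live quality feedback and would probably derive the context from prompt features rather than a fixed label.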
[Figure: routing strategy performance at different traffic levels]
The bandit starts slower because it needs to explore. By 100K requests it matches the trained classifier and handles distribution shifts the classifier would miss. For high-volume, long-running deployments, it is the right choice. For lower-volume or shorter-lived deployments, a trained classifier is better.
The compounding effect
Routing accuracy compounds with caching. When you route a request to the right model, you also route it to the right model's semantic cache. A request misrouted to the wrong model gets a cache lookup against the wrong model's history, reducing cache hit rate for both models.
In our measurements, improving routing accuracy from 72% (rules) to 91% (classifier) improved effective cache hit rate by 14 percentage points on top of the routing improvement itself. The two effects stack.
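The mechanism is easy to see in a toy version. The sketch below assumes caches are partitioned by model and uses exact matching on a normalized prompt as a stand-in for semantic similarity. The class `PerModelCache` and the model names are hypothetical.

```python
class PerModelCache:
    """Exact-match stand-in for a semantic cache, partitioned by model."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.lookups = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Crude normalization; a real semantic cache would compare embeddings.
        return " ".join(prompt.lower().split())

    def get(self, model: str, prompt: str):
        self.lookups += 1
        value = self._store.get((model, self._key(prompt)))
        if value is not None:
            self.hits += 1
        return value

    def put(self, model: str, prompt: str, response: str) -> None:
        self._store[(model, self._key(prompt))] = response


cache = PerModelCache()
cache.put("cheap", "What is the capital of France?", "Paris")

# Routed to the model that answered before: the cached response is found.
correct = cache.get("cheap", "what is the capital of   france?")
# Misrouted: the same prompt looks in the other model's partition and misses.
misrouted = cache.get("strong", "What is the capital of France?")
```

The misrouted lookup misses even though an equivalent answer exists, and whatever the strong model then returns lands in a partition that correctly routed repeats will never consult.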