A routing layer in front of your models can cut run-cost 40–70% without measurable quality loss when designed honestly. The trick is to know what "without quality loss" actually means for your task.
Three routing strategies
- Static by task — Easy classifications go to the small model, generative work goes to the large one. Cheapest to implement, decent ceiling.
- Cascade — Try the small model first. Evaluate the result (cheap check or self-grade). If quality below threshold, escalate to the large model. Higher ceiling, more complex.
- Confidence-based — Have the small model emit a confidence score with the answer. Escalate when low. Requires calibration but very efficient.
What 'no quality loss' really means
Decide your acceptable regression budget before you build the router. "We can lose up to 2 points on the eval suite to halve the cost" is a real number. "It should be just as good" is not. Measure both versions against the same eval and write the trade-off down.
Knowledge check
0/1 answered1. Which routing strategy gives the best efficiency with proper calibration?
Discussion
0 commentsBe the first to start the conversation.