Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving
Abstract
A cascaded approach for deploying large language models that balances accuracy and cost by routing queries to appropriate models based on clustering and quality estimation.
Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones. To address this challenge, we propose a two-stage cascaded solution. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model. The cost budget for this routing process is set by an interpretable hyperparameter, tuned offline. Stage 2 adds a quality estimation (QE) cascade; when an output from Stage 1 is judged low-quality, the query is escalated to a stronger model. This ensures only hard or low-confidence cases reach the expensive models. On the test datasets, the cascaded system retains 97-99% of the strongest model's accuracy while reducing Time Per Output Token (TPOT). It requires only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
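The Stage 1 idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes query embeddings and cluster centroids already exist, and it models the cost hyperparameter as a simple penalty weight `lam` on a per-model relative cost. The names `nearest_cluster`, `assign_models`, `lam`, and the toy data are hypothetical.

```python
from collections import defaultdict
from math import dist


def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to the query embedding."""
    return min(range(len(centroids)), key=lambda i: dist(embedding, centroids[i]))


def assign_models(records, costs, lam):
    """Pick one model per cluster by maximizing accuracy - lam * cost.

    records: iterable of (cluster_id, model_name, correct), correct in {0, 1}
    costs:   dict model_name -> relative cost (hypothetical units)
    lam:     assumed cost-sensitivity weight; 0 means accuracy only
    """
    hits = defaultdict(lambda: [0, 0])  # (cluster, model) -> [n_correct, n_total]
    for cluster, model, correct in records:
        hits[(cluster, model)][0] += correct
        hits[(cluster, model)][1] += 1

    best = {}  # cluster -> (model, score)
    for (cluster, model), (n_correct, n_total) in hits.items():
        score = n_correct / n_total - lam * costs[model]
        if cluster not in best or score > best[cluster][1]:
            best[cluster] = (model, score)
    return {cluster: model for cluster, (model, _) in best.items()}


# Toy task-correctness labels: cluster 0 is easy, cluster 1 is hard for "small".
records = [
    (0, "small", 1), (0, "small", 1), (0, "large", 1), (0, "large", 1),
    (1, "small", 0), (1, "small", 1), (1, "large", 1), (1, "large", 1),
]
costs = {"small": 0.1, "large": 1.0}
centroids = [(1.0, 0.0), (0.0, 1.0)]

routing = assign_models(records, costs, lam=0.2)
cluster = nearest_cluster([0.9, 0.1], centroids)
print(cluster, routing[cluster])
```

With a low `lam`, the hard cluster is assigned to the larger model; raising `lam` pushes more clusters toward the cheaper one. Only correctness labels are used, which matches the supervision the abstract describes.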
Community
If you're serving LLMs in production, you know the trade-off. Strong models are slow and expensive, while efficient models struggle with hard queries. So we built a two-stage cascaded solution: the first stage routes each query to the most cost-effective model that can handle it, and the second stage escalates only the low-confidence outputs to a stronger model.
You don't need preference data or human feedback to do this; just the task-correctness labels you already have from standard eval. We're seeing 97-99% of the strongest model's accuracy at up to 18% lower latency. The multi-LLM routing system adapts on its own when you add or swap models in the pool.
Details on the system design, routing math, and the QE cascade are in the paper.
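The escalation step described above can be sketched roughly as follows. This is an illustration under assumptions, not the paper's code: the chain starts at whichever model Stage 1 picked, the quality estimator is a hypothetical callable that returns a score in [0, 1], and `threshold` is an assumed acceptance cutoff. The stub models and `qe` function are toy stand-ins.

```python
def cascade(query, chain, estimate_quality, threshold):
    """Try models from weakest to strongest; escalate on low estimated quality.

    chain:            list of (name, generate) pairs, ordered weakest to strongest
    estimate_quality: callable (query, output) -> score in [0, 1] (hypothetical QE)
    threshold:        minimum score accepted without escalation
    """
    for name, generate in chain[:-1]:
        output = generate(query)
        if estimate_quality(query, output) >= threshold:
            return name, output
    # The last model is the fallback; its output is returned without a QE check.
    name, generate = chain[-1]
    return name, generate(query)


# Toy stand-ins for real models and a real quality estimator.
def small(query):
    return "42" if query == "6*7" else "unsure"


def large(query):
    return "answer from large"


def qe(query, output):
    return 0.1 if output == "unsure" else 0.9


chain = [("small", small), ("large", large)]
print(cascade("6*7", chain, qe, threshold=0.5))
print(cascade("hard question", chain, qe, threshold=0.5))
```

In this sketch the stronger model is only called when the cheaper output scores below the threshold, so the extra cost is paid only on the low-confidence cases.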