arxiv:2606.27457

Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving

Published on Jun 25

Submitted by Yasmin Moslem on Jun 29


Authors:

Yasmin Moslem

Abstract

A cascaded approach for deploying large language models that balances accuracy and cost by routing queries to appropriate models based on clustering and quality estimation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones. To address this challenge, we propose a two-stage cascaded solution. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model. The cost budget for this routing process is set by an interpretable hyperparameter, tuned offline. Stage 2 adds a quality estimation (QE) cascade; when an output from Stage 1 is judged low-quality, the query is escalated to a stronger model. This ensures only hard or low-confidence cases reach the expensive models. On the test datasets, the cascaded system retains 97-99% of the strongest model's accuracy while reducing Time Per Output Token (TPOT). It requires only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
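As a rough illustration of the Stage 1 idea, the sketch below assigns each cluster to the model with the best accuracy-minus-cost score, using only task-correctness labels from an offline evaluation. This page does not state the paper's actual objective, so the scoring rule, the function name `assign_models`, the weight `lam` (standing in for the interpretable cost hyperparameter), and the relative cost units are all assumptions. Cluster IDs are assumed to be computed beforehand by some clustering step that is not shown.

```python
from collections import defaultdict


def assign_models(records, cost, lam):
    """Pick one model per cluster by maximising accuracy - lam * cost.

    records: iterable of (cluster_id, model_name, is_correct) tuples
             taken from an offline evaluation (hypothetical layout).
    cost:    dict mapping model_name to a relative cost (hypothetical units).
    lam:     cost-sensitivity weight; 0 ignores cost entirely.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for cluster_id, model, is_correct in records:
        hits[(cluster_id, model)] += int(is_correct)
        totals[(cluster_id, model)] += 1

    best = {}  # cluster_id -> (score, model_name)
    for (cluster_id, model), n in totals.items():
        # Per-cluster accuracy penalised by the model's cost.
        score = hits[(cluster_id, model)] / n - lam * cost[model]
        if cluster_id not in best or score > best[cluster_id][0]:
            best[cluster_id] = (score, model)
    return {cluster_id: model for cluster_id, (_, model) in best.items()}
```

With a small `lam`, a cluster where only the expensive model is reliably correct stays on that model; raising `lam` pushes more clusters toward cheaper models, which is one way a single tunable value could set the cost budget.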


Community

ymoslem

Paper author, Paper submitter, 1 day ago

If you're serving LLMs in production, you know the trade-off. Strong models are slow and expensive, while efficient models struggle with hard queries. So we built a two-stage cascaded solution: the first stage routes each query to the most cost-effective model that can handle it, and the second stage escalates only the low-confidence outputs to a stronger model.
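The escalation step described here could be sketched roughly as follows. This is not the paper's implementation: the function name `answer_with_escalation`, the callable signatures, and the single fixed `threshold` on a quality score are assumptions made for illustration, and the quality estimator is left as a pluggable function.

```python
from typing import Callable, Tuple


def answer_with_escalation(
    query: str,
    routed_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],
    threshold: float,
) -> Tuple[str, bool]:
    """Return (output, escalated) for one query.

    routed_model:     the model chosen by the first stage (hypothetical callable).
    strong_model:     the fallback model used on escalation.
    estimate_quality: scores (query, output); higher means more trustworthy.
    threshold:        outputs scoring below this value are escalated.
    """
    draft = routed_model(query)
    if estimate_quality(query, draft) >= threshold:
        # The first-stage output is judged good enough; no extra call is made.
        return draft, False
    # Low estimated quality: pay for the stronger model on this query only.
    return strong_model(query), True
```

Because the strong model is called only when the estimate falls below the threshold, the extra cost scales with the share of escalated queries rather than with total traffic.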

You don't need preference data or human feedback to do this; just the task-correctness labels you already have from standard eval. We're seeing 97-99% of the strongest model's accuracy at up to 18% lower latency. The multi-LLM routing system adapts on its own when you add or swap models in the pool.

Details on the system design, routing math, and the QE cascade are in the paper.



