arxiv:2606.21906

Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Published on Jun 20 · Submitted by XUANMING ZHANG on Jun 23 · Qwen
Authors:

Abstract

Autoregressive generation in large language models conventionally predicts each token from the final layer. Confident Decoding instead uses an entropy-guided search to select a more reliable near-final layer, improving reasoning performance with minimal computational overhead.

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through an entropy-guided, conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and a dominant late-stage alignment perturbation, our search rule filters out this perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest that dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.
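The abstract does not spell out the exact search rule, so the following is only one plausible way to write it down, under our own assumptions. Each layer's hidden state h_ℓ is read out through the final norm and the unembedding matrix W_U (a logit-lens projection), and the search walks back from the final layer L only while every backward step lowers the entropy by at least a margin δ, never going more than K layers below L (with K < L). The window K and margin δ are hypothetical parameters, not values taken from the paper.

```latex
\begin{aligned}
p_\ell(v \mid x_{<t}) &= \operatorname{softmax}\big(W_U \, \mathrm{Norm}(h_\ell)\big)_v,
  \qquad \ell = 1, \dots, L, \\
H_\ell &= -\sum_{v \in \mathcal{V}} p_\ell(v \mid x_{<t}) \, \log p_\ell(v \mid x_{<t}), \\
\ell^\star &= \min\Big\{ \ell \in \{L-K, \dots, L\} \;:\; H_{j-1} \le H_j - \delta \ \text{ for all } j = \ell+1, \dots, L \Big\}, \\
x_t &= \arg\max_{v \in \mathcal{V}} \, p_{\ell^\star}(v \mid x_{<t}).
\end{aligned}
```

Because ℓ = L always satisfies the condition vacuously, this rule falls back to ordinary final-layer decoding whenever no earlier layer is clearly more confident, which is one way to read "conservative" in the abstract.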

Community

💡 Deeper is Not Always Better: Bypassing the "Alignment Tax" in LLMs
Standard practice assumes that the deeper a layer is in an autoregressive LLM, the more accurate its token representation becomes. In our latest collaborative research with the Qwen Team, we show that this isn't always true.
Through an information-theoretic analysis of residual streams, we expose a recurring Guess-Refine-Perturb phase structure in aligned models. While intermediate layers crystallize accurate logical and semantic reasoning, dense post-training alignment (e.g., RLHF or DPO) induces low-rank steering perturbations in the final layers. On complex scientific or mathematical problems, this imposes an "Alignment Tax": it drags refined reasoning back toward generic, high-frequency filler words.
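A common way to make a Guess-Refine-Perturb pattern visible is a logit-lens readout: project every layer's residual-stream state through the final norm and the unembedding matrix, then track the entropy of the resulting next-token distribution across depth. The NumPy sketch below assumes an RMSNorm final layer and a (vocab, d) unembedding matrix; all function names are hypothetical, and the paper's own information-theoretic analysis may use different quantities.

```python
import numpy as np


def rms_norm(h: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis (assumed final-norm type)."""
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return h / rms * weight


def entropy_from_logits(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (nats) of softmax(logits) along the last axis, computed stably."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)


def layer_entropy_trajectory(hidden: np.ndarray, norm_weight: np.ndarray,
                             w_unembed: np.ndarray) -> np.ndarray:
    """Logit-lens entropy for every layer at the current position.

    hidden:      (num_layers, d) residual-stream states
    norm_weight: (d,) final-norm scale
    w_unembed:   (vocab, d) unembedding matrix
    returns:     (num_layers,) entropy of each layer's projected next-token distribution
    """
    logits = rms_norm(hidden, norm_weight) @ w_unembed.T  # (num_layers, vocab)
    return entropy_from_logits(logits)


def describe_phases(entropies: np.ndarray) -> dict:
    """Locate the entropy valley and how far the layers after it push entropy back up."""
    valley = int(np.argmin(entropies))
    return {
        "valley_layer": valley,
        "valley_entropy": float(entropies[valley]),
        "final_entropy": float(entropies[-1]),
        # A positive value means the final layer is less confident than the valley.
        "late_rise": float(entropies[-1] - entropies[valley]),
    }
```

A positive `late_rise` (final-layer entropy above the valley) is the kind of signature one would expect from a "Perturb" phase; treat it as a heuristic indicator, not as a test taken from the paper.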
To solve this without retraining, we present Confident Decoding:

  • Entropy Valley Tracking: Uses an entropy-guided, conservative backward search to decode each token from the layer where the model is most confident (the entropy valley), before late-stage steering perturbs the prediction.
  • Universal Efficacy: Tested across dense and MoE families (Qwen3.5, Gemma-4, gpt-oss), with substantial gains on frontier benchmarks, including up to +22.4% on Omni-MATH Level-4 problems and absolute improvements of +9.4% on LiveCodeBench and +6.5% on GPQA-Diamond.
  • Production Viability: Requires zero modification to the core forward pass or KV cache, and runs natively inside high-throughput engines like vLLM with less than 2% wall-clock latency overhead.

Optimizing where to stop inside the network opens a new, vertical dimension for test-time compute (TTC).
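Every layer's hidden state is already computed during the normal forward pass, so a layer-selection step of this kind only needs extra unembedding projections for a few near-final layers and leaves the KV cache untouched. The batched NumPy sketch below takes those per-layer logits for one decoding step and picks a layer per sequence with a conservative backward search. The `window` and `margin` knobs and the exact stopping rule are illustrative assumptions, not the released implementation at the project link.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def confident_layer_decode(layer_logits: np.ndarray, window: int = 4, margin: float = 0.05):
    """Greedy next-token choice from an entropy-selected near-final layer (hypothetical rule).

    layer_logits: (num_layers, batch, vocab) logit-lens logits for the current step,
                  ordered shallow to deep; the last entry is the ordinary final layer.
    window:       maximum number of layers the search may move below the final layer.
    margin:       minimum entropy drop required for each backward step.
    returns:      (chosen_layer, tokens), each of shape (batch,)
    """
    num_layers, batch, _ = layer_logits.shape
    log_p = log_softmax(layer_logits)
    entropy = -(np.exp(log_p) * log_p).sum(axis=-1)  # (num_layers, batch)

    chosen = np.full(batch, num_layers - 1)   # every sequence starts at the final layer
    active = np.ones(batch, dtype=bool)       # sequences whose search has not stopped yet
    lowest = max(num_layers - 1 - window, 0)  # deepest layer index the search may reach
    for layer in range(num_layers - 2, lowest - 1, -1):
        # Step back only if the shallower layer is clearly more confident than the one above it.
        step = active & (entropy[layer] <= entropy[layer + 1] - margin)
        chosen[step] = layer
        active &= step  # conservative: a sequence stops at its first failed step
    # Greedy token from each sequence's chosen layer.
    tokens = layer_logits[chosen, np.arange(batch)].argmax(axis=-1)
    return chosen, tokens
```

Stopping at the first backward step that fails the margin is what keeps the search conservative: a sequence whose final layer is already confident keeps its final-layer token, even if some earlier layer outside the unbroken chain of entropy drops looks sharper still. Setting `window=0` reduces the function to ordinary greedy decoding from the final layer.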

Paper: https://arxiv.org/pdf/2606.21906
Project: https://github.com/QwenLM/Confident-Decoding

Qwen3.7-Max/Plus is already live as a closed API. Any plans for open-weight releases of the 3.7 family (like 3.6-35B-A3B / 3.6-27B alongside 3.6-Max)?

Would love to run it locally via llama.cpp / GGUF.

Absolutely will do.