Qwen3.6-35B-A3B Claude 4.7 Opus Distill — MTP-Enabled GGUF

MTP-enabled GGUF quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, a Claude Opus 4.7 reasoning-style distillation of Qwen 3.6 35B-A3B.

What's different from the official quants

  • MTP (Multi-Token Prediction) head preserved. Enables ~25% generation speedup on code workloads via --spec-type mtp in llama.cpp servers built from am17an's PR #22673. The official lordx64 quants don't include MTP support.
  • Imatrix calibration on the Q4-class quants, using tristandruyen's calibration_data_v5_rc.txt — a MoE-aware fork of bartowski's calibration_datav3 that exercises more experts during calibration. 125 chunks at 512 tokens.

Files

File Quant Size imatrix Use case
lordx64-distill-MTP-Q8_0.gguf Q8_0 35 GB No Near-lossless reference / re-quantization base
lordx64-distill-MTP-Q4_K_M.gguf Q4_K_M 20 GB Yes Balanced quality/size — default pick for most users
lordx64-distill-MTP-IQ4_XS.gguf IQ4_XS 18 GB Yes Smallest with good quality — best for tight VRAM
lordx64-imatrix.dat 187 MB Calibration data for users producing their own quants

A note on the Q8_0: it was quantized without imatrix because Q8_0 is high-precision enough that imatrix gains are negligible (~0.05% perplexity, effectively noise). Most public Q8_0 quants are non-imatrix for the same reason.

A note on the MTP head's eh_proj tensor: imatrix calibration runs forward passes through the base model only, so the MTP-specific tensors don't get importance data. They fall back to default IQ4_XS / Q4_K_M for those quants. In practice this slightly reduces drafter accuracy on creative content but doesn't break anything.

Requirements

This GGUF requires a llama.cpp build with MTP support. Pre-merge, that means am17an's mtp-clean branch. Built and tested at commit 267f8af.

You can run these GGUFs on stock llama.cpp without MTP — you just won't get the speedup. The MTP head sits as an extra block in the file (block 40, after the 40 base layers) and is ignored by builds that don't understand qwen35moe_mtp architecture.

Usage with llama-server

llama-server \
  -m lordx64-distill-MTP-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8086 \
  -ngl 99 --parallel 1 \
  --ctx-size 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --jinja \
  --chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
  --reasoning-budget 4096 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 \
  --spec-type mtp --spec-draft-n-max 2

Flag notes:

  • --parallel 1 is required. MTP currently doesn't support multi-slot serving.
  • --spec-draft-n-max 2 is optimal for code/structured generation. For prose/creative writing, use --spec-draft-n-max 1 — the LoRA distribution shift reduces drafter alignment on creative content, and n=1 gives a more reliable speedup there.
  • --cache-type-k q8_0 --cache-type-v q8_0 saves significant VRAM at long contexts with minimal quality cost.
  • --reasoning-budget 4096 caps the <think> block length. Hard problems (competition math, multi-step logic) may benefit from raising this to 16384 at the cost of latency.

Benchmarks

Tested on AMD Radeon 780M iGPU (RDNA3, Vulkan backend, ~75 GB/s memory bandwidth):

Config Code tok/s Code accept Prose tok/s Prose accept
IQ4_XS baseline (no MTP) 27.07 26.82
IQ4_XS + MTP n=2 ~33.76 89-91% ~26.36 51-65%

Code workload sees a clean ~25% speedup. Prose workload at n=2 is roughly neutral due to lower drafter acceptance on open-ended creative content — drop to n=1 for a small positive gain on prose.

For comparison, the unmodified Qwen 3.6 35B-A3B base model with MTP (Q4_0, no imatrix) on the same hardware:

Config Code Prose
Base baseline (no MTP) 29.07 29.07
Base + MTP n=2 35.92 30.15

The slightly lower throughput on the distill vs base model is expected:

  • IQ4_XS uses lookup tables (~6% slower than Q4_0)
  • The attention-only LoRA shifts the target distribution slightly, reducing drafter alignment on creative content (acceptance drops from ~67% to 51-65% on prose)

These tradeoffs are inherent to the model+quantization combination, not the build process.

Limitations

  • Drafter alignment on creative content. The MTP head was trained against the original Qwen attention. lordx64's distill applies an attention-only LoRA, which shifts the target distribution. Drafter acceptance is high on code (89-91%) but lower on prose (51-65%).
  • Single-stream only. MTP currently requires --parallel 1 — no concurrent generations. Fine for personal use; not suitable for multi-user serving.
  • Pre-merge dependency. Built against am17an's branch. Future llama.cpp changes may break compatibility until MTP merges into mainline. Recommend pinning your llama.cpp build to a known-working commit.
  • MTP head not imatrix-calibrated. As noted above, this slightly degrades drafter precision but doesn't break inference.

Reproducibility

Build pipeline:

  1. Downloaded BF16 safetensors from upstream lordx64 repo (72 GB)
  2. Converted via am17an's convert_hf_to_gguf.py — handles Qwen3_5MoeForConditionalGeneration natively, strips the model.language_model.* wrapper automatically, drops vision tower tensors, remaps MTP namespace
  3. Generated BF16 GGUF (~70 GB)
  4. Quantized to Q8_0 with llama-quantize (no imatrix needed for Q8_0)
  5. Ran imatrix calibration on the Q8_0 file using calibration_data_v5_rc.txt — 125 chunks at 512 tokens, partial GPU offload on RX 6700XT via Vulkan
  6. Quantized BF16 → Q4_K_M and BF16 → IQ4_XS using the imatrix

The imatrix file is included in this repo for users who want to run their own quants from upstream BF16.

License

Apache 2.0, inherited from upstream Qwen 3.6 and lordx64's distill.

Acknowledgements

  • lordx64 — the underlying reasoning distillation
  • am17an — MTP implementation in llama.cpp (PR #22673)
  • bartowski — the original calibration_datav3 dataset
  • tristandruyen — MoE-aware calibration_data_v5_rc fork
  • bombdefuser-124 — Q4_0 reference GGUF used for benchmark comparison
  • Qwen team — the open-weights base model
  • Anthropic — Claude Opus 4.7, the teacher model lordx64 used for reasoning distillation
Downloads last month
7,646
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

4-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF