Qwen3.6-35B-A3B Claude 4.7 Opus Distill — MTP-Enabled GGUF
MTP-enabled GGUF quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, a Claude Opus 4.7 reasoning-style distillation of Qwen 3.6 35B-A3B.
What's different from the official quants
- MTP (Multi-Token Prediction) head preserved. Enables ~25% generation speedup on code workloads via `--spec-type mtp` in llama.cpp servers built from am17an's PR #22673. The official lordx64 quants don't include MTP support.
- Imatrix calibration on the Q4-class quants, using tristandruyen's calibration_data_v5_rc.txt, a MoE-aware fork of bartowski's calibration_datav3 that exercises more experts during calibration. 125 chunks at 512 tokens.
Files
| File | Quant | Size | imatrix | Use case |
|---|---|---|---|---|
| lordx64-distill-MTP-Q8_0.gguf | Q8_0 | 35 GB | No | Near-lossless reference / re-quantization base |
| lordx64-distill-MTP-Q4_K_M.gguf | Q4_K_M | 20 GB | Yes | Balanced quality/size; default pick for most users |
| lordx64-distill-MTP-IQ4_XS.gguf | IQ4_XS | 18 GB | Yes | Smallest with good quality; best for tight VRAM |
| lordx64-imatrix.dat | n/a | 187 MB | n/a | Calibration data for users producing their own quants |
A note on the Q8_0: it was quantized without imatrix because Q8_0 is high-precision enough that imatrix gains are negligible (~0.05% perplexity, effectively noise). Most public Q8_0 quants are non-imatrix for the same reason.
A note on the MTP head's eh_proj tensor: imatrix calibration runs forward passes through the base model only, so the MTP-specific tensors get no importance data. In the IQ4_XS and Q4_K_M files they are quantized with the default (non-imatrix) scheme for that quant type. In practice this slightly reduces drafter accuracy on creative content but doesn't break anything.
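As a rough aid for choosing among the files in the table above, the sketch below picks the largest GGUF that fits a memory budget and shows how it could be fetched with `huggingface_hub`. The `headroom_gb` allowance for KV cache and runtime buffers is an illustrative assumption, not a measured figure, and `pick_quant` and `download` are hypothetical helper names.

```python
# File sizes in GB, copied from the Files table.
QUANT_FILES = {
    "lordx64-distill-MTP-Q8_0.gguf": 35,
    "lordx64-distill-MTP-Q4_K_M.gguf": 20,
    "lordx64-distill-MTP-IQ4_XS.gguf": 18,
}

REPO_ID = "Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF"


def pick_quant(memory_gb, headroom_gb=4.0):
    """Return the largest file that fits in memory_gb minus headroom, or None.

    headroom_gb is an assumed allowance for KV cache and buffers; tune it
    for your context size and backend.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_FILES.items() if size <= budget]
    return max(fitting)[1] if fitting else None


def download(filename):
    """Fetch one file from the Hub (requires `pip install huggingface_hub`)."""
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename)
```

For example, `pick_quant(24)` selects the Q4_K_M file under these assumptions, and `download(pick_quant(24))` would return the local path of the cached file.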
Requirements
MTP speculative decoding requires a llama.cpp build with MTP support. Pre-merge, that means am17an's mtp-clean branch. Built and tested at commit 267f8af.
You can still run these GGUFs on stock llama.cpp without MTP; you just won't get the speedup. The MTP head sits as an extra block in the file (block 40, after the 40 base layers, which are numbered 0-39) and is ignored by builds that don't understand the qwen35moe_mtp architecture.
Usage with llama-server
llama-server \
-m lordx64-distill-MTP-Q4_K_M.gguf \
--host 0.0.0.0 --port 8086 \
-ngl 99 --parallel 1 \
--ctx-size 32768 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on --jinja \
--chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
--reasoning-budget 4096 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 \
--spec-type mtp --spec-draft-n-max 2
Flag notes:
- `--parallel 1` is required. MTP currently doesn't support multi-slot serving.
- `--spec-draft-n-max 2` is optimal for code/structured generation. For prose/creative writing, use `--spec-draft-n-max 1`: the LoRA distribution shift reduces drafter alignment on creative content, and n=1 gives a more reliable speedup there.
- `--cache-type-k q8_0 --cache-type-v q8_0` saves significant VRAM at long contexts with minimal quality cost.
- `--reasoning-budget 4096` caps the `<think>` block length. Hard problems (competition math, multi-step logic) may benefit from raising this to 16384 at the cost of latency.
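Once the server is running, it can be queried through llama-server's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch; it assumes the server is reachable on localhost at port 8086 (matching `--port` in the command above), and `build_chat_request` and `ask` are hypothetical helper names. Sampling settings are left to the server-side flags.

```python
import json
import urllib.request

# Assumes the server from the command above is listening locally on port 8086.
BASE_URL = "http://localhost:8086/v1"


def build_chat_request(prompt, max_tokens=1024):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send one prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Write a function that reverses a linked list.")` sends a single request, which fits the single-slot constraint imposed by `--parallel 1`.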
Benchmarks
Tested on AMD Radeon 780M iGPU (RDNA3, Vulkan backend, ~75 GB/s memory bandwidth):
| Config | Code tok/s | Code accept | Prose tok/s | Prose accept |
|---|---|---|---|---|
| IQ4_XS baseline (no MTP) | 27.07 | — | 26.82 | — |
| IQ4_XS + MTP n=2 | ~33.76 | 89-91% | ~26.36 | 51-65% |
Code workload sees a clean ~25% speedup. Prose workload at n=2 is roughly neutral due to lower drafter acceptance on open-ended creative content — drop to n=1 for a small positive gain on prose.
For comparison, the unmodified Qwen 3.6 35B-A3B base model with MTP (Q4_0, no imatrix) on the same hardware:
| Config | Code tok/s | Prose tok/s |
|---|---|---|
| Base baseline (no MTP) | 29.07 | 29.07 |
| Base + MTP n=2 | 35.92 | 30.15 |
The slightly lower throughput on the distill vs base model is expected:
- IQ4_XS uses lookup tables (~6% slower than Q4_0)
- The attention-only LoRA shifts the target distribution slightly, reducing drafter alignment on creative content (acceptance drops from ~67% to 51-65% on prose)
These tradeoffs are inherent to the model+quantization combination, not the build process.
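The percentages quoted in this section can be rechecked from the two tables. The sketch below only redoes that arithmetic on the reported tok/s figures (the MTP numbers for the distill are approximate in the table); `speedup_pct` is a hypothetical helper name.

```python
def speedup_pct(baseline, other):
    """Percentage change in throughput of `other` relative to `baseline`."""
    return (other / baseline - 1.0) * 100.0


# (baseline tok/s, MTP n=2 tok/s), copied from the benchmark tables.
distill = {"code": (27.07, 33.76), "prose": (26.82, 26.36)}
base = {"code": (29.07, 35.92), "prose": (29.07, 30.15)}

distill_code = speedup_pct(*distill["code"])    # about +24.7%
distill_prose = speedup_pct(*distill["prose"])  # about -1.7%, roughly neutral
base_code = speedup_pct(*base["code"])          # about +23.6%
base_prose = speedup_pct(*base["prose"])        # about +3.7%

# Baseline gap between the distill (IQ4_XS) and the base model (Q4_0), code workload.
quant_gap = speedup_pct(base["code"][0], distill["code"][0])  # about -6.9%

for name, value in [
    ("distill code", distill_code),
    ("distill prose", distill_prose),
    ("base code", base_code),
    ("base prose", base_prose),
    ("distill vs base baseline", quant_gap),
]:
    print(f"{name}: {value:+.1f}%")
```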
Limitations
- Drafter alignment on creative content. The MTP head was trained against the original Qwen attention. lordx64's distill applies an attention-only LoRA, which shifts the target distribution. Drafter acceptance is high on code (89-91%) but lower on prose (51-65%).
- Single-stream only. MTP currently requires `--parallel 1`, so no concurrent generations. Fine for personal use; not suitable for multi-user serving.
- Pre-merge dependency. Built against am17an's branch. Future llama.cpp changes may break compatibility until MTP merges into mainline. Recommend pinning your llama.cpp build to a known-working commit.
- MTP head not imatrix-calibrated. As noted above, this slightly degrades drafter precision but doesn't break inference.
Reproducibility
Build pipeline:
- Downloaded BF16 safetensors from upstream lordx64 repo (72 GB)
- Converted via am17an's `convert_hf_to_gguf.py`, which handles `Qwen3_5MoeForConditionalGeneration` natively, strips the `model.language_model.*` wrapper automatically, drops vision tower tensors, and remaps the MTP namespace
- Generated BF16 GGUF (~70 GB)
- Quantized to Q8_0 with `llama-quantize` (no imatrix needed for Q8_0)
- Ran imatrix calibration on the Q8_0 file using calibration_data_v5_rc.txt: 125 chunks at 512 tokens, partial GPU offload on RX 6700XT via Vulkan
- Quantized BF16 → Q4_K_M and BF16 → IQ4_XS using the imatrix
The imatrix file is included in this repo for users who want to run their own quants from upstream BF16.
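The build pipeline above can be sketched as a list of commands. This is an approximation under stated assumptions, not the exact invocations used: the local directory and the BF16 file name are hypothetical placeholders, the flags are the standard ones from llama.cpp's `convert_hf_to_gguf.py`, `llama-quantize`, and `llama-imatrix`, and the GPU offload setting is omitted because the original value is not given.

```python
from pathlib import Path

# Hypothetical local paths; adjust to your own checkout and download location.
HF_DIR = Path("Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled")
BF16 = Path("lordx64-distill-MTP-BF16.gguf")
Q8 = Path("lordx64-distill-MTP-Q8_0.gguf")
IMATRIX = Path("lordx64-imatrix.dat")
CALIB = Path("calibration_data_v5_rc.txt")


def pipeline_commands():
    """Return the build steps as argv lists, in the order listed above."""
    cmds = [
        # Convert the BF16 safetensors to a BF16 GGUF.
        ["python", "convert_hf_to_gguf.py", str(HF_DIR),
         "--outtype", "bf16", "--outfile", str(BF16)],
        # Q8_0 is quantized without an imatrix.
        ["llama-quantize", str(BF16), str(Q8), "Q8_0"],
        # Calibrate on the Q8_0 file: 125 chunks at 512 tokens.
        ["llama-imatrix", "-m", str(Q8), "-f", str(CALIB),
         "-o", str(IMATRIX), "-c", "512", "--chunks", "125"],
    ]
    # The Q4-class quants are produced from BF16 using the imatrix.
    for quant in ("Q4_K_M", "IQ4_XS"):
        out = f"lordx64-distill-MTP-{quant}.gguf"
        cmds.append(["llama-quantize", "--imatrix", str(IMATRIX),
                     str(BF16), out, quant])
    return cmds
```

Each entry can be passed to `subprocess.run(cmd, check=True)` from a llama.cpp build directory, or printed with `shlex.join(cmd)` to review it before running.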
License
Apache 2.0, inherited from upstream Qwen 3.6 and lordx64's distill.
Acknowledgements
- lordx64 — the underlying reasoning distillation
- am17an — MTP implementation in llama.cpp (PR #22673)
- bartowski — the original calibration_datav3 dataset
- tristandruyen — MoE-aware calibration_data_v5_rc fork
- bombdefuser-124 — Q4_0 reference GGUF used for benchmark comparison
- Qwen team — the open-weights base model
- Anthropic — Claude Opus 4.7, the teacher model lordx64 used for reasoning distillation