Instructions for using Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with libraries and local apps.

Libraries

How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with llama-cpp-python:

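# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF",
    filename="lordx64-distill-MTP-IQ4_XS.gguf",
)

# Run one chat turn and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
print(response["choices"][0]["message"]["content"])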
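Because this is a reasoning distill, the reply may carry a `<think>` block ahead of the final answer. The sketch below shows one way to separate the two. It assumes the reasoning arrives inline in the message content, which depends on the chat template and build in use; `split_reasoning`, `reply_text`, and the hand-written `sample` response are illustrative names, not part of llama-cpp-python.

```python
import re


def reply_text(response):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"]


def split_reasoning(text):
    """Split a reply into (reasoning, answer) around the first <think> block.

    If no <think>...</think> block is present, reasoning is an empty string.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Hand-written sample response for illustration only (not real model output).
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "<think>France -> Paris.</think>\nThe capital of France is Paris.",
            }
        }
    ]
}

reasoning, answer = split_reasoning(reply_text(sample))
print(answer)
```

With a live model, pass the dict returned by `llm.create_chat_completion(...)` to `reply_text` in place of `sample`.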

Local Apps

llama.cpp

How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

vLLM

How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
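The same request can be sent from Python using only the standard library. This is a minimal sketch mirroring the curl call above; `build_chat_request` and `send` are illustrative helper names, and the default `base_url` assumes the server is listening on port 8000 as in the curl example.

```python
import json
import urllib.request

MODEL_ID = "Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL_ID):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
print(request.full_url)

# Uncomment once a server is running:
# print(send(request))
```

The llama-server examples elsewhere on this page use port 8080 or 8086, so pass a matching `base_url` when targeting one of those servers.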

Ollama
How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with Ollama:
```
ollama run hf.co/Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M
```

Unsloth Studio

How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF to start chatting

Pi

How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M
```

Lemonade

How to use Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.6-35B-A3B Claude 4.7 Opus Distill — MTP-Enabled GGUF

MTP-enabled GGUF quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, a Claude Opus 4.7 reasoning-style distillation of Qwen 3.6 35B-A3B.

What's different from the official quants

MTP (Multi-Token Prediction) head preserved. Enables ~25% generation speedup on code workloads via --spec-type mtp in llama.cpp servers built from am17an's PR #22673. The official lordx64 quants don't include MTP support.
Imatrix calibration on the Q4-class quants, using tristandruyen's calibration_data_v5_rc.txt — a MoE-aware fork of bartowski's calibration_datav3 that exercises more experts during calibration. 125 chunks at 512 tokens.

Files

| File | Quant | Size | imatrix | Use case |
| --- | --- | --- | --- | --- |
| `lordx64-distill-MTP-Q8_0.gguf` | Q8_0 | 35 GB | No | Near-lossless reference / re-quantization base |
| `lordx64-distill-MTP-Q4_K_M.gguf` | Q4_K_M | 20 GB | Yes | Balanced quality/size; default pick for most users |
| `lordx64-distill-MTP-IQ4_XS.gguf` | IQ4_XS | 18 GB | Yes | Smallest with good quality; best for tight VRAM |
| `lordx64-imatrix.dat` | n/a | 187 MB | n/a | Calibration data for users producing their own quants |

A note on the Q8_0: it was quantized without imatrix because Q8_0 is high-precision enough that imatrix gains are negligible (~0.05% perplexity, effectively noise). Most public Q8_0 quants are non-imatrix for the same reason.

A note on the MTP head's eh_proj tensor: imatrix calibration runs forward passes through the base model only, so the MTP-specific tensors get no importance data. In the IQ4_XS and Q4_K_M files they are therefore quantized with the default (non-imatrix) scheme for that type. In practice this slightly reduces drafter accuracy on creative content but doesn't break anything.
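As a rough aid for choosing among the three model files, the sketch below picks the largest file that fits a given memory budget. File sizes come from the table above; the `headroom_gb` default is a hypothetical allowance for KV cache and runtime buffers, not a measured figure, and `pick_quant` is an illustrative name.

```python
# (filename, file size in GB) from the Files table, largest first.
QUANTS = [
    ("lordx64-distill-MTP-Q8_0.gguf", 35),
    ("lordx64-distill-MTP-Q4_K_M.gguf", 20),
    ("lordx64-distill-MTP-IQ4_XS.gguf", 18),
]


def pick_quant(memory_gb, headroom_gb=4.0):
    """Return the largest file that fits in memory_gb minus headroom_gb.

    headroom_gb is an assumed allowance for KV cache and buffers; tune it
    for your context size. Returns None if no file fits.
    """
    usable = memory_gb - headroom_gb
    for name, size_gb in QUANTS:
        if size_gb <= usable:
            return name
    return None


for budget in (16, 24, 48):
    print(budget, "GB ->", pick_quant(budget))
```

File size is only a proxy for memory use; long contexts or an unquantized KV cache need more headroom than the default assumed here.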

Requirements

Using the MTP head requires a llama.cpp build with MTP support. Until the PR is merged, that means am17an's mtp-clean branch. These files were built and tested at commit 267f8af.

The GGUFs also load on stock llama.cpp without MTP; you just won't get the speedup. The MTP head is stored as an extra block in the file (block 40, after the 40 base layers, which are blocks 0-39) and is ignored by builds that don't understand the qwen35moe_mtp architecture.

Usage with llama-server

llama-server \
  -m lordx64-distill-MTP-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8086 \
  -ngl 99 --parallel 1 \
  --ctx-size 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --jinja \
  --chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}' \
  --reasoning-budget 4096 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 \
  --spec-type mtp --spec-draft-n-max 2

Flag notes:

--parallel 1 is required. MTP currently doesn't support multi-slot serving.
--spec-draft-n-max 2 is optimal for code/structured generation. For prose/creative writing, use --spec-draft-n-max 1 — the LoRA distribution shift reduces drafter alignment on creative content, and n=1 gives a more reliable speedup there.
--cache-type-k q8_0 --cache-type-v q8_0 saves significant VRAM at long contexts with minimal quality cost.
--reasoning-budget 4096 caps the <think> block length. Hard problems (competition math, multi-step logic) may benefit from raising this to 16384 at the cost of latency.
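The flag notes above can be folded into a small launcher helper. This sketch only reassembles the flags shown in the command above and switches `--spec-draft-n-max` by workload; `llama_server_args` and its parameters are illustrative names, and the flags themselves assume a llama.cpp build from the MTP branch.

```python
import shlex


def llama_server_args(model_path, workload="code", port=8086,
                      ctx_size=32768, reasoning_budget=4096):
    """Build the llama-server command line from this card for a given workload.

    workload: "code" uses a draft length of 2, "prose" uses 1, following
    the flag notes. --parallel 1 is always set because MTP is single-slot.
    """
    if workload not in ("code", "prose"):
        raise ValueError("workload must be 'code' or 'prose'")
    draft_n_max = 2 if workload == "code" else 1
    return [
        "llama-server",
        "-m", model_path,
        "--host", "0.0.0.0", "--port", str(port),
        "-ngl", "99", "--parallel", "1",
        "--ctx-size", str(ctx_size),
        "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
        "--flash-attn", "on", "--jinja",
        "--chat-template-kwargs",
        '{"enable_thinking": true, "preserve_thinking": true}',
        "--reasoning-budget", str(reasoning_budget),
        "--temp", "0.6", "--top-k", "20", "--top-p", "0.95", "--min-p", "0",
        "--spec-type", "mtp", "--spec-draft-n-max", str(draft_n_max),
    ]


args = llama_server_args("lordx64-distill-MTP-Q4_K_M.gguf", workload="prose")
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run(args)` to start the server; `shlex.join` is used here only to print a copy-pasteable command.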

Benchmarks

Tested on AMD Radeon 780M iGPU (RDNA3, Vulkan backend, ~75 GB/s memory bandwidth):

| Config | Code tok/s | Code accept | Prose tok/s | Prose accept |
| --- | --- | --- | --- | --- |
| IQ4_XS baseline (no MTP) | 27.07 | n/a | 26.82 | n/a |
| IQ4_XS + MTP n=2 | ~33.76 | 89-91% | ~26.36 | 51-65% |

Code workload sees a clean ~25% speedup (27.07 → 33.76 tok/s). Prose at n=2 is roughly neutral (26.82 → 26.36 tok/s, about 2% slower) because drafter acceptance is lower on open-ended creative content; drop to n=1 for a small positive gain on prose.

For comparison, the unmodified Qwen 3.6 35B-A3B base model with MTP (Q4_0, no imatrix) on the same hardware:

| Config | Code tok/s | Prose tok/s |
| --- | --- | --- |
| Base baseline (no MTP) | 29.07 | 29.07 |
| Base + MTP n=2 | 35.92 | 30.15 |
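The relative speedups implied by the two tables above can be checked with a few lines of arithmetic. The throughput figures are copied from the tables (the MTP figures for the distill are approximate in the source); `speedup_percent` is an illustrative helper name.

```python
def speedup_percent(baseline_tok_s, mtp_tok_s):
    """Relative throughput change of MTP over the baseline, in percent."""
    return (mtp_tok_s / baseline_tok_s - 1.0) * 100.0


# (baseline tok/s, MTP n=2 tok/s) taken from the benchmark tables.
results = {
    "distill IQ4_XS, code": (27.07, 33.76),
    "distill IQ4_XS, prose": (26.82, 26.36),
    "base Q4_0, code": (29.07, 35.92),
    "base Q4_0, prose": (29.07, 30.15),
}

for label, (baseline, mtp) in results.items():
    print(f"{label}: {speedup_percent(baseline, mtp):+.1f}%")
```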

The slightly lower throughput on the distill vs base model is expected:

IQ4_XS uses lookup tables (~6% slower than Q4_0)
The attention-only LoRA shifts the target distribution slightly, reducing drafter alignment on creative content (acceptance drops from ~67% to 51-65% on prose)

These tradeoffs are inherent to the model+quantization combination, not the build process.

Limitations

Drafter alignment on creative content. The MTP head was trained against the original Qwen attention. lordx64's distill applies an attention-only LoRA, which shifts the target distribution. Drafter acceptance is high on code (89-91%) but lower on prose (51-65%).
Single-stream only. MTP currently requires --parallel 1 — no concurrent generations. Fine for personal use; not suitable for multi-user serving.
Pre-merge dependency. Built against am17an's branch. Future llama.cpp changes may break compatibility until MTP merges into mainline. Recommend pinning your llama.cpp build to a known-working commit.
MTP head not imatrix-calibrated. As noted above, this slightly degrades drafter precision but doesn't break inference.
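To see why acceptance rate drives the choice of draft length, here is a deliberately simplified toy model, not a description of how llama.cpp schedules MTP. It assumes each drafted token is accepted independently with the same probability, and that every drafted token adds a fixed relative overhead to a decoding step. The `overhead_per_draft` value is hypothetical and chosen only for illustration; it was not measured.

```python
def expected_tokens_per_step(accept_prob, draft_n):
    """Expected tokens emitted per verification step.

    Assumes independent acceptance: the step always yields one token from
    the main model, plus draft k only if drafts 1..k were all accepted.
    """
    return sum(accept_prob ** k for k in range(draft_n + 1))


def toy_speedup(accept_prob, draft_n, overhead_per_draft):
    """Toy speedup: tokens per step divided by relative cost per step."""
    step_cost = 1.0 + draft_n * overhead_per_draft
    return expected_tokens_per_step(accept_prob, draft_n) / step_cost


OVERHEAD = 0.5  # hypothetical relative cost per drafted token, not measured

# Acceptance values are rough midpoints of the ranges reported on this card.
for label, accept in (("code", 0.90), ("prose", 0.58)):
    for n in (1, 2):
        print(f"{label}, n={n}: {toy_speedup(accept, n, OVERHEAD):.2f}x")
```

Under these assumptions the longer draft pays off only when acceptance is high, which matches the direction of the recommendation to use n=2 for code and n=1 for prose. The absolute numbers depend entirely on the assumed overhead.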

Reproducibility

Build pipeline:

Downloaded BF16 safetensors from upstream lordx64 repo (72 GB)
Converted via am17an's convert_hf_to_gguf.py — handles Qwen3_5MoeForConditionalGeneration natively, strips the model.language_model.* wrapper automatically, drops vision tower tensors, remaps MTP namespace
Generated BF16 GGUF (~70 GB)
Quantized to Q8_0 with llama-quantize (no imatrix needed for Q8_0)
Ran imatrix calibration on the Q8_0 file using calibration_data_v5_rc.txt — 125 chunks at 512 tokens, partial GPU offload on RX 6700XT via Vulkan
Quantized BF16 → Q4_K_M and BF16 → IQ4_XS using the imatrix

The imatrix file is included in this repo for users who want to run their own quants from upstream BF16.
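The pipeline above can be written out as a command list. This is a sketch under stated assumptions: the source directory and the BF16 file name are hypothetical placeholders, and the flags follow the usual `convert_hf_to_gguf.py`, `llama-imatrix`, and `llama-quantize` usage in llama.cpp, so check them against `--help` on the branch you build. GPU offload options are omitted.

```python
import shlex

SRC_DIR = "Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"  # hypothetical local checkout
BF16 = "lordx64-distill-MTP-BF16.gguf"                           # hypothetical file name
Q8 = "lordx64-distill-MTP-Q8_0.gguf"
IMATRIX = "lordx64-imatrix.dat"
CALIBRATION = "calibration_data_v5_rc.txt"


def pipeline_commands():
    """Return the build steps as argument lists, in execution order."""
    commands = [
        # 1. Convert BF16 safetensors to a BF16 GGUF.
        ["python", "convert_hf_to_gguf.py", SRC_DIR,
         "--outfile", BF16, "--outtype", "bf16"],
        # 2. Quantize to Q8_0 without an imatrix.
        ["llama-quantize", BF16, Q8, "Q8_0"],
        # 3. Compute the imatrix on the Q8_0 file: 125 chunks of 512 tokens.
        ["llama-imatrix", "-m", Q8, "-f", CALIBRATION, "-o", IMATRIX,
         "--chunks", "125", "-c", "512"],
    ]
    # 4. Quantize BF16 to the Q4-class types using the imatrix.
    for quant in ("Q4_K_M", "IQ4_XS"):
        commands.append(
            ["llama-quantize", "--imatrix", IMATRIX,
             BF16, f"lordx64-distill-MTP-{quant}.gguf", quant]
        )
    return commands


for command in pipeline_commands():
    print(shlex.join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` to execute the step; printing first makes it easy to review the commands before running a multi-hour job.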

License

Apache 2.0, inherited from upstream Qwen 3.6 and lordx64's distill.

Acknowledgements

lordx64 — the underlying reasoning distillation
am17an — MTP implementation in llama.cpp (PR #22673)
bartowski — the original calibration_datav3 dataset
tristandruyen — MoE-aware calibration_data_v5_rc fork
bombdefuser-124 — Q4_0 reference GGUF used for benchmark comparison
Qwen team — the open-weights base model
Anthropic — Claude Opus 4.7, the teacher model lordx64 used for reasoning distillation

Format: GGUF
Model size: 36B params
Architecture: qwen35moe
Hardware compatibility: 4-bit, 8-bit

Model tree for Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF

Base model: Qwen/Qwen3.6-35B-A3B
Adapter: lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
Quantized (37): this model