Instructions to use stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP
Run Hermes
hermes
- MLX LM
How to use stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server (mlx_lm.server listens on port 8080 unless --port is given)
mlx_lm.server --model "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
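The same request can be sent from Python using only the standard library. This is a minimal sketch, not part of the original card: it assumes the server is reachable on port 8080 (the port used in the Pi and Hermes configs above; change `BASE_URL` if yours differs) and that the reply follows the usual OpenAI chat-completions shape. The helper names `build_chat_payload` and `chat` are illustrative.

```python
import json
import urllib.request

MODEL_ID = "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"
# Assumption: local server on port 8080; adjust if you started it with --port.
BASE_URL = "http://localhost:8080/v1"


def build_chat_payload(user_text: str, max_tokens: int = 512) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
        # Reasoning models can emit long traces, so cap the output explicitly.
        "max_tokens": max_tokens,
    }


def chat(user_text: str, max_tokens: int = 512, timeout: float = 300.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(user_text, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.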
Qwen3.6-35B-A3B Claude 4.7 Opus Reasoning Distilled - MLX oQ4 MTP
MLX/oMLX 4-bit conversion of r3lax/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, with Qwen MTP tensors preserved and runtime-tested in oMLX.
This is not a new training run or fine-tune. Weights were only converted/quantized for local MLX/oMLX inference.
Quick Facts
- Architecture: Qwen3.6 35B-A3B MoE, roughly 3B active parameters per token.
- Quantization: oQ4-style MLX 4-bit, group size 64.
- MTP: preserved and verified in oMLX.
- Test hardware: Apple Silicon M5 Pro with 48GB unified memory.
- Runtime tested with oMLX native MTP enabled.
MTP Verification
MTP config: mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=false
MTP tensor entries: 42
MTP fusion projection: language_model.mtp.fc.weight present and full precision
Runtime: oMLX logged native MTP path activation during smoke tests
MTP support is runtime-specific. These tensors are preserved, but non-oMLX runtimes may ignore them.
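One way to check that MTP tensors are present in a downloaded copy is to scan the checkpoint's tensor index. The sketch below is an assumption-laden illustration, not the procedure used for this card: it assumes the repository ships a standard sharded-safetensors index file (`model.safetensors.index.json` with a `weight_map` key) and that MTP tensors carry an `mtp` path component, as in `language_model.mtp.fc.weight`. The count it reports may differ from the 42 entries listed above if entries were counted differently (for example, with or without quantization scales).

```python
import json
from pathlib import Path


def load_weight_map(index_path: str) -> dict:
    """Read the tensor-name -> shard-file map from a safetensors index file."""
    with Path(index_path).open("r", encoding="utf-8") as handle:
        return json.load(handle)["weight_map"]


def mtp_tensor_names(weight_map: dict) -> list:
    """Return the sorted tensor names that belong to the MTP head.

    Assumption: MTP tensors have "mtp" as one dot-separated path component.
    """
    return sorted(name for name in weight_map if "mtp" in name.split("."))


def has_fusion_projection(weight_map: dict) -> bool:
    """Check for the fusion projection tensor named in the verification notes."""
    return "language_model.mtp.fc.weight" in weight_map
```

Typical use would be `names = mtp_tensor_names(load_weight_map("<model dir>/model.safetensors.index.json"))` followed by `len(names)`.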
Local Speed Smoke Test
Measured on an M5 Pro Mac with 48GB unified memory using oMLX.
MTP on average: ~89.9 tok/s
MTP off average: ~86.0 tok/s
Speed lift: ~+4.6% with native MTP enabled
Per-prompt MTP-on smoke results:
count120: 160 tokens, 1.726s, 92.69 tok/s, MTP accept 79/79 and 79/79
rain180: 220 tokens, 2.465s, 89.26 tok/s, MTP accept 96/123 and 98/121
jsoncities: 128 tokens, 1.457s, 87.86 tok/s, MTP accept 64/65 and 64/64
These are local smoke numbers, not a universal benchmark. Prompt, cache state, batching, oMLX version, and hardware will change results.
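The summary figures can be re-derived from the per-prompt numbers. The sketch below only redoes the arithmetic on the values reported above; it reads each `a/b` pair as accepted/proposed draft tokens, which is an interpretation rather than something the card states.

```python
# name -> (tokens, seconds, reported tok/s, [(accepted, proposed), ...])
RESULTS = {
    "count120": (160, 1.726, 92.69, [(79, 79), (79, 79)]),
    "rain180": (220, 2.465, 89.26, [(96, 123), (98, 121)]),
    "jsoncities": (128, 1.457, 87.86, [(64, 65), (64, 64)]),
}
MTP_OFF_AVG = 86.0  # reported MTP-off average, tok/s


def mean_reported_speed(results: dict) -> float:
    """Average of the reported per-prompt tok/s values."""
    return sum(row[2] for row in results.values()) / len(results)


def recomputed_speeds(results: dict) -> dict:
    """tokens / seconds for each prompt, from the rounded figures."""
    return {name: row[0] / row[1] for name, row in results.items()}


def speed_lift_percent(mtp_on: float, mtp_off: float) -> float:
    """Relative speed-up of MTP-on over MTP-off, in percent."""
    return (mtp_on / mtp_off - 1.0) * 100.0


def acceptance_rates(results: dict) -> dict:
    """Accepted / proposed for each reported pair."""
    return {name: [a / p for a, p in row[3]] for name, row in results.items()}


mtp_on_avg = mean_reported_speed(RESULTS)
print(f"MTP-on mean: {mtp_on_avg:.2f} tok/s")
print(f"Lift: {speed_lift_percent(mtp_on_avg, MTP_OFF_AVG):+.1f}%")
for name, rates in acceptance_rates(RESULTS).items():
    print(name, [f"{rate:.1%}" for rate in rates])
```

The mean of the three reported speeds is 89.94 tok/s, which gives a lift of about +4.6% over 86.0 tok/s. Acceptance is 100% on count120, roughly 78% and 81% on rain180, and about 98.5% and 100% on jsoncities. Recomputing tokens / seconds from the rounded figures agrees with the reported tok/s to within about 0.02.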
Usage Notes
Place the model under your oMLX model directory and enable native MTP:
~/.omlx/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP
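One way to get the files into that directory is `snapshot_download` from `huggingface_hub`. This is a hedged sketch: it assumes `huggingface_hub` is installed and that oMLX picks up any model folder placed under `~/.omlx/models`; the helper names are illustrative.

```python
from pathlib import Path

REPO_ID = "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"


def omlx_model_dir(models_root: str = "~/.omlx/models") -> Path:
    """Return the target directory for this model under the oMLX models root."""
    model_name = REPO_ID.split("/", 1)[1]
    return Path(models_root).expanduser() / model_name


def download_model(models_root: str = "~/.omlx/models") -> Path:
    """Download the repository snapshot into the oMLX model directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = omlx_model_dir(models_root)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_model()` fetches the full 4-bit checkpoint, so make sure there is enough free disk space first.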
Recommended oMLX settings:
{
"mtp_enabled": true,
"dflash_enabled": false,
"turboquant_kv_enabled": false
}
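If your settings are available as a Python dict, a small check can flag keys that differ from the recommendation. How and where oMLX stores these settings is not specified here, so loading them is left to you; `settings_mismatches` is a hypothetical helper.

```python
RECOMMENDED = {
    "mtp_enabled": True,
    "dflash_enabled": False,
    "turboquant_kv_enabled": False,
}


def settings_mismatches(settings: dict) -> dict:
    """Return {key: (expected, actual)} for keys that differ from RECOMMENDED.

    Keys missing from `settings` are reported with an actual value of None.
    """
    return {
        key: (expected, settings.get(key))
        for key, expected in RECOMMENDED.items()
        if settings.get(key) != expected
    }
```

An empty result means all three recommended values are in place.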
Attribution
Credit to the upstream work:
- Qwen/Qwen3.6-35B-A3B for the base model.
- lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled for the Claude Opus 4.7 reasoning distillation.
- r3lax/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled for the source checkpoint used here.
- Anthropic Claude Opus 4.7 as the teacher model used by the upstream distillation.
- oMLX / MLX community for the Apple Silicon runtime and MTP support.
Please credit the upstream authors if you use or redistribute this conversion.
Caveats
- MTP runtime activation was verified in oMLX only.
- This conversion is experimental.
- Reasoning models can emit long <think> traces; set max_tokens intentionally.
- The upstream model card notes that distillation transfers reasoning style, not new factual knowledge.
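The `<think>` caveat can be handled in post-processing when only the final answer is wanted. The sketch below assumes traces are wrapped in `<think>` and `</think>` tags in the generated text; whether a given runtime returns them inline or strips them is not stated here. An unclosed trace, such as one cut off by a low max_tokens, is treated as having no answer.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove reasoning traces and return only the visible answer."""
    cleaned = THINK_BLOCK.sub("", text)
    # A leftover opening tag means the trace was truncated before it closed,
    # so everything after it is unfinished reasoning rather than an answer.
    open_index = cleaned.find("<think>")
    if open_index != -1:
        cleaned = cleaned[:open_index]
    return cleaned.strip()
```

An empty result from `strip_think` is a hint that the token budget ran out during reasoning and max_tokens should be raised.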
License
Apache-2.0, following the upstream model metadata. Check upstream model cards and any teacher-data usage policies before redistribution or commercial deployment.