How to use from the MLX library
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Qwen3.6-35B-A3B Claude 4.7 Opus Reasoning Distilled - MLX oQ4 MTP

MLX/oMLX 4-bit conversion of r3lax/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled with the Qwen MTP (multi-token prediction) tensors preserved and runtime-tested in oMLX.

This is not a new training run or fine-tune. Weights were only converted/quantized for local MLX/oMLX inference.

Quick Facts

  • Architecture: Qwen3.6 35B-A3B MoE, roughly 3B active parameters per token.
  • Quantization: oQ4-style MLX 4-bit, group size 64.
  • MTP: preserved and verified in oMLX.
  • Test hardware: Apple Silicon M5 Pro with 48GB unified memory.
  • Runtime tested with oMLX native MTP enabled.
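As a rough sanity check on whether a 4-bit, group-size-64 conversion fits in 48GB of unified memory, the weight footprint can be estimated. This is a back-of-the-envelope sketch, not a measurement: it assumes every weight is quantized uniformly and that each group of 64 weights carries one 16-bit scale and one 16-bit bias. An oQ4-style conversion may keep some tensors at higher precision (the MTP fusion projection, for example), so the real size on disk will differ.

```python
def quantized_bits_per_weight(bits: int, group_size: int, overhead_bits: int = 32) -> float:
    """Effective bits per weight when each group carries extra metadata.

    overhead_bits=32 is an assumption: one 16-bit scale plus one 16-bit bias per group.
    """
    return bits + overhead_bits / group_size


def estimated_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


bpw = quantized_bits_per_weight(bits=4, group_size=64)
size_gb = estimated_weight_gb(35e9, bpw)
print(f"{bpw:.2f} bits/weight, about {size_gb:.1f} GB of weights")
```

The estimate covers weights only. KV cache, activations, and the rest of the system also need memory.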

MTP Verification

MTP config: mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=false
MTP tensor entries: 42
MTP fusion projection: language_model.mtp.fc.weight present and full precision
Runtime: oMLX logged native MTP path activation during smoke tests

MTP support is runtime-specific. These tensors are preserved, but non-oMLX runtimes may ignore them.
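One way to repeat the tensor-entry check locally is to count the tensor names that contain an mtp path component. The sketch below assumes the checkpoint is sharded and ships a model.safetensors.index.json file with a weight_map section, and that MTP tensors are named like language_model.mtp.fc.weight. The helper names are illustrative, and the second MTP tensor name in the example is made up for the demonstration.

```python
import json
from pathlib import Path


def mtp_tensor_names(weight_map: dict) -> list:
    """Return sorted tensor names that have 'mtp' as a dotted path component."""
    return sorted(name for name in weight_map if "mtp" in name.split("."))


def load_weight_map(model_dir: str) -> dict:
    """Read the tensor-name -> shard-file mapping (assumes a sharded checkpoint index)."""
    index_path = Path(model_dir).expanduser() / "model.safetensors.index.json"
    with index_path.open() as f:
        return json.load(f)["weight_map"]


# Tiny synthetic weight map; only the fc weight name comes from this card.
example = {
    "language_model.mtp.fc.weight": "model-00001-of-00002.safetensors",
    "language_model.mtp.norm.weight": "model-00001-of-00002.safetensors",
    "language_model.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
}
print(mtp_tensor_names(example))
```

On a real download, len(mtp_tensor_names(load_weight_map(model_dir))) can be compared against the entry count listed above. Quantized tensors may appear as several entries (weights, scales, biases), so the count depends on how entries are tallied.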

Local Speed Smoke Test

Measured on an M5 Pro Mac with 48GB unified memory using oMLX.

MTP on average:  ~89.9 tok/s
MTP off average: ~86.0 tok/s
Speed lift:       ~+4.6% with native MTP enabled
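The speed lift is the relative change in throughput between the two averages. A minimal sketch of that arithmetic, using the rounded averages shown above (the function name is illustrative):

```python
def speed_lift_percent(tok_s_on: float, tok_s_off: float) -> float:
    """Relative throughput change, in percent, of the MTP-on run over the MTP-off run."""
    return (tok_s_on / tok_s_off - 1.0) * 100.0


lift = speed_lift_percent(89.9, 86.0)
print(f"{lift:.2f}%")
```

With the rounded inputs this comes out slightly under the reported figure, which presumably was computed from unrounded measurements.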

Per-prompt MTP-on smoke results:

count120:   160 tokens, 1.726s, 92.69 tok/s, MTP accept 79/79 and 79/79
rain180:    220 tokens, 2.465s, 89.26 tok/s, MTP accept 96/123 and 98/121
jsoncities: 128 tokens, 1.457s, 87.86 tok/s, MTP accept 64/65 and 64/64

These are local smoke numbers, not a universal benchmark. Prompt, cache state, batching, oMLX version, and hardware will change results.
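The per-prompt figures can be re-derived from the token counts and timings. The sketch below makes two assumptions that the results do not spell out: that each "accept a/b" pair means accepted draft tokens over proposed draft tokens, and that the two pairs per prompt can be pooled into one rate.

```python
# (name, generated tokens, seconds, [(accepted, proposed), ...]) copied from the results above.
RESULTS = [
    ("count120", 160, 1.726, [(79, 79), (79, 79)]),
    ("rain180", 220, 2.465, [(96, 123), (98, 121)]),
    ("jsoncities", 128, 1.457, [(64, 65), (64, 64)]),
]


def throughput(tokens: int, seconds: float) -> float:
    """Generated tokens per second."""
    return tokens / seconds


def acceptance_rate(pairs: list) -> float:
    """Pooled acceptance: total accepted over total proposed (an assumed reading of 'a/b')."""
    accepted = sum(a for a, _ in pairs)
    proposed = sum(b for _, b in pairs)
    return accepted / proposed


for name, tokens, seconds, pairs in RESULTS:
    print(f"{name}: {throughput(tokens, seconds):.2f} tok/s, "
          f"acceptance {acceptance_rate(pairs):.1%}")
```

Recomputed throughput can differ from the listed values in the last digit because the timings above are rounded to milliseconds.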

Usage Notes

Place the model under your oMLX model directory and enable native MTP:

~/.omlx/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP
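One possible way to get the files into that directory is huggingface_hub's snapshot_download. This is a sketch rather than the documented oMLX workflow: it assumes huggingface_hub is installed and that oMLX picks up any model folder under ~/.omlx/models. The helper names are illustrative.

```python
from pathlib import Path

REPO_ID = "stamsam/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MLX-oQ4-MTP"


def omlx_model_dir(repo_id: str, models_root: str = "~/.omlx/models") -> Path:
    """Target folder: the models root plus the repository name without the owner prefix."""
    return Path(models_root).expanduser() / repo_id.split("/")[-1]


def download_model(repo_id: str = REPO_ID) -> Path:
    """Download the repository snapshot into the oMLX model directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = omlx_model_dir(repo_id)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Show where the files would land; call download_model() to actually fetch them.
print(omlx_model_dir(REPO_ID))
```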

Recommended oMLX settings:

{
  "mtp_enabled": true,
  "dflash_enabled": false,
  "turboquant_kv_enabled": false
}
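If these settings live in a JSON file, they can be merged in without clobbering other keys. This card does not say where oMLX stores its settings or whether it reads them from a file at all, so the path in this sketch is a placeholder and the whole approach is an assumption. Check the oMLX documentation for the real location before using it.

```python
import json
import tempfile
from pathlib import Path

RECOMMENDED = {
    "mtp_enabled": True,
    "dflash_enabled": False,
    "turboquant_kv_enabled": False,
}


def merge_settings(settings_path, overrides: dict) -> dict:
    """Update a JSON settings file in place, keeping any keys not being overridden."""
    path = Path(settings_path).expanduser()
    current = json.loads(path.read_text()) if path.exists() else {}
    current.update(overrides)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(current, indent=2) + "\n")
    return current


# Demonstration against a temporary file; "settings.json" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "settings.json"
    demo_path.write_text(json.dumps({"some_other_key": 1, "mtp_enabled": False}))
    merged = merge_settings(demo_path, RECOMMENDED)
print(merged)
```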

Attribution

Credit to the upstream work, r3lax/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, which this conversion is based on.

Please credit the upstream authors if you use or redistribute this conversion.

Caveats

  • MTP runtime activation was verified in oMLX only.
  • This conversion is experimental.
  • Reasoning models can emit long <think> traces; set max_tokens intentionally.
  • The upstream model card notes that distillation transfers reasoning style, not new factual knowledge.
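To act on the max_tokens caveat with mlx-lm, an explicit token budget can be passed to generate, and the reasoning trace can be split from the final answer afterwards. This is a minimal sketch: the 2048-token budget is an arbitrary example value, the helper names are illustrative, and the splitter assumes the trace is delimited by <think> and </think> tags (some chat templates emit the opening tag as part of the prompt, so only the closing tag appears in the output).

```python
def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer) from generated text that may contain a <think> trace."""
    head, close, tail = text.partition("</think>")
    if close:
        # Closed trace: everything before the closing tag is reasoning.
        return head.replace("<think>", "", 1).strip(), tail.strip()
    if "<think>" in text:
        # Opened but never closed: the trace was probably cut off by max_tokens.
        return text.partition("<think>")[2].strip(), ""
    return "", text.strip()


def generate_with_budget(model, tokenizer, prompt, max_tokens: int = 2048) -> tuple:
    """Generate with an explicit token budget, then split reasoning from the answer."""
    # Imported lazily so split_reasoning can be used without mlx-lm installed.
    from mlx_lm import generate

    text = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
    return split_reasoning(text)


print(split_reasoning("<think>plan the story</think>Once upon a time..."))
```

An empty answer alongside a non-empty reasoning string is a hint that the budget ran out before the model finished thinking.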

License

Apache-2.0, following the upstream model metadata. Check upstream model cards and any teacher-data usage policies before redistribution or commercial deployment.

Safetensors model size: 35B params. Tensor types: BF16, U32.