Step 3.7 Flash Q4_K_M GGUF + MTP Drafts

Legacy combined repo: this repo keeps the original combined upload for existing links, but it mixes full-model GGUF and MTP draft GGUF files. Hugging Face's GGUF widget can display those draft files as if they were tiny full-model quants, which is misleading.

Use the split repos instead:

This repo contains a llama.cpp GGUF release of stepfun-ai/Step-3.7-Flash with:

  • main model quantized as Q4_K_M
  • separate MTP draft GGUFs in Q8_0, Q6_K, Q4_K_M, and BF16
  • the chat template used for the tested llama.cpp server runs

The source model is Apache-2.0. The original model is multimodal, but these GGUF artifacts were prepared and tested for text-side llama.cpp serving.

Files

File Size SHA256 Purpose
Step-3.7-Flash-Q4_K_M.gguf 111 GB 4de6519cf0131820d81137ebe6a0ab8dc225f1c463cc385038ab7de41ee7a36f Main model
Step-3.7-Flash-MTP-Q8_0.gguf 3.5 GB 017de8990140621b5b4af431448f20873fbf0b052f6c50d2afac15f45802a98d Recommended MTP draft model
Step-3.7-Flash-MTP-Q6_K.gguf 2.7 GB f41736e0dcce133d0dd0b81e14bd2965091e27dff306a28cec11ceb19fadbf46 Smaller Q6_K MTP draft model
Step-3.7-Flash-MTP-Q4_K_M.gguf 2.0 GB 44118cfe64f45b38127ad6fb626e16bd94ee5a827cb34aa83d9e6df3450aebaf Smaller Q4_K_M MTP draft model
Step-3.7-Flash-MTP-BF16.gguf 6.5 GB fd811c81d14c786d314d8006655bba61971059abcfdfb6109ce83fd768f8b289 Experimental BF16 MTP draft model
chat_template.jinja 5.6 KB f428623fc81c940c35be3509fbffc086b4b4360d8800e46103e6f34d02891633 Chat template
llama.cpp-step37-mtp.patch 30 KB aaf34eb89666407321f159edcc6c7e22baafc342fe4cec7b568a0755e8027f80 Legacy fallback patch for older llama.cpp builds

Runtime

Current llama.cpp main supports Step MTP-tail draft loading natively. This was smoke-tested with clean llama.cpp commit d545a2a993849fcf3b752d85ae256fc9d6a9de79 and --spec-type draft-mtp.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

Older llama.cpp builds still need the included fallback patch. A clean checkout of commit 40d5358d3c730b81729ba81cd5c44ed596d02510 will fail on the draft GGUF with:

missing tensor blk.0.attn_norm.weight

Use the included patch:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 40d5358d3c730b81729ba81cd5c44ed596d02510
curl -L -o llama.cpp-step37-mtp.patch \
  https://huggingface.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF/resolve/main/llama.cpp-step37-mtp.patch
git apply llama.cpp-step37-mtp.patch
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

The MTP draft file is loaded with --model-draft; it is not a replacement for the main model. In local testing, the Q8_0 draft was faster than the BF16 draft, so Q8_0 is the recommended default. The Q6_K and Q4_K_M drafts are provided as smaller options.

256k Context Command

llama-server \
  --model Step-3.7-Flash-Q4_K_M.gguf \
  --model-draft Step-3.7-Flash-MTP-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --alias step-3-7-flash-q4-k-m-mtp-nmax2-pmin060-256k \
  --ctx-size 262144 \
  --n-gpu-layers all \
  --split-mode layer \
  --parallel 1 \
  --reasoning on \
  --reasoning-format deepseek \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.60 \
  --chat-template-file chat_template.jinja

Local Benchmark Snapshot

GPUs: RTX PRO 6000, RTX 4080 SUPER, 2x RTX 3090.

Recommended local setting after a sweep: --spec-draft-n-max 2 --spec-draft-p-min 0.60 with the Q8_0 draft.

Run Prompt tokens Prefill Decode TTFT Notes
Q4_K_M + Q8_0 MTP n_max=2 p_min=0.60 32,769 1823.47 tok/s 104.38 tok/s 18.054 s 87.1% draft accepted
Q4_K_M + BF16 MTP n_max=2 p_min=0.60 32,769 1835.66 tok/s 93.38 tok/s 17.904 s 79.3% draft accepted
Q4_K_M + BF16 MTP n_max=2 p_min=0.60 65,537 1626.84 tok/s 94.79 tok/s 40.391 s 81.2% draft accepted
Q4_K_M + MTP n_max=3 604 - 143.81 tok/s 0.415 s 172/181 draft accepted, 95.0%
Q4_K_M + MTP n_max=3 32,519 2097.79 tok/s 104.91 tok/s 15.62 s 60/73 draft accepted, 82.2%
Q4_K_M + MTP n_max=3 54,619 1909.23 tok/s 106.73 tok/s 28.82 s 60/70 draft accepted, 85.7%
Q4_K_S baseline 604 1738.12 tok/s 110.70 tok/s 0.352 s no MTP
Q4_K_S baseline 54,619 2194.42 tok/s 89.15 tok/s 25.16 s no MTP

Limited task checks:

Check Q4_K_S baseline Q4_K_M + MTP n_max=3
ARC Challenge chat, 10 samples 0.9 0.9
GSM8K strict/flexible, 10 samples 0.9 / 0.9 0.8 / 0.8
Code needle / NIAH reasoning-aware 12/12 12/12

Checksums

After download:

sha256sum -c SHA256SUMS

Notes

  • The base model advertises 256k context; this GGUF release was loaded locally at 256k context with MTP enabled.
  • The MTP draft tensors intentionally keep the upstream tail-layer numbering (blk.45, blk.46, blk.47). That matches the official Step layout and is supported by current llama.cpp main.
  • --spec-draft-n-max 2 and --spec-draft-p-min 0.60 were the best balanced local MTP settings from the tested runs.
  • The BF16 MTP draft is included for experimentation, but the Q8_0 draft was faster on the tested rig.
  • This is a community GGUF quantization/repackaging of the upstream Apache-2.0 model, not an official StepFun release.
Downloads last month
1,154
GGUF
Model size
197B params
Architecture
step35
Hardware compatibility
Log In to add your hardware

4-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF

Quantized
(23)
this model