Instructions to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF",
	filename="Step-3.7-Flash-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
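
`create_chat_completion` returns an OpenAI-style dictionary rather than a plain string. A minimal sketch of pulling out the reply text, assuming the usual `choices[0]["message"]["content"]` layout; the helper name `reply_text` and the sample response are illustrative, not model output:

```python
from typing import Any, Dict


def reply_text(response: Dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # "content" may be None when the model returned no text.
    return choices[0]["message"].get("content") or ""


# Illustrative response shape only, not real model output.
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(reply_text(sample))
```

In practice the dictionary passed in would be the return value of `llm.create_chat_completion(...)`.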

llama.cpp

How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Use Docker

docker model run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

vLLM

How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Ollama
How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with Ollama:
```
ollama run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M
```

Unsloth Studio

How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF to start chatting

Pi

How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M
```

Lemonade

How to use notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Step-3.7-Flash-Q4_K_M-MTP-GGUF-Q4_K_M

List all available models

lemonade list

Step 3.7 Flash Q4_K_M GGUF + MTP Drafts

Legacy combined repo: this repo keeps the original combined upload for existing links, but it mixes full-model GGUF and MTP draft GGUF files. Hugging Face's GGUF widget can display those draft files as if they were tiny full-model quants, which is misleading.

Use the split repos instead:

Full model: notSnix/Step-3.7-Flash-Q4_K_M-GGUF

MTP drafts: notSnix/Step-3.7-Flash-MTP-Draft-GGUF

This repo contains a llama.cpp GGUF release of stepfun-ai/Step-3.7-Flash with:

- the main model quantized as Q4_K_M
- separate MTP draft GGUFs in Q8_0, Q6_K, Q4_K_M, and BF16
- the chat template used for the tested llama.cpp server runs

The source model is Apache-2.0. The original model is multimodal, but these GGUF artifacts were prepared and tested for text-side llama.cpp serving.

Files

| File | Size | SHA256 | Purpose |
|---|---|---|---|
| `Step-3.7-Flash-Q4_K_M.gguf` | 111 GB | `4de6519cf0131820d81137ebe6a0ab8dc225f1c463cc385038ab7de41ee7a36f` | Main model |
| `Step-3.7-Flash-MTP-Q8_0.gguf` | 3.5 GB | `017de8990140621b5b4af431448f20873fbf0b052f6c50d2afac15f45802a98d` | Recommended MTP draft model |
| `Step-3.7-Flash-MTP-Q6_K.gguf` | 2.7 GB | `f41736e0dcce133d0dd0b81e14bd2965091e27dff306a28cec11ceb19fadbf46` | Smaller Q6_K MTP draft model |
| `Step-3.7-Flash-MTP-Q4_K_M.gguf` | 2.0 GB | `44118cfe64f45b38127ad6fb626e16bd94ee5a827cb34aa83d9e6df3450aebaf` | Smaller Q4_K_M MTP draft model |
| `Step-3.7-Flash-MTP-BF16.gguf` | 6.5 GB | `fd811c81d14c786d314d8006655bba61971059abcfdfb6109ce83fd768f8b289` | Experimental BF16 MTP draft model |
| `chat_template.jinja` | 5.6 KB | `f428623fc81c940c35be3509fbffc086b4b4360d8800e46103e6f34d02891633` | Chat template |
| `llama.cpp-step37-mtp.patch` | 30 KB | `aaf34eb89666407321f159edcc6c7e22baafc342fe4cec7b568a0755e8027f80` | Legacy fallback patch for older llama.cpp builds |

Runtime

Current llama.cpp main supports Step MTP-tail draft loading natively. This was smoke-tested with clean llama.cpp commit d545a2a993849fcf3b752d85ae256fc9d6a9de79 and --spec-type draft-mtp.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

Older llama.cpp builds still need the included fallback patch. A clean checkout of commit 40d5358d3c730b81729ba81cd5c44ed596d02510 will fail on the draft GGUF with:

missing tensor blk.0.attn_norm.weight

Use the included patch:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 40d5358d3c730b81729ba81cd5c44ed596d02510
curl -L -o llama.cpp-step37-mtp.patch \
  https://huggingface.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF/resolve/main/llama.cpp-step37-mtp.patch
git apply llama.cpp-step37-mtp.patch
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

The MTP draft file is loaded with --model-draft; it is not a replacement for the main model. In local testing, the Q8_0 draft was faster than the BF16 draft, so Q8_0 is the recommended default. The Q6_K and Q4_K_M drafts are provided as smaller options.

256k Context Command

llama-server \
  --model Step-3.7-Flash-Q4_K_M.gguf \
  --model-draft Step-3.7-Flash-MTP-Q8_0.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --alias step-3-7-flash-q4-k-m-mtp-nmax2-pmin060-256k \
  --ctx-size 262144 \
  --n-gpu-layers all \
  --split-mode layer \
  --parallel 1 \
  --reasoning on \
  --reasoning-format deepseek \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-p-min 0.60 \
  --chat-template-file chat_template.jinja
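
Once the server from the command above is running, it can be queried through its OpenAI-compatible endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the `--port 8000` and `--alias` values shown above, and it assumes that with `--reasoning-format deepseek` the thinking text arrives in a separate `reasoning_content` field of the message; the function names are illustrative.

```python
import json
import urllib.request
from typing import Any, Dict, Tuple

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 above
MODEL = "step-3-7-flash-q4-k-m-mtp-nmax2-pmin060-256k"  # matches --alias above


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response: Dict[str, Any]) -> Tuple[str, str]:
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    # Assumption: reasoning is delivered in "reasoning_content"; it is
    # treated as optional so the function also works when it is absent.
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(prompt: str) -> Tuple[str, str]:
    """Send one prompt to the running server and return (reasoning, answer)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return split_reply(json.load(resp))
```

Calling `ask("What is the capital of France?")` requires the server to be up; the two helper functions can be exercised without it.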

Local Benchmark Snapshot

GPUs: RTX PRO 6000, RTX 4080 SUPER, 2x RTX 3090.

Recommended local setting after a sweep: --spec-draft-n-max 2 --spec-draft-p-min 0.60 with the Q8_0 draft.

| Run | Prompt tokens | Prefill | Decode | TTFT | Notes |
|---|---|---|---|---|---|
| Q4_K_M + Q8_0 MTP `n_max=2 p_min=0.60` | 32,769 | 1823.47 tok/s | 104.38 tok/s | 18.054 s | 87.1% draft accepted |
| Q4_K_M + BF16 MTP `n_max=2 p_min=0.60` | 32,769 | 1835.66 tok/s | 93.38 tok/s | 17.904 s | 79.3% draft accepted |
| Q4_K_M + BF16 MTP `n_max=2 p_min=0.60` | 65,537 | 1626.84 tok/s | 94.79 tok/s | 40.391 s | 81.2% draft accepted |
| Q4_K_M + MTP `n_max=3` | 604 | - | 143.81 tok/s | 0.415 s | 172/181 draft accepted, 95.0% |
| Q4_K_M + MTP `n_max=3` | 32,519 | 2097.79 tok/s | 104.91 tok/s | 15.62 s | 60/73 draft accepted, 82.2% |
| Q4_K_M + MTP `n_max=3` | 54,619 | 1909.23 tok/s | 106.73 tok/s | 28.82 s | 60/70 draft accepted, 85.7% |
| Q4_K_S baseline | 604 | 1738.12 tok/s | 110.70 tok/s | 0.352 s | no MTP |
| Q4_K_S baseline | 54,619 | 2194.42 tok/s | 89.15 tok/s | 25.16 s | no MTP |
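
The Notes column reports draft acceptance as accepted/drafted counts, and the TTFT in each row is close to prompt tokens divided by prefill rate. A small sketch that reproduces two figures from the table; the function names are illustrative, and reading the leftover TTFT as fixed per-request overhead is an assumption rather than something measured here:

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Percentage of drafted tokens that the main model accepted."""
    return 100.0 * accepted / drafted


def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time spent processing the prompt at the given prefill rate."""
    return prompt_tokens / prefill_tok_per_s


# 604-token run with n_max=3: 172 of 181 drafted tokens accepted.
print(f"{acceptance_rate(172, 181):.1f}%")          # 95.0%

# Q8_0 draft run: 32,769 prompt tokens at 1823.47 tok/s.
print(f"{prefill_seconds(32769, 1823.47):.2f} s")   # 17.97 s
```

The second figure sits just under the reported 18.054 s TTFT for that row, which suggests that prefill accounts for nearly all of the time to first token at this prompt length.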

Limited task checks:

| Check | Q4_K_S baseline | Q4_K_M + MTP `n_max=3` |
|---|---|---|
| ARC Challenge chat, 10 samples | 0.9 | 0.9 |
| GSM8K strict/flexible, 10 samples | 0.9 / 0.9 | 0.8 / 0.8 |
| Code needle / NIAH reasoning-aware | 12/12 | 12/12 |

Checksums

After download:

sha256sum -c SHA256SUMS
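
On systems without `sha256sum`, a single file can be checked against the hash listed in the Files table. A minimal sketch using Python's `hashlib`; the helper names `sha256_of` and `verify` are illustrative, and the file is read in chunks so that the 111 GB main model does not need to fit in memory:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify(Path("Step-3.7-Flash-MTP-Q8_0.gguf"), "017de8990140621b5b4af431448f20873fbf0b052f6c50d2afac15f45802a98d")` should return `True` for an intact download of the Q8_0 draft.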

Notes

- The base model advertises 256k context; this GGUF release was loaded locally at 256k context with MTP enabled.
- The MTP draft tensors intentionally keep the upstream tail-layer numbering (blk.45, blk.46, blk.47). That matches the official Step layout and is supported by current llama.cpp main.
- --spec-draft-n-max 2 and --spec-draft-p-min 0.60 were the best balanced local MTP settings from the tested runs.
- The BF16 MTP draft is included for experimentation, but the Q8_0 draft was faster on the tested rig.
- This is a community GGUF quantization/repackaging of the upstream Apache-2.0 model, not an official StepFun release.
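
The tail-layer numbering can be illustrated by extracting block indices from tensor names. This is a sketch under stated assumptions: the names follow the `blk.<index>.<rest>` pattern seen in the error message above, the three sample names are illustrative rather than a listing of the real file, and `block_indices` is a hypothetical helper. Real names would have to be read from the GGUF file itself, for instance with the `gguf` Python package.

```python
import re
from typing import Iterable, List

_BLOCK_RE = re.compile(r"^blk\.(\d+)\.")


def block_indices(tensor_names: Iterable[str]) -> List[int]:
    """Return the sorted, de-duplicated block indices found in tensor names."""
    found = set()
    for name in tensor_names:
        match = _BLOCK_RE.match(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


# Illustrative names only; the real draft files contain more tensors.
draft_names = [
    "blk.45.attn_norm.weight",
    "blk.46.attn_norm.weight",
    "blk.47.attn_norm.weight",
]
indices = block_indices(draft_names)
print(indices)       # [45, 46, 47]
print(0 in indices)  # False: a tail-numbered draft has no blk.0
```

A loader that expects blocks to start at index 0 would look for `blk.0.attn_norm.weight` and not find it, which is consistent with the error reported for older llama.cpp builds.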

GGUF

Model size: 197B params
Architecture: step35
Quantizations in this repo: 4-bit, 6-bit, 8-bit, 16-bit
Base model: stepfun-ai/Step-3.7-Flash