Step 3.7 Flash Q4_K_M GGUF + MTP Drafts
Legacy combined repo: this repo keeps the original combined upload for existing links, but it mixes full-model GGUF and MTP draft GGUF files. Hugging Face's GGUF widget can display those draft files as if they were tiny full-model quants, which is misleading.
Use the split repos instead:
- Full model: notSnix/Step-3.7-Flash-Q4_K_M-GGUF
- MTP drafts: notSnix/Step-3.7-Flash-MTP-Draft-GGUF
This repo contains a llama.cpp GGUF release of stepfun-ai/Step-3.7-Flash with:
- main model quantized as Q4_K_M
- separate MTP draft GGUFs in Q8_0, Q6_K, Q4_K_M, and BF16
- the chat template used for the tested llama.cpp server runs
The source model is Apache-2.0. The original model is multimodal, but these GGUF artifacts were prepared and tested for text-side llama.cpp serving.
Files
| File | Size | SHA256 | Purpose |
|---|---|---|---|
| Step-3.7-Flash-Q4_K_M.gguf | 111 GB | 4de6519cf0131820d81137ebe6a0ab8dc225f1c463cc385038ab7de41ee7a36f | Main model |
| Step-3.7-Flash-MTP-Q8_0.gguf | 3.5 GB | 017de8990140621b5b4af431448f20873fbf0b052f6c50d2afac15f45802a98d | Recommended MTP draft model |
| Step-3.7-Flash-MTP-Q6_K.gguf | 2.7 GB | f41736e0dcce133d0dd0b81e14bd2965091e27dff306a28cec11ceb19fadbf46 | Smaller Q6_K MTP draft model |
| Step-3.7-Flash-MTP-Q4_K_M.gguf | 2.0 GB | 44118cfe64f45b38127ad6fb626e16bd94ee5a827cb34aa83d9e6df3450aebaf | Smaller Q4_K_M MTP draft model |
| Step-3.7-Flash-MTP-BF16.gguf | 6.5 GB | fd811c81d14c786d314d8006655bba61971059abcfdfb6109ce83fd768f8b289 | Experimental BF16 MTP draft model |
| chat_template.jinja | 5.6 KB | f428623fc81c940c35be3509fbffc086b4b4360d8800e46103e6f34d02891633 | Chat template |
| llama.cpp-step37-mtp.patch | 30 KB | aaf34eb89666407321f159edcc6c7e22baafc342fe4cec7b568a0755e8027f80 | Legacy fallback patch for older llama.cpp builds |
Runtime
Current llama.cpp main supports Step MTP-tail draft loading natively. This was smoke-tested with clean llama.cpp commit d545a2a993849fcf3b752d85ae256fc9d6a9de79 and --spec-type draft-mtp.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j
Older llama.cpp builds still need the included fallback patch. A clean checkout of commit 40d5358d3c730b81729ba81cd5c44ed596d02510 will fail on the draft GGUF with:
missing tensor blk.0.attn_norm.weight
Use the included patch:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 40d5358d3c730b81729ba81cd5c44ed596d02510
curl -L -o llama.cpp-step37-mtp.patch \
https://huggingface.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF/resolve/main/llama.cpp-step37-mtp.patch
git apply llama.cpp-step37-mtp.patch
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j
The MTP draft file is loaded with --model-draft; it is not a replacement for the main model. In local testing, the Q8_0 draft was faster than the BF16 draft, so Q8_0 is the recommended default. The Q6_K and Q4_K_M drafts are provided as smaller options.
256k Context Command
llama-server \
--model Step-3.7-Flash-Q4_K_M.gguf \
--model-draft Step-3.7-Flash-MTP-Q8_0.gguf \
--host 0.0.0.0 \
--port 8000 \
--alias step-3-7-flash-q4-k-m-mtp-nmax2-pmin060-256k \
--ctx-size 262144 \
--n-gpu-layers all \
--split-mode layer \
--parallel 1 \
--reasoning on \
--reasoning-format deepseek \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.60 \
--chat-template-file chat_template.jinja
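Once the server from the command above is running, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at localhost on port 8000 and that the model name matches the `--alias` value; the function names `build_chat_request` and `chat` are hypothetical helpers, not part of llama.cpp.

```python
import json
import urllib.request

# Assumptions: server from the command above, reachable on localhost:8000,
# and the model name equal to the --alias value passed to llama-server.
BASE_URL = "http://localhost:8000/v1"
MODEL_ALIAS = "step-3-7-flash-q4-k-m-mtp-nmax2-pmin060-256k"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` returns the reply text when the server is up. Building the request separately from sending it makes the payload easy to inspect without a running server.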
Local Benchmark Snapshot
GPUs: RTX PRO 6000, RTX 4080 SUPER, 2x RTX 3090.
Recommended local setting after a sweep: --spec-draft-n-max 2 --spec-draft-p-min 0.60 with the Q8_0 draft.
| Run | Prompt tokens | Prefill | Decode | TTFT | Notes |
|---|---|---|---|---|---|
| Q4_K_M + Q8_0 MTP n_max=2 p_min=0.60 | 32,769 | 1823.47 tok/s | 104.38 tok/s | 18.054 s | 87.1% draft accepted |
| Q4_K_M + BF16 MTP n_max=2 p_min=0.60 | 32,769 | 1835.66 tok/s | 93.38 tok/s | 17.904 s | 79.3% draft accepted |
| Q4_K_M + BF16 MTP n_max=2 p_min=0.60 | 65,537 | 1626.84 tok/s | 94.79 tok/s | 40.391 s | 81.2% draft accepted |
| Q4_K_M + MTP n_max=3 | 604 | - | 143.81 tok/s | 0.415 s | 172/181 draft accepted, 95.0% |
| Q4_K_M + MTP n_max=3 | 32,519 | 2097.79 tok/s | 104.91 tok/s | 15.62 s | 60/73 draft accepted, 82.2% |
| Q4_K_M + MTP n_max=3 | 54,619 | 1909.23 tok/s | 106.73 tok/s | 28.82 s | 60/70 draft accepted, 85.7% |
| Q4_K_S baseline | 604 | 1738.12 tok/s | 110.70 tok/s | 0.352 s | no MTP |
| Q4_K_S baseline | 54,619 | 2194.42 tok/s | 89.15 tok/s | 25.16 s | no MTP |
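The table's columns can be cross-checked with simple arithmetic. The sketch below assumes that time to first token is dominated by prefill, so TTFT is roughly prompt tokens divided by prefill speed, and it recomputes the acceptance percentages from the accepted/drafted counts. All figures are copied from the table; the helper names are hypothetical.

```python
# (label, prompt_tokens, prefill_tok_per_s, reported_ttft_s), copied from the table.
PREFILL_ROWS = [
    ("Q8_0 MTP, 32,769 tokens", 32_769, 1823.47, 18.054),
    ("BF16 MTP, 32,769 tokens", 32_769, 1835.66, 17.904),
    ("BF16 MTP, 65,537 tokens", 65_537, 1626.84, 40.391),
]

# (label, accepted_draft_tokens, drafted_tokens), copied from the table.
DRAFT_COUNTS = [
    ("n_max=3, 604 tokens", 172, 181),
    ("n_max=3, 32,519 tokens", 60, 73),
    ("n_max=3, 54,619 tokens", 60, 70),
]


def estimated_ttft(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Approximate TTFT in seconds, assuming prefill dominates."""
    return prompt_tokens / prefill_tok_per_s


def acceptance_pct(accepted: int, drafted: int) -> float:
    """Share of drafted tokens that were accepted, in percent (one decimal)."""
    return round(100 * accepted / drafted, 1)


def report() -> None:
    """Print the estimates next to the reported values."""
    for label, tokens, speed, reported in PREFILL_ROWS:
        print(f"{label}: estimated {estimated_ttft(tokens, speed):.2f} s, reported {reported} s")
    for label, accepted, drafted in DRAFT_COUNTS:
        print(f"{label}: {acceptance_pct(accepted, drafted)}% accepted")
```

The estimates land within a fraction of a second of the reported TTFT values, which suggests the prefill and TTFT columns are mutually consistent. Note that the baseline rows use a Q4_K_S quant, so decode-speed comparisons against the Q4_K_M + MTP rows are not strictly like for like.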
Limited task checks:
| Check | Q4_K_S baseline | Q4_K_M + MTP n_max=3 |
|---|---|---|
| ARC Challenge chat, 10 samples | 0.9 | 0.9 |
| GSM8K strict/flexible, 10 samples | 0.9 / 0.9 | 0.8 / 0.8 |
| Code needle / NIAH reasoning-aware | 12/12 | 12/12 |
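With only 10 samples per check, a difference of 0.9 versus 0.8 amounts to a single sample. The sketch below computes a 95% Wilson score interval to illustrate how wide the uncertainty is at this sample size. It assumes a score of 0.9 means 9 of 10 correct; the function name is a hypothetical helper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Assumption: 0.9 and 0.8 on 10 samples correspond to 9/10 and 8/10 correct.
baseline_low, baseline_high = wilson_interval(9, 10)
mtp_low, mtp_high = wilson_interval(8, 10)
```

Each interval contains the other run's score, so the GSM8K gap in the table is not evidence of a quality regression on its own; a larger sample would be needed to tell.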
Checksums
After download:
sha256sum -c SHA256SUMS
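Where `sha256sum` is not available, the same check can be done with Python's standard library. This is a minimal sketch that hashes a file in chunks, so that the 111 GB main model does not need to fit in memory, and compares the result with a value from the Files table. The helper names and the local file path are assumptions.

```python
import hashlib
from pathlib import Path

# Expected digest copied from the Files table for the recommended draft model.
EXPECTED = {
    "Step-3.7-Flash-MTP-Q8_0.gguf":
        "017de8990140621b5b4af431448f20873fbf0b052f6c50d2afac15f45802a98d",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """True if the file's digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify(Path("Step-3.7-Flash-MTP-Q8_0.gguf"), EXPECTED["Step-3.7-Flash-MTP-Q8_0.gguf"])` should return `True` for an intact download in the current directory.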
Notes
- The base model advertises 256k context; this GGUF release was loaded locally at 256k context with MTP enabled.
- The MTP draft tensors intentionally keep the upstream tail-layer numbering (blk.45, blk.46, blk.47). That matches the official Step layout and is supported by current llama.cpp main.
- --spec-draft-n-max 2 and --spec-draft-p-min 0.60 were the best balanced local MTP settings from the tested runs.
- The BF16 MTP draft is included for experimentation, but the Q8_0 draft was faster on the tested rig.
- This is a community GGUF quantization/repackaging of the upstream Apache-2.0 model, not an official StepFun release.
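To build intuition for why a short draft length such as --spec-draft-n-max 2 can be a good balance, a common toy model of speculative decoding is sketched below. It assumes every drafted token is accepted independently with the same probability, and that the target model contributes one token per verification pass. This is an idealization, not a description of llama.cpp's implementation: it ignores the --spec-draft-p-min cutoff and the cost of drafting, and the "draft accepted" percentages above are not necessarily per-position probabilities.

```python
def expected_tokens_per_pass(accept_prob: float, n_max: int) -> float:
    """Expected tokens emitted per target verification pass in the toy model.

    Draft token k is only used if tokens 1..k-1 were also accepted, so it
    contributes accept_prob**k; the k=0 term is the target model's own token.
    """
    return sum(accept_prob ** k for k in range(n_max + 1))


def marginal_gain(accept_prob: float, n_max: int) -> float:
    """Extra expected tokens from drafting one more token (n_max -> n_max + 1)."""
    return expected_tokens_per_pass(accept_prob, n_max + 1) - expected_tokens_per_pass(accept_prob, n_max)
```

In this model each additional drafted token adds less than the previous one, since it only counts when every earlier draft token was accepted, while the drafting cost per token stays roughly constant. That diminishing return is one plausible reason a sweep would settle on a small n_max.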