DeepSeek-V4-Flash · MTP · W4A16-FP8 · Acti
MTP-enabled (Multi-Token Prediction) variant of pastapaul/DeepSeek-V4-Flash-W4A16-FP8, prepared by Acti for vLLM tensor-parallel deployment at TP=2 with self-speculative decoding turned on.
The base model is unchanged. We added back the DSV4 MTP head (which the original quantization stripped because HF transformers' DSV4 modeling code drops mtp.* keys at load time) so vLLM can run --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and get a real wall-clock decode speedup.
✅ v2 release: real GPTQ quantization on the MTP routed experts (replaces the v1 RTN loader-format experiment). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens captured from the running pasta-paul model, processed by a Frantar-style GPTQ pass with Cholesky H⁻¹. Per-expert mean: ~11k calibration tokens, min 430, max 175k. Manifest now declares is_final_quality_preserving: true.
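To make the self-speculation idea concrete, here is a toy sketch of greedy decoding with one draft token per step. It assumes the target model can score the next position and the position after the draft in a single pass; toy_target and toy_draft are hypothetical stand-ins, not vLLM or DSV4 APIs. The point it illustrates is that the emitted tokens match plain greedy decoding, while the number of target passes drops whenever the draft is accepted.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token (greedy)


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model,
                       prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Greedy decoding with one draft token per step. Returns (tokens, target_passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        guess = draft(out)
        # Assumption: one batched target pass scores both positions.
        verified = target(out)           # true next token
        bonus = target(out + [guess])    # only valid if the guess was right
        passes += 1
        out.append(verified)
        if guess == verified:
            out.append(bonus)            # accepted draft yields a second token
    return out[len(prompt):len(prompt) + n_new], passes


def toy_target(ctx: List[int]) -> int:
    """Hypothetical deterministic 'model' over a vocabulary of 11 tokens."""
    return (ctx[-1] * 7 + 3 + len(ctx)) % 11


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft head: agrees with the target except at every third position."""
    t = toy_target(ctx)
    return t if len(ctx) % 3 else (t + 1) % 11
```

Calling speculative_decode(toy_target, toy_draft, [1], 12) returns the same 12 tokens as greedy_decode while using fewer target passes; the better the draft, the fewer passes are needed.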
Results — single-stream decode TPS, greedy, vs no-MTP baseline
Hardware: 2× NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (sm_120, 96 GB ea., no NVLink, NUMA NODE PCIe), driver 580.126.09, NCCL 2.28.9, jasl/vllm b158e5001 + cherry-picks + Acti's MTP-loader patches.
| Profile | Decode TPS | TTFT | Δ vs base |
|---|---|---|---|
| Base (pasta-paul, no MTP), 524k | 52.85 | 91 ms | 0% (reference) |
| This model (v2 GPTQ), 524k | 85.52 | 155 ms | +62% (1.62×) |
| This model (v2 GPTQ), 128k single-stream | ~111 | ~310 ms | +110% (2.10×) |
(The v1 RTN release measured 83.97 tok/s @ 524k; v2 GPTQ adds about 2% on top of that. Most of the speedup over base comes from MTP self-speculation; the GPTQ-vs-RTN delta reflects the higher MTP draft acceptance rate from better expert weights.)
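The Δ column and the v1-to-v2 delta are plain ratios of the measured decode TPS. A small sketch of that arithmetic, using only the numbers quoted above (the helper name speedup is made up for illustration):

```python
from typing import Tuple


def speedup(base_tps: float, new_tps: float) -> Tuple[float, float]:
    """Return (ratio, percent gain) of new_tps over base_tps."""
    ratio = new_tps / base_tps
    return ratio, (ratio - 1.0) * 100.0


# Numbers taken from the results table and the v1 note above.
ratio_524k, gain_524k = speedup(52.85, 85.52)   # base vs v2 GPTQ at 524k
ratio_128k, gain_128k = speedup(52.85, 111.0)   # base vs v2 GPTQ at 128k (approximate)
_, gain_v2_over_v1 = speedup(83.97, 85.52)      # v1 RTN vs v2 GPTQ at 524k

print(f"524k: {ratio_524k:.2f}x (+{gain_524k:.0f}%)")
print(f"128k: {ratio_128k:.2f}x (+{gain_128k:.0f}%)")
print(f"v2 over v1: +{gain_v2_over_v1:.1f}%")
```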
Concurrency profiles — real measurements
The single-stream numbers above are one point on a curve. Below are the multi-user concurrency results from a thread-pool sweep on the same hardware. Different --max-model-len profiles trade off context length against how many concurrent users you can serve.
| Profile | Max concurrent users | Per-stream TPS at max | Aggregate TPS | Recommended for |
|---|---|---|---|---|
| 128k (--max-model-len 131072 --max-num-seqs 8 --gpu-memory-utilization 0.90) | 8 | 57 | 467 | High-concurrency chat / agent fleet |
| 256k (--max-model-len 262144 --max-num-seqs 4 --gpu-memory-utilization 0.95) | 4 | 76 | 296 | Medium-concurrency, longer docs / RAG |
| 524k (--max-model-len 524288 --max-num-seqs 3 --gpu-memory-utilization 0.93) | 3 | 46 | 138 | Single/few users, max context |
Per-stream scaling within each profile:
| N concurrent | 128k profile | 256k profile | 524k profile |
|---|---|---|---|
| 1 | 75 | 73 | 87 |
| 2 | 80 | 75 | 62 |
| 3 | (anomalous) | (anomalous) | 46 |
| 4 | 76 | 76 | — |
| 8 | 57 | — | — |
(The N=3 anomaly is reproducible — likely a vLLM MTP-draft batching artifact at odd counts. Even N values scale cleanly.)
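For readers who want to reproduce a sweep like this, the following is a minimal sketch of a thread-pool client, not the harness used for the numbers above. It assumes the server exposes the OpenAI-compatible /v1/completions endpoint and reports usage.completion_tokens; the function names are hypothetical. It measures end-to-end tokens per second (prefill included), so short prompts are needed to approximate decode TPS.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple


def make_vllm_send(base_url: str, model: str, prompt: str,
                   max_tokens: int) -> Callable[[], int]:
    """Build a callable that sends one request and returns its completion-token count."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens, "temperature": 0}).encode()

    def send() -> int:
        req = urllib.request.Request(f"{base_url}/v1/completions", data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["usage"]["completion_tokens"]

    return send


def timed(send: Callable[[], int]) -> Tuple[int, float]:
    """Run one request and return (tokens, tokens per second for that stream)."""
    start = time.perf_counter()
    tokens = send()
    return tokens, tokens / (time.perf_counter() - start)


def sweep(send: Callable[[], int], levels: Iterable[int]) -> Dict[int, Dict[str, float]]:
    """For each concurrency level N, run N requests in parallel and summarize TPS."""
    results: Dict[int, Dict[str, float]] = {}
    for n in levels:
        wall_start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            outcomes = [f.result() for f in [pool.submit(timed, send) for _ in range(n)]]
        wall = time.perf_counter() - wall_start
        results[n] = {
            "per_stream_tps": sum(tps for _, tps in outcomes) / n,
            "aggregate_tps": sum(tok for tok, _ in outcomes) / wall,
        }
    return results
```

A run against a local server might look like sweep(make_vllm_send("http://localhost:8000", "deepseek-v4-flash", "Hello", 256), [1, 2, 4, 8]); the level list should not exceed the profile's --max-num-seqs.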
Real-world workload sweep — 15-prompt suite
Beyond the synthetic decode bench, a 15-prompt real-world suite (chat, coding, retrieval, writing, planning, with system prompt + tools enabled, no max_tokens cap) averaged 84.48 tok/s at the 524k profile and 79.82 tok/s at the 640k profile, confirming that the headline numbers hold under realistic usage.
Public-benchmark snapshot
| Benchmark | Sample | Score |
|---|---|---|
| GSM8K (T=0, CoT-style, exact-match on #### N) | 100 | 93.0% (93/100) |
| HumanEval pass@1 (T=0, greedy, subprocess-exec tests) | 164 (full) | 96.3% (158/164) |
| MMLU (T=0, mixed subjects, max_tokens=128 to leave room for reasoning) | 100 | 53.0% (53/100) ¹ |
| Internal capability eval (math/code/reasoning/knowledge/instruction/longform/tools) | 14 | 13/14 (92.9%) |
The HumanEval result is the full 164-problem set with real unit-test execution (greedy T=0, content-only extraction, 60 s test timeout, sandboxed subprocess). Of the 6 failures: 3 were NameError (model named the function differently from the prompt's entry_point), 2 were AssertionError (real wrong answers), 1 was AttributeError (used list.add instead of list.append). Total wall time for the full run: 11.3 min at concurrency 2 on this hardware.
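The failure breakdown above (NameError, AssertionError, AttributeError) is the kind of result a subprocess-based runner produces. The sketch below is an illustrative runner, not the harness used for the reported score: it only isolates the candidate in a child Python process with a timeout, which is weaker than a real sandbox, and run_candidate is a hypothetical name.

```python
import subprocess
import sys


def run_candidate(code: str, test: str, timeout_s: float = 60.0) -> str:
    """Run generated code plus its unit test in a child interpreter.

    Returns "pass", "Timeout", or the exception class name from the traceback.
    """
    program = code + "\n\n" + test + "\n"
    try:
        proc = subprocess.run([sys.executable, "-c", program],
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "Timeout"
    if proc.returncode == 0:
        return "pass"
    lines = proc.stderr.strip().splitlines()
    last = lines[-1] if lines else "UnknownError"
    # The final traceback line looks like "NameError: name 'f' is not defined".
    return last.split(":", 1)[0].strip()


good = "def add(a, b):\n    return a + b"
wrong_name = "def plus(a, b):\n    return a + b"
check = "assert add(1, 2) == 3"

print(run_candidate(good, check))        # pass
print(run_candidate(wrong_name, check))  # NameError
```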
¹ The MMLU sample mixed all 57 subjects uniformly (incl. hard categories like formal_logic, college_*, professional_law). The model's main forward path is identical to pastapaul/DeepSeek-V4-Flash-W4A16-FP8, so headline numbers track that base — refer to base-model evals for tight numbers across full leaderboards.
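For the GSM8K row, "exact-match on #### N" means the final number after the #### marker is compared with the reference. A minimal scorer sketch under that reading follows; the regular expression and the comma stripping are assumptions, and the comparison is on the normalized string, so "3.5" and "3.50" would not match.

```python
import re
from typing import Optional

# Matches the number after a "####" marker, allowing a sign, commas, and decimals.
_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text: str) -> Optional[str]:
    """Return the last '#### N' answer in text with commas removed, or None."""
    matches = _ANSWER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def exact_match(prediction: str, reference: str) -> bool:
    """True when both texts contain a '#### N' answer and the answers are identical."""
    pred = extract_answer(prediction)
    ref = extract_answer(reference)
    return pred is not None and pred == ref


print(exact_match("She sells 9 * 8 = 72 eggs. #### 72", "... #### 72"))
```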
How to run (production canonical, on this hardware family)
vLLM does not load this model on a vanilla install. You need:
- The patched vLLM fork: jasl/vllm at b158e5001 (or ds4-sm120-experimental tip), with the Acti MTP patches applied (one-line additions to deepseek_v4_mtp.py; see the vLLM patches under "Acti additions" below).
- The DSV4-Flash-FP8 toolchain: kylesayrs/deepseek-ct cherry-pick, the pasta-paul packed_modules_mapping patch, torch 2.11.0+cu128, compressed-tensors >= 0.15.0.1.
- On RTX PRO 6000 Max-Q workstation cards (no NVLink): you must pass --disable-custom-all-reduce. vLLM's CustomAllreduce uses CUDA P2P that deadlocks on this PCIe-only topology, independent of NCCL's NCCL_P2P_DISABLE.
A complete reproducible workspace, including all the patches and a one-shot install script, is at:
https://github.com/pasta-paul/dsv4-flash-w4a16-fp8 (base model + serving stack)
Plus the additional Acti changes documented in the "Acti additions" section below.
Docker deployment (recommended for portability)
A reproducible Docker image build is provided in docker/. The image bakes in the patched vLLM fork, applies all three Acti MTP-loader patches inline via a deterministic patcher script, and exposes a vLLM OpenAI-compatible API on port 8000. Model weights are mounted as a volume so the image stays ~12 GB (instead of ~160 GB).
# 1. Clone the build assets
git clone https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 dsv4
cd dsv4/docker
# 2. Build the image (~25-45 min for CUDA kernel compile)
docker build -t dsv4-flash-acti-mtp:0.1.0 .
# 3. Download the model weights
hf download LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
--local-dir $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8
# 4. Run (the production 524k config is the default)
docker run --rm --gpus all \
--shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 \
-v $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:/models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:ro \
dsv4-flash-acti-mtp:0.1.0
Switch profile via env vars at run time. Two examples:
# High-concurrency 128k (8 concurrent users, 467 tok/s aggregate)
docker run ... \
-e MAX_MODEL_LEN=131072 -e MAX_NUM_SEQS=8 -e GPU_MEMORY_UTILIZATION=0.90 \
dsv4-flash-acti-mtp:0.1.0
# Medium 256k (4 concurrent users, 296 tok/s aggregate)
docker run ... \
-e MAX_MODEL_LEN=262144 -e MAX_NUM_SEQS=4 -e GPU_MEMORY_UTILIZATION=0.95 \
dsv4-flash-acti-mtp:0.1.0
To publish your build to a registry (Docker Hub / ghcr.io / private):
docker tag dsv4-flash-acti-mtp:0.1.0 <your-namespace>/dsv4-flash-acti-mtp:0.1.0
docker push <your-namespace>/dsv4-flash-acti-mtp:0.1.0
The HF repo holds the Dockerfile, entrypoint, patcher script, and a docker-compose example — not the compiled image binary itself (which exceeds practical HF LFS limits). Build it on the host that will serve. See docker/README.md for full details.
Validated 524k serve command (bare-metal vLLM)
vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
--served-model-name deepseek-v4-flash deepseek-v4-flash-mtp \
DSV4-W4A16-FP8 deepseek-ai/DeepSeek-V4-Flash \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 --block-size 256 \
--max-model-len 524288 \
--max-num-seqs 2 --max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.93 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 --enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--trust-remote-code \
--disable-custom-all-reduce \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--host 0.0.0.0 --port 8000
Required env (Acti's working set, includes the small-msg AR latency tuning that drops TTFT from 154 ms to 91 ms on Max-Q):
export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
export NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=0
export NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512
For shorter context with two streams concurrently, drop --max-model-len to 262144 and --max-num-seqs to 2 (still 2× concurrency at 1.7×).
Architecture (unchanged from the base model + MTP head added)
| Property | Value |
|---|---|
| Total parameters | 284 B (13 B activated) |
| Decoder layers | 43 + 1 MTP layer |
| Routed experts / layer | 256 (top-K = 6 with noaux_tc routing) + 1 shared expert |
| Hidden size | 4096 |
| Routed expert intermediate | 2048 |
| Vocab size | 129 280 |
| Max position embeddings | 1 048 576 |
| num_nextn_predict_layers | 1 |
| Quantized size on disk | ~146 GB (vs. ~543 GB BF16) |
Quantization scheme (per-tensor)
| Component | Format | Method |
|---|---|---|
| Routed experts (256 × 43 main layers) | W4A16 INT4 group=128 sym | GPTQ (pasta-paul, dampening_frac=0.1) |
| Routed experts (256 × MTP layer) | W4A16 INT4 group=128 sym | GPTQ (Acti, Frantar-style with Cholesky H⁻¹, damp=0.01) |
| Attention projections (q/kv/o, compressor, indexer) | FP8_BLOCK 128×128 | data-free (carries through from upstream) |
| MTP attention | FP8_BLOCK 128×128 | upstream MX-FP8 → vLLM-format FP8_BLOCK (Acti's renamed scale field) |
| Shared experts | BF16 | excluded |
| lm_head, embeddings, hc_*, norms, gates, attn_sinks, MTP e_proj/h_proj | BF16/FP32 | excluded |
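As a reference point for the "INT4 group=128 sym" rows, here is a pure-Python sketch of symmetric group-wise round-to-nearest quantization. It is illustrative only: the scale convention (max |w| / 7) is an assumption and may differ from the one compressed-tensors uses, and GPTQ additionally compensates rounding error across columns.

```python
from typing import List, Tuple


def quantize_group_sym_int4(weights: List[float],
                            group_size: int = 128) -> Tuple[List[int], List[float]]:
    """Quantize a weight row to INT4 codes in [-8, 7] with one scale per group."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # assumed convention
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes: List[int], scales: List[float], group_size: int = 128) -> List[float]:
    """Reconstruct approximate weights from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Tiny example with group_size=2 so the two groups get different scales.
codes, scales = quantize_group_sym_int4([7.0, -3.0, 14.0, 2.0], group_size=2)
print(codes, scales)
```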
How the MTP routed-expert GPTQ was done
- Boot pasta-paul base model in vLLM with the v1 RTN MTP block + an instrumentation patch that dumps the MTP layer's input arrays (previous_hidden_states, inputs_embeds) to disk per forward call. Run in --enforce-eager (compiled-mode + dynamo doesn't permit the file I/O hook in-graph).
- Send 256 ultrachat_200k prompts × max_tokens=256 through the server. ~17.7k MTP forward calls captured, 473k tokens total.
- Stop the server. Build a BF16 MTP block in pure PyTorch from the upstream FP8 master (deepseek-ai/DeepSeek-V4-Flash shard 46), dequantizing NVFP4 routed-expert weights via E2M1 lookup × MX-FP8 scale.
- Replay each captured (previous_hidden_states, inputs_embeds) through a simplified MTP forward (RMSNorms + h_proj/e_proj real; attention skipped, treated as identity-residual; FFN gate routes tokens to top-K experts; per-expert input activations gathered).
- For each of 256 experts × 3 projections (w1/w3/w2):
  - w1/w3 share a Hessian H computed from the FFN gate's routed inputs.
  - w2 Hessian is from silu(w1·x) ⊙ (w3·x) intermediate activations.
  - Run Frantar's GPTQ algorithm with Cholesky H⁻¹ updates (damp=0.01); per-row, per-128-element-group symmetric INT4 quantization.
- Pack to compressed-tensors W4A16 INT4 format using the canonical compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32 (8 sym-int4 per int32, LSB-first, +8 offset).
- Splice into a new shard, replacing the v1 RTN values; format byte-for-byte identical to the base ckpt's main-layer expert format.
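The packing layout named in the steps above (8 symmetric INT4 values per int32, LSB-first, +8 offset) can be illustrated in pure Python. This is a sketch of the layout as described here, not the compressed_tensors implementation, which operates on tensors; the function names are hypothetical.

```python
from typing import List


def pack_int4_to_int32(codes: List[int]) -> List[int]:
    """Pack INT4 codes in [-8, 7] into signed 32-bit words, first code in the low nibble."""
    if len(codes) % 8:
        raise ValueError("number of codes must be a multiple of 8")
    packed: List[int] = []
    for start in range(0, len(codes), 8):
        word = 0
        for i, code in enumerate(codes[start:start + 8]):
            if not -8 <= code <= 7:
                raise ValueError(f"code {code} is outside the INT4 range")
            word |= (code + 8) << (4 * i)  # +8 offset maps [-8, 7] to [0, 15]
        if word >= 1 << 31:                # reinterpret as signed int32
            word -= 1 << 32
        packed.append(word)
    return packed


def unpack_int32_to_int4(words: List[int]) -> List[int]:
    """Inverse of pack_int4_to_int32."""
    codes: List[int] = []
    for word in words:
        unsigned = word & 0xFFFFFFFF
        codes.extend(((unsigned >> (4 * i)) & 0xF) - 8 for i in range(8))
    return codes


sample = [-8, -1, 0, 1, 7, 3, -4, 2]
print(pack_int4_to_int32(sample))
```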
Per-expert calibration tokens: min=430, max=175,375, mean=11,094. The min is borderline for a 4096-dim Hessian; those experts will be slightly less optimal but the damping factor compensates. The simplification of skipping attention during replay is a pragmatic compromise vs. re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch (it gives a less-perfect input distribution but still beats RTN by a measurable margin).
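The core of a Frantar-style GPTQ pass is small enough to sketch in pure Python for tiny matrices. The code below is a simplified illustration, not run_gptq.py: it processes one weight row, uses a single scale for the whole row instead of 128-element groups, builds H as 2/n · Σ x xᵀ, and uses naive dense linear algebra. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def hessian_from_inputs(xs: List[List[float]]) -> Matrix:
    """H = (2 / n) * sum over captured inputs x of x x^T."""
    d = len(xs[0])
    return [[2.0 * sum(x[i] * x[j] for x in xs) / len(xs) for j in range(d)]
            for i in range(d)]


def invert(a: Matrix) -> Matrix:
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cholesky_upper(a: Matrix) -> Matrix:
    """Upper-triangular U with a = U^T U."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return [[low[j][i] for j in range(n)] for i in range(n)]


def quant(w: float, scale: float) -> float:
    """Round to the nearest symmetric INT4 grid point and return the dequantized value."""
    return max(-8, min(7, round(w / scale))) * scale


def gptq_row(w: List[float], hessian: Matrix, damp: float = 0.01) -> List[float]:
    """Quantize one weight row, pushing each rounding error onto later columns."""
    n = len(w)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    h = [[hessian[i][j] + (damp * mean_diag if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    u = cholesky_upper(invert(h))              # Cholesky factor of the damped H^-1
    scale = max(abs(v) for v in w) / 7.0 or 1.0
    w = list(w)
    q = [0.0] * n
    for i in range(n):
        q[i] = quant(w[i], scale)
        err = (w[i] - q[i]) / u[i][i]
        for j in range(i + 1, n):
            w[j] -= err * u[i][j]              # compensate in not-yet-quantized columns
    return q
```

With an identity Hessian the off-diagonal terms vanish and the result reduces to plain round-to-nearest, which is one way to see that the benefit over RTN comes entirely from correlations between input channels.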
Acti additions (vs pastapaul/DeepSeek-V4-Flash-W4A16-FP8)
- MTP weights present. 768 routed-expert tensors (mtp.0.ffn.experts.{0..255}.{w1,w2,w3}) packed in compressed-tensors W4A16 INT4 (group=128 sym), 5 attention projections at FP8_BLOCK with the field rename, 3 BF16 shared experts, and the BF16/FP32 norms / e_proj / h_proj / enorm / hnorm / attn_norm / attn_sink / gate / hc_* tensors. New shard: model-mtp-w4a16.safetensors (3.55 GB).
- Updated quantization_config.ignore that excludes the MTP non-quantized layer prefixes (layers.43.{e_proj, h_proj, shared_head.head, shared_experts.{w1,w2,w3}} and mtp_block.* aliases) from the W4A16/FP8 groups while keeping the routed experts in the W4A16 group_1 regex.
- vLLM patches (mirror these into your fork; total ~30 lines), all in vllm/model_executor/models/deepseek_v4_mtp.py:
  - Pass prefix=f"{prefix}.e_proj" / f"{prefix}.h_proj" to the ReplicatedLinear instances inside DeepSeekV4MultiTokenPredictorLayer.
  - Add class attribute packed_modules_mapping = {"fused_wqa_wkv": ["wq_a", "wkv"], "fused_wkv_wgate": ["wkv", "wgate"], "gate_up_proj": ["w1", "w3"]} to DeepSeekV4MTP so the MTP block's attention fuses into attn.fused_wqa_wkv.
  - In load_weights, when remapping .scale → .weight_scale* for non-expert tensors, use .weight_scale (not .weight_scale_inv), which is pasta-paul's compressed-tensors FP8_BLOCK convention.
- GPTQ quantization tool (mtp-local-requant/gptq/run_gptq.py in Acti's repro repo): standalone, CPU/single-GPU, no llmcompressor dependency. Reproducible from upstream shard 46 + the captured calibration dumps.
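The scale-field rename in the load_weights patch amounts to a key remapping. The following standalone sketch shows the idea only; it is not the vLLM patch itself, remap_scale_key is a hypothetical helper, and the tensor names in the example are made up for illustration.

```python
def remap_scale_key(name: str, is_expert: bool) -> str:
    """Rename a trailing '.scale' to '.weight_scale' for non-expert tensors.

    Expert tensors and all other keys are returned unchanged.
    """
    suffix = ".scale"
    if not is_expert and name.endswith(suffix):
        return name[: -len(suffix)] + ".weight_scale"
    return name


# Hypothetical key names, for illustration only.
print(remap_scale_key("mtp.0.attn.wq_a.scale", is_expert=False))
print(remap_scale_key("mtp.0.attn.wq_a.weight", is_expert=False))
```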
Provenance
This checkpoint was constructed by:
- Starting from pastapaul/DeepSeek-V4-Flash-W4A16-FP8 (4 safetensors shards, ~143 GB).
- Pulling shard 46 of deepseek-ai/DeepSeek-V4-Flash (the upstream FP8 master); that shard contains all 1575 mtp.0.* tensors.
- For each of the 768 MTP routed-expert tensors: dequantize NVFP4 → BF16, then run Frantar-style GPTQ using calibration activations captured from a live serving of pasta-paul + v1 RTN MTP. Pack via compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32.
- For each of the 5 MTP attention projections: keep upstream's FP8 weight as-is, decode the upstream MX-FP8 scale to BF16 and rename .scale → .weight_scale (pasta-paul convention).
- For BF16 / FP32 / shared-expert tensors: dequantize FP8 to BF16 where applicable, otherwise pass through.
- Write a single new shard model-mtp-w4a16.safetensors, hardlink the 4 base shards into the new dir, write a union safetensors index and an updated config with the rebuilt ignore list.
The full pipeline is reproducible in ~30 minutes total: ~25 min calibration capture (eager-mode serve + 256 prompts), ~3 min Hessian accumulation + GPTQ on a single GPU, ~2 min splice + manifest writeout.
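The "dequantize NVFP4 → BF16" step relies on the E2M1 value table (1 sign bit, 2 exponent bits, 1 mantissa bit). A minimal sketch of that lookup follows. The nibble order within a byte and the block size are assumptions, and the per-block scales are taken as already-decoded floats, so the upstream scale encoding is deliberately left out.

```python
from typing import List

# Magnitudes for the 3 low bits of an E2M1 code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code to a float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def dequantize_fp4(packed: bytes, scales: List[float], block_size: int) -> List[float]:
    """Unpack two FP4 values per byte and apply one scale per block of values."""
    values: List[float] = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))  # assumed: low nibble is the first element
        values.append(decode_e2m1(byte >> 4))
    return [v * scales[i // block_size] for i, v in enumerate(values)]


print(dequantize_fp4(bytes([0x21, 0xF7]), scales=[2.0, 0.5], block_size=2))
```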
Files in this repo
- model-{00001..00004}-of-00004.safetensors: pasta-paul's base shards (W4A16 routed experts, FP8_BLOCK attention, BF16 shared experts), redistributed under Apache 2.0 with attribution.
- model-mtp-w4a16.safetensors: Acti's MTP layer (3.55 GB; GPTQ-quantized in v2).
- model.safetensors.index.json: union index.
- config.json: base config + rebuilt quantization_config.ignore for MTP submodules.
- MTP_QUANT_MANIFEST.{json,md}: declares is_final_quality_preserving: true (real GPTQ).
- recipe.yaml: pasta-paul's GPTQ recipe (for the base layers; included for transparency).
- tokenizer.json, tokenizer_config.json, generation_config.json: unchanged from base.
Limitations
- Hardware: validated only on 2× RTX PRO 6000 Blackwell Max-Q (sm_120). Should also work on RTX PRO 6000 Server, DGX Spark / GB10, and 8× H200 (same code path as base). Without --disable-custom-all-reduce, Max-Q deadlocks at post-graph eager warmup.
- TP: TP=2 only. TP=1 OOMs on a single 96 GB GPU; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
- MTP quality: GPTQ pipeline used a simplified MTP forward that skipped the attention layer in calibration (used identity-residual). This is a known approximation chosen to avoid re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch. It captures the post-FFN-norm input distribution to each expert, but the attention's contribution to that distribution is approximated. The output produced by the served model is unaffected (the main model verifies every accepted draft); only the MTP draft accept rate is impacted. Lift on this hardware vs. v1 RTN: about +2% decode TPS on the long-context profile.
- num_speculative_tokens: capped at 1 because DSV4-Flash ships exactly one MTP head (num_nextn_predict_layers=1). Higher values would not produce more draft tokens.
- Reasoning parser: with --reasoning-parser deepseek_v4, model output is split into content and reasoning_content, so applications that read only content will see empty strings for "thinking" responses. Adjust accordingly.
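For the reasoning-parser caveat, a client can read both fields of the returned message. The helper below is one possible way to handle it, assuming the response message is a plain dict with the content and reasoning_content keys named above; visible_text is a hypothetical name, and the fallback policy is a design choice rather than anything the server prescribes.

```python
from typing import Any, Dict


def visible_text(message: Dict[str, Any], include_reasoning: bool = False) -> str:
    """Return text to show for a chat message that may carry separate reasoning.

    Falls back to reasoning_content when content is empty, so "thinking"
    responses do not surface as empty strings.
    """
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    if include_reasoning and reasoning and content:
        return f"{reasoning}\n\n{content}"
    return content or reasoning


print(visible_text({"content": "", "reasoning_content": "Let me check the units."}))
print(visible_text({"content": "42", "reasoning_content": "6 * 7"}))
```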
Future work
A higher-fidelity GPTQ pass on the MTP block would re-implement DSV4 hybrid attention (CSA+HCA) in pure PyTorch (or hook it directly out of vLLM's compiled forward) so calibration sees the real post-attention residual distribution rather than the identity-residual approximation we use today. Estimated additional lift: ~+5-10% decode TPS, requiring a ~1-day code change. Documented in the project's MTP_LOCAL_REQUANT_STATUS.md follow-ups.
Citation / attribution
If this model helps you, please cite all three:
- DeepSeek-AI for deepseek-ai/DeepSeek-V4-Flash (the source model).
- pasta-paul for pastapaul/DeepSeek-V4-Flash-W4A16-FP8: the W4A16 GPTQ quantization and the validated jasl/vllm serving stack (repo).
- Acti for the MTP-head requantization (RTN v1, GPTQ v2) and the vLLM-MTP loader patches in this repo.
License: Apache 2.0 (matching upstream).