DeepSeek-V4-Flash · MTP · W4A16-FP8 · Acti

MTP-enabled (Multi-Token Prediction) variant of pastapaul/DeepSeek-V4-Flash-W4A16-FP8, prepared by Acti for vLLM tensor-parallel deployment at TP=2 with self-speculative decoding turned on.

The base model is unchanged. We added back the DSV4 MTP head (which the original quantization stripped because HF transformers' DSV4 modeling code drops mtp.* keys at load time) so vLLM can run --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and get a real wall-clock decode speedup.

v2 release: real GPTQ quantization on the MTP routed experts (replaces the v1 RTN loader-format experiment). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens captured from the running pasta-paul model, processed by a Frantar-style GPTQ pass with Cholesky H⁻¹. Per-expert mean: ~11k calibration tokens, min 430, max 175k. Manifest now declares is_final_quality_preserving: true.

Results — single-stream decode TPS, greedy, vs no-MTP baseline

Hardware: 2× NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (sm_120, 96 GB ea., no NVLink, NUMA NODE PCIe), driver 580.126.09, NCCL 2.28.9, jasl/vllm b158e5001 + cherry-picks + Acti's MTP-loader patches.

Headline decode TPS — base vs Acti MTP at 524k and 128k single-stream

Profile Decode TPS TTFT Δ vs base
Base (pasta-paul, no MTP), 524k 52.85 91 ms 0% (reference)
This model (v2 GPTQ), 524k 85.52 155 ms +62% (1.62×)
This model (v2 GPTQ), 128k single-stream ~111 ~310 ms +110% (2.10×)

(v1 RTN release was 83.97 tok/s @ 524k; v2 GPTQ adds ~+2% on top of that. Most of the speedup over base comes from MTP self-speculation; the GPTQ vs RTN delta is the MTP draft acceptance-rate improvement from better expert weights.)

Concurrency profiles — real measurements

The single-stream numbers above are one point on a curve. Below are the multi-user concurrency results from a thread-pool sweep on the same hardware. Different --max-model-len profiles trade off context length against how many concurrent users you can serve.

Aggregate decode TPS by profile

Profile Max concurrent users Per-stream TPS at max Aggregate TPS Recommended for
128k (--max-model-len 131072 --max-num-seqs 8 --gpu-memory-utilization 0.90) 8 57 467 High-concurrency chat / agent fleet
256k (--max-model-len 262144 --max-num-seqs 4 --gpu-memory-utilization 0.95) 4 76 296 Medium-concurrency, longer docs / RAG
524k (--max-model-len 524288 --max-num-seqs 3 --gpu-memory-utilization 0.93) 3 46 138 Single/few users, max context

Per-stream scaling within each profile:

Per-stream TPS vs concurrent streams, by profile

N concurrent 128k profile 256k profile 524k profile
1 75 73 87
2 80 75 62
3 (anomalous) (anomalous) 46
4 76 76
8 57

(The N=3 anomaly is reproducible — likely a vLLM MTP-draft batching artifact at odd counts. Even N values scale cleanly.)

Real-world workload sweep — 15-prompt suite

Beyond the synthetic decode bench, a 15-prompt real-world suite (chat, coding, retrieval, writing, planning, with system prompt + tools enabled, no max_tokens cap) averaged 84.48 tok/s at the 524k profile and 79.82 tok/s at the 640k profile — confirming the headline numbers hold under realistic usage:

Real-world workload TPS — 524k vs 640k

Public-benchmark snapshot

Public benchmark accuracy — GSM8K, HumanEval, MMLU

Benchmark Sample Score
GSM8K (T=0, COT-style, exact-match on #### N) 100 93.0% (93/100)
HumanEval pass@1 (T=0, greedy, subprocess-exec tests) 164 (full) 96.3% (158/164)
MMLU (T=0, mixed subjects, max_tokens=128 to leave room for reasoning) 100 53.0% (53/100) ¹
Internal capability eval (math/code/reasoning/knowledge/instruction/longform/tools) 14 13/14 (92.9%)

The HumanEval result is the full 164-problem set with real unit-test execution (greedy T=0, content-only extraction, 60 s test timeout, sandboxed subprocess). Of the 6 failures: 3 were NameError (model named the function differently from the prompt's entry_point), 2 were AssertionError (real wrong answers), 1 was AttributeError (used list.add instead of list.append). Total wall time for the full run: 11.3 min at concurrency 2 on this hardware.

¹ The MMLU sample mixed all 57 subjects uniformly (incl. hard categories like formal_logic, college_*, professional_law). The model's main forward path is identical to pastapaul/DeepSeek-V4-Flash-W4A16-FP8, so headline numbers track that base — refer to base-model evals for tight numbers across full leaderboards.

How to run (production canonical, on this hardware family)

vLLM does not load this model on a vanilla install. You need:

  1. The patched vLLM fork: jasl/vllm at b158e5001 (or ds4-sm120-experimental tip), with the Acti MTP patches applied (one-line additions to deepseek_v4_mtp.py — see Patches below).
  2. The DSV4-Flash-FP8 toolchain: kylesayrs/deepseek-ct cherry-pick, the pasta-paul packed_modules_mapping patch, torch 2.11.0+cu128, compressed-tensors >= 0.15.0.1.
  3. On RTX PRO 6000 Max-Q workstation cards (no NVLink): you must pass --disable-custom-all-reduce — vLLM's CustomAllreduce uses CUDA P2P that deadlocks on this PCIe-only topology, independent of NCCL's NCCL_P2P_DISABLE.

A complete reproducible workspace, including all the patches and a one-shot install script, is at:

https://github.com/pasta-paul/dsv4-flash-w4a16-fp8 (base model + serving stack)

Plus the additional Acti changes documented in the "Acti additions" section below.

Docker deployment (recommended for portability)

A reproducible Docker image build is provided in docker/. The image bakes in the patched vLLM fork, applies all three Acti MTP-loader patches inline via a deterministic patcher script, and exposes a vLLM OpenAI-compatible API on port 8000. Model weights are mounted as a volume so the image stays ~12 GB (instead of ~160 GB).

# 1. Clone the build assets
git clone https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 dsv4
cd dsv4/docker

# 2. Build the image (~25-45 min for CUDA kernel compile)
docker build -t dsv4-flash-acti-mtp:0.1.0 .

# 3. Download the model weights
hf download LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --local-dir $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

# 4. Run (the production 524k config is the default)
docker run --rm --gpus all \
  --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:/models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:ro \
  dsv4-flash-acti-mtp:0.1.0

Switch profile via env vars at run time. Two examples:

# High-concurrency 128k (8 concurrent users, 467 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=131072 -e MAX_NUM_SEQS=8 -e GPU_MEMORY_UTILIZATION=0.90 \
  dsv4-flash-acti-mtp:0.1.0

# Medium 256k (4 concurrent users, 296 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=262144 -e MAX_NUM_SEQS=4 -e GPU_MEMORY_UTILIZATION=0.95 \
  dsv4-flash-acti-mtp:0.1.0

To publish your build to a registry (Docker Hub / ghcr.io / private):

docker tag dsv4-flash-acti-mtp:0.1.0 <your-namespace>/dsv4-flash-acti-mtp:0.1.0
docker push <your-namespace>/dsv4-flash-acti-mtp:0.1.0

The HF repo holds the Dockerfile, entrypoint, patcher script, and a docker-compose example — not the compiled image binary itself (which exceeds practical HF LFS limits). Build it on the host that will serve. See docker/README.md for full details.

Validated 524k serve command (bare-metal vLLM)

vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --served-model-name deepseek-v4-flash deepseek-v4-flash-mtp \
                       DSV4-W4A16-FP8 deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 --block-size 256 \
  --max-model-len 524288 \
  --max-num-seqs 2 --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.93 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --host 0.0.0.0 --port 8000

Required env (Acti's working set, includes the small-msg AR latency tuning that drops TTFT from 154 ms to 91 ms on Max-Q):

export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
export NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=0
export NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512

For shorter context with two streams concurrently, drop --max-model-len to 262144 and --max-num-seqs to 2 (still 2× concurrency at 1.7×).

Architecture (unchanged from the base model + MTP head added)

Property Value
Total parameters 284 B (13 B activated)
Decoder layers 43 + 1 MTP layer
Routed experts / layer 256 (top-K = 6 with noaux_tc routing) + 1 shared expert
Hidden size 4096
Routed expert intermediate 2048
Vocab size 129 280
Max position embeddings 1 048 576
num_nextn_predict_layers 1
Quantized size on disk ~146 GB (vs. ~543 GB BF16)

Quantization scheme (per-tensor)

Component Format Method
Routed experts (256 × 43 main layers) W4A16 INT4 group=128 sym GPTQ (pasta-paul, dampening_frac=0.1)
Routed experts (256 × MTP layer) W4A16 INT4 group=128 sym GPTQ (Acti, Frantar-style with Cholesky H⁻¹, damp=0.01)
Attention projections (q/kv/o, compressor, indexer) FP8_BLOCK 128×128 data-free (carries through from upstream)
MTP attention FP8_BLOCK 128×128 upstream MX-FP8 → vLLM-format FP8_BLOCK (Acti's renamed scale field)
Shared experts BF16 excluded
lm_head, embeddings, hc_*, norms, gates, attn_sinks, MTP e_proj/h_proj BF16/FP32 excluded

How the MTP routed-expert GPTQ was done

  1. Boot pasta-paul base model in vLLM with the v1 RTN MTP block + an instrumentation patch that dumps the MTP layer's input arrays (previous_hidden_states, inputs_embeds) to disk per forward call. Run in --enforce-eager (compiled-mode + dynamo doesn't permit the file I/O hook in-graph).
  2. Send 256 ultrachat_200k prompts × max_tokens=256 through the server. ~17.7k MTP forward calls captured, 473k tokens total.
  3. Stop the server. Build a BF16 MTP block in pure PyTorch from the upstream FP8 master (deepseek-ai/DeepSeek-V4-Flash shard 46), dequantizing NVFP4 routed-expert weights via E2M1 lookup × MX-FP8 scale.
  4. Replay each captured (previous_hidden_states, inputs_embeds) through a simplified MTP forward (RMSNorms + h_proj/e_proj real; attention skipped, treated as identity-residual; FFN gate routes tokens to top-K experts; per-expert input activations gathered).
  5. For each of 256 experts × 3 projections (w1/w3/w2):
    • w1/w3 share a Hessian H computed from the FFN gate's routed inputs.
    • w2 Hessian is from silu(w1·x) ⊙ (w3·x) intermediate activations.
    • Run Frantar's GPTQ algorithm with Cholesky H⁻¹ updates (damp=0.01); per-row, per-128-element-group symmetric INT4 quantization.
  6. Pack to compressed-tensors W4A16 INT4 format using the canonical compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32 (8 sym-int4 per int32, LSB-first, +8 offset).
  7. Splice into a new shard, replacing the v1 RTN values; format byte-for-byte identical to the base ckpt's main-layer expert format.

Per-expert calibration tokens: min=430, max=175,375, mean=11,094. The min is borderline for a 4096-dim Hessian; those experts will be slightly less optimal but the damping factor compensates. The simplification of skipping attention during replay is a pragmatic compromise vs. re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch (it gives a less-perfect input distribution but still beats RTN by a measurable margin).

Acti additions (vs pastapaul/DeepSeek-V4-Flash-W4A16-FP8)

  1. MTP weights present. 768 routed-expert tensors (mtp.0.ffn.experts.{0..255}.{w1,w2,w3}) packed in compressed-tensors W4A16 INT4 (group=128 sym), 5 attention projections at FP8_BLOCK with the field rename, 3 BF16 shared experts, and the BF16/FP32 norms / e_proj / h_proj / enorm / hnorm / attn_norm / attn_sink / gate / hc_* tensors. New shard: model-mtp-w4a16.safetensors (3.55 GB).
  2. Updated quantization_config.ignore that excludes the MTP non-quantized layer prefixes (layers.43.{e_proj, h_proj, shared_head.head, shared_experts.{w1,w2,w3}} and mtp_block.* aliases) from the W4A16/FP8 groups while keeping the routed experts in the W4A16 group_1 regex.
  3. vLLM patches (mirror these into your fork; total ~30 lines):
    • vllm/model_executor/models/deepseek_v4_mtp.py:
      • Pass prefix=f"{prefix}.e_proj" / {prefix}.h_proj to the ReplicatedLinear instances inside DeepSeekV4MultiTokenPredictorLayer.
      • Add class attribute packed_modules_mapping = {"fused_wqa_wkv": ["wq_a", "wkv"], "fused_wkv_wgate": ["wkv", "wgate"], "gate_up_proj": ["w1", "w3"]} to DeepSeekV4MTP so the MTP block's attention fuses into attn.fused_wqa_wkv.
      • In load_weights, when remapping .scale.weight_scale* for non-expert tensors, use .weight_scale (not .weight_scale_inv) — pasta-paul's compressed-tensors FP8_BLOCK convention.
  4. GPTQ quantization tool (mtp-local-requant/gptq/run_gptq.py in Acti's repro repo) — standalone, CPU/single-GPU, no llmcompressor dependency. Reproducible from upstream shard 46 + the captured calibration dumps.

Provenance

This checkpoint was constructed by:

  1. Starting from pastapaul/DeepSeek-V4-Flash-W4A16-FP8 (4 safetensors shards, ~143 GB).
  2. Pulling shard 46 of deepseek-ai/DeepSeek-V4-Flash (the upstream FP8 master) — that shard contains all 1575 mtp.0.* tensors.
  3. For each of the 768 MTP routed-expert tensors: dequantize NVFP4 → BF16, then run Frantar-style GPTQ using calibration activations captured from a live serving of pasta-paul + v1 RTN MTP. Pack via compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32.
  4. For each of the 5 MTP attention projections: keep upstream's FP8 weight as-is, decode the upstream MX-FP8 scale to BF16 and rename .scale.weight_scale (pasta-paul convention).
  5. For BF16 / FP32 / shared-expert tensors: dequantize FP8 to BF16 where applicable, otherwise pass through.
  6. Write a single new shard model-mtp-w4a16.safetensors, hardlink the 4 base shards into the new dir, write a union safetensors index and an updated config with the rebuilt ignore list.

The full pipeline is reproducible in ~30 minutes total: ~25 min calibration capture (eager-mode serve + 256 prompts), ~3 min Hessian accumulation + GPTQ on a single GPU, ~2 min splice + manifest writeout.

Files in this repo

  • model-{00001..00004}-of-00004.safetensors — pasta-paul's base shards (W4A16 routed experts, FP8_BLOCK attention, BF16 shared experts) — redistributed under Apache 2.0 with attribution.
  • model-mtp-w4a16.safetensors — Acti's MTP layer (3.55 GB; GPTQ-quantized in v2).
  • model.safetensors.index.json — union index.
  • config.json — base config + rebuilt quantization_config.ignore for MTP submodules.
  • MTP_QUANT_MANIFEST.{json,md} — declares is_final_quality_preserving: true (real GPTQ).
  • recipe.yaml — pasta-paul's GPTQ recipe (for the base layers; included for transparency).
  • tokenizer.json, tokenizer_config.json, generation_config.json — unchanged from base.

Limitations

  • Hardware: validated only on 2× RTX PRO 6000 Blackwell Max-Q (sm_120). Should also work on RTX PRO 6000 Server, DGX Spark / GB10, and 8× H200 — same code path as base. Without --disable-custom-all-reduce, Max-Q deadlocks at post-graph eager warmup.
  • TP: TP=2 only. TP=1 OOMs on a single 96 GB GPU; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
  • MTP quality: GPTQ pipeline used a simplified MTP forward that skipped the attention layer in calibration (used identity-residual). This is a known approximation chosen to avoid re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch — it captures the post-FFN-norm input distribution to each expert, but the attention's contribution to that distribution is approximated. The output produced by the served model is unaffected (the main model verifies every accepted draft); only the MTP draft accept rate is impacted. Lift on this hardware vs. v1 RTN: ~+2% decode TPS on long-context profile.
  • num_speculative_tokens: capped at 1 because DSV4-Flash ships exactly one MTP head (num_nextn_predict_layers=1). Higher values would not produce more draft tokens.
  • Reasoning parser: with --reasoning-parser deepseek_v4, model output is split into content and reasoning_content — applications that read only content will see empty strings for "thinking" responses. Adjust accordingly.

Future work

A higher-fidelity GPTQ pass on the MTP block would re-implement DSV4 hybrid attention (CSA+HCA) in pure PyTorch (or hook it directly out of vLLM's compiled forward) so calibration sees the real post-attention residual distribution rather than the identity-residual approximation we use today. Estimated additional lift: ~+5-10% decode TPS, requiring a ~1-day code change. Documented in the project's MTP_LOCAL_REQUANT_STATUS.md follow-ups.

Citation / attribution

If this model helps you, please cite both:

  • DeepSeek-AI for deepseek-ai/DeepSeek-V4-Flash — the source model.
  • pasta-paul for pastapaul/DeepSeek-V4-Flash-W4A16-FP8 — the W4A16 GPTQ quantization and the validated jasl/vllm serving stack (repo).
  • Acti for the MTP-head requantization (RTN v1, GPTQ v2) and the vLLM-MTP loader patches in this repo.

License: Apache 2.0 (matching upstream).

Downloads last month
2,723
Safetensors
Model size
44B params
Tensor type
I64
·
F32
·
I32
·
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

Quantized
(1)
this model