DeepSeek-V4-Flash · MTP · W4A16-FP8 · Acti

MTP-enabled (Multi-Token Prediction) variant of pastapaul/DeepSeek-V4-Flash-W4A16-FP8, prepared by Acti for vLLM tensor-parallel deployment at TP=2 with self-speculative decoding turned on.

The base model is unchanged. We added back the DSV4 MTP head (which the original quantization stripped because HF transformers' DSV4 modeling code drops mtp.* keys at load time) so vLLM can run --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and get a real wall-clock decode speedup.

✅ v2 release: real GPTQ quantization on the MTP routed experts (replaces the v1 RTN loader-format experiment). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens captured from the running pasta-paul model, processed by a Frantar-style GPTQ pass with Cholesky H⁻¹. Per-expert mean: ~11k calibration tokens, min 430, max 175k. Manifest now declares is_final_quality_preserving: true.

Results — single-stream decode TPS, greedy, vs no-MTP baseline

Hardware: 2× NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (sm_120, 96 GB ea., no NVLink, NUMA NODE PCIe), driver 580.126.09, NCCL 2.28.9, jasl/vllm b158e5001 + cherry-picks + Acti's MTP-loader patches.

Profile	Decode TPS	TTFT	Δ vs base
Base (pasta-paul, no MTP), 524k	52.85	91 ms	0% (reference)
This model (v2 GPTQ), 524k	85.52	155 ms	+62% (1.62×)
This model (v2 GPTQ), 128k single-stream	~111	~310 ms	+110% (2.10×)

(v1 RTN release was 83.97 tok/s @ 524k; v2 GPTQ adds ~+2% on top of that. Most of the speedup over base comes from MTP self-speculation; the GPTQ vs RTN delta is the MTP draft acceptance-rate improvement from better expert weights.)

Concurrency profiles — real measurements

The single-stream numbers above are one point on a curve. Below are the multi-user concurrency results from a thread-pool sweep on the same hardware. Different --max-model-len profiles trade off context length against how many concurrent users you can serve.

Profile	Max concurrent users	Per-stream TPS at max	Aggregate TPS	Recommended for
128k (`--max-model-len 131072 --max-num-seqs 8 --gpu-memory-utilization 0.90`)	8	57	467	High-concurrency chat / agent fleet
256k (`--max-model-len 262144 --max-num-seqs 4 --gpu-memory-utilization 0.95`)	4	76	296	Medium-concurrency, longer docs / RAG
524k (`--max-model-len 524288 --max-num-seqs 3 --gpu-memory-utilization 0.93`)	3	46	138	Single/few users, max context

Per-stream scaling within each profile:

N concurrent	128k profile	256k profile	524k profile
1	75	73	87
2	80	75	62
3	(anomalous)	(anomalous)	46
4	76	76	—
8	57	—	—

(The N=3 anomaly is reproducible — likely a vLLM MTP-draft batching artifact at odd counts. Even N values scale cleanly.)

Real-world workload sweep — 15-prompt suite

Beyond the synthetic decode bench, a 15-prompt real-world suite (chat, coding, retrieval, writing, planning, with system prompt + tools enabled, no max_tokens cap) averaged 84.48 tok/s at the 524k profile and 79.82 tok/s at the 640k profile — confirming the headline numbers hold under realistic usage:

Public-benchmark snapshot

Benchmark	Sample	Score
GSM8K (T=0, COT-style, exact-match on `#### N`)	100	93.0% (93/100)
HumanEval pass@1 (T=0, greedy, subprocess-exec tests)	164 (full)	96.3% (158/164)
MMLU (T=0, mixed subjects, max_tokens=128 to leave room for reasoning)	100	53.0% (53/100) ¹
Internal capability eval (math/code/reasoning/knowledge/instruction/longform/tools)	14	13/14 (92.9%)

The HumanEval result is the full 164-problem set with real unit-test execution (greedy T=0, content-only extraction, 60 s test timeout, sandboxed subprocess). Of the 6 failures: 3 were NameError (model named the function differently from the prompt's entry_point), 2 were AssertionError (real wrong answers), 1 was AttributeError (used list.add instead of list.append). Total wall time for the full run: 11.3 min at concurrency 2 on this hardware.

¹ The MMLU sample mixed all 57 subjects uniformly (incl. hard categories like formal_logic, college_*, professional_law). The model's main forward path is identical to pastapaul/DeepSeek-V4-Flash-W4A16-FP8, so headline numbers track that base — refer to base-model evals for tight numbers across full leaderboards.

How to run (production canonical, on this hardware family)

vLLM does not load this model on a vanilla install. You need:

The patched vLLM fork: jasl/vllm at b158e5001 (or ds4-sm120-experimental tip), with the Acti MTP patches applied (one-line additions to deepseek_v4_mtp.py — see Patches below).
The DSV4-Flash-FP8 toolchain: kylesayrs/deepseek-ct cherry-pick, the pasta-paul packed_modules_mapping patch, torch 2.11.0+cu128, compressed-tensors >= 0.15.0.1.
On RTX PRO 6000 Max-Q workstation cards (no NVLink): you must pass --disable-custom-all-reduce — vLLM's CustomAllreduce uses CUDA P2P that deadlocks on this PCIe-only topology, independent of NCCL's NCCL_P2P_DISABLE.

A complete reproducible workspace, including all the patches and a one-shot install script, is at:

https://github.com/pasta-paul/dsv4-flash-w4a16-fp8 (base model + serving stack)

Plus the additional Acti changes documented in the "Acti additions" section below.

Docker deployment (recommended for portability)

A reproducible Docker image build is provided in docker/. The image bakes in the patched vLLM fork, applies all three Acti MTP-loader patches inline via a deterministic patcher script, and exposes a vLLM OpenAI-compatible API on port 8000. Model weights are mounted as a volume so the image stays ~12 GB (instead of ~160 GB).

# 1. Clone the build assets
git clone https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 dsv4
cd dsv4/docker

# 2. Build the image (~25-45 min for CUDA kernel compile)
docker build -t dsv4-flash-acti-mtp:0.1.0 .

# 3. Download the model weights
hf download LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --local-dir $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

# 4. Run (the production 524k config is the default)
docker run --rm --gpus all \
  --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:/models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:ro \
  dsv4-flash-acti-mtp:0.1.0

Switch profile via env vars at run time. Two examples:

# High-concurrency 128k (8 concurrent users, 467 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=131072 -e MAX_NUM_SEQS=8 -e GPU_MEMORY_UTILIZATION=0.90 \
  dsv4-flash-acti-mtp:0.1.0

# Medium 256k (4 concurrent users, 296 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=262144 -e MAX_NUM_SEQS=4 -e GPU_MEMORY_UTILIZATION=0.95 \
  dsv4-flash-acti-mtp:0.1.0

To publish your build to a registry (Docker Hub / ghcr.io / private):

docker tag dsv4-flash-acti-mtp:0.1.0 <your-namespace>/dsv4-flash-acti-mtp:0.1.0
docker push <your-namespace>/dsv4-flash-acti-mtp:0.1.0

The HF repo holds the Dockerfile, entrypoint, patcher script, and a docker-compose example — not the compiled image binary itself (which exceeds practical HF LFS limits). Build it on the host that will serve. See docker/README.md for full details.

Validated 524k serve command (bare-metal vLLM)

vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --served-model-name deepseek-v4-flash deepseek-v4-flash-mtp \
                       DSV4-W4A16-FP8 deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 --block-size 256 \
  --max-model-len 524288 \
  --max-num-seqs 2 --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.93 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --host 0.0.0.0 --port 8000

Required env (Acti's working set, includes the small-msg AR latency tuning that drops TTFT from 154 ms to 91 ms on Max-Q):

export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
export NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=0
export NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512

For shorter context with two streams concurrently, drop --max-model-len to 262144 and --max-num-seqs to 2 (still 2× concurrency at 1.7×).

Architecture (unchanged from the base model + MTP head added)

Property	Value
Total parameters	284 B (13 B activated)
Decoder layers	43 + 1 MTP layer
Routed experts / layer	256 (top-K = 6 with `noaux_tc` routing) + 1 shared expert
Hidden size	4096
Routed expert intermediate	2048
Vocab size	129 280
Max position embeddings	1 048 576
`num_nextn_predict_layers`	1
Quantized size on disk	~146 GB (vs. ~543 GB BF16)

Quantization scheme (per-tensor)

Component	Format	Method
Routed experts (256 × 43 main layers)	W4A16 INT4 group=128 sym	GPTQ (pasta-paul, `dampening_frac=0.1`)
Routed experts (256 × MTP layer)	W4A16 INT4 group=128 sym	GPTQ (Acti, Frantar-style with Cholesky H⁻¹, damp=0.01)
Attention projections (q/kv/o, compressor, indexer)	FP8_BLOCK 128×128	data-free (carries through from upstream)
MTP attention	FP8_BLOCK 128×128	upstream MX-FP8 → vLLM-format FP8_BLOCK (Acti's renamed scale field)
Shared experts	BF16	excluded
`lm_head`, embeddings, hc_*, norms, gates, attn_sinks, MTP `e_proj`/`h_proj`	BF16/FP32	excluded

How the MTP routed-expert GPTQ was done

Boot pasta-paul base model in vLLM with the v1 RTN MTP block + an instrumentation patch that dumps the MTP layer's input arrays (previous_hidden_states, inputs_embeds) to disk per forward call. Run in --enforce-eager (compiled-mode + dynamo doesn't permit the file I/O hook in-graph).
Send 256 ultrachat_200k prompts × max_tokens=256 through the server. ~17.7k MTP forward calls captured, 473k tokens total.
Stop the server. Build a BF16 MTP block in pure PyTorch from the upstream FP8 master (deepseek-ai/DeepSeek-V4-Flash shard 46), dequantizing NVFP4 routed-expert weights via E2M1 lookup × MX-FP8 scale.
Replay each captured (previous_hidden_states, inputs_embeds) through a simplified MTP forward (RMSNorms + h_proj/e_proj real; attention skipped, treated as identity-residual; FFN gate routes tokens to top-K experts; per-expert input activations gathered).
For each of 256 experts × 3 projections (w1/w3/w2):
- w1/w3 share a Hessian H computed from the FFN gate's routed inputs.
- w2 Hessian is from silu(w1·x) ⊙ (w3·x) intermediate activations.
- Run Frantar's GPTQ algorithm with Cholesky H⁻¹ updates (damp=0.01); per-row, per-128-element-group symmetric INT4 quantization.
Pack to compressed-tensors W4A16 INT4 format using the canonical compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32 (8 sym-int4 per int32, LSB-first, +8 offset).
Splice into a new shard, replacing the v1 RTN values; format byte-for-byte identical to the base ckpt's main-layer expert format.

Per-expert calibration tokens: min=430, max=175,375, mean=11,094. The min is borderline for a 4096-dim Hessian; those experts will be slightly less optimal but the damping factor compensates. The simplification of skipping attention during replay is a pragmatic compromise vs. re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch (it gives a less-perfect input distribution but still beats RTN by a measurable margin).

Acti additions (vs `pastapaul/DeepSeek-V4-Flash-W4A16-FP8`)

MTP weights present. 768 routed-expert tensors (mtp.0.ffn.experts.{0..255}.{w1,w2,w3}) packed in compressed-tensors W4A16 INT4 (group=128 sym), 5 attention projections at FP8_BLOCK with the field rename, 3 BF16 shared experts, and the BF16/FP32 norms / e_proj / h_proj / enorm / hnorm / attn_norm / attn_sink / gate / hc_* tensors. New shard: model-mtp-w4a16.safetensors (3.55 GB).
Updated quantization_config.ignore that excludes the MTP non-quantized layer prefixes (layers.43.{e_proj, h_proj, shared_head.head, shared_experts.{w1,w2,w3}} and mtp_block.* aliases) from the W4A16/FP8 groups while keeping the routed experts in the W4A16 group_1 regex.
vLLM patches (mirror these into your fork; total ~30 lines):
- vllm/model_executor/models/deepseek_v4_mtp.py:
  - Pass prefix=f"{prefix}.e_proj" / {prefix}.h_proj to the ReplicatedLinear instances inside DeepSeekV4MultiTokenPredictorLayer.
  - Add class attribute packed_modules_mapping = {"fused_wqa_wkv": ["wq_a", "wkv"], "fused_wkv_wgate": ["wkv", "wgate"], "gate_up_proj": ["w1", "w3"]} to DeepSeekV4MTP so the MTP block's attention fuses into attn.fused_wqa_wkv.
  - In load_weights, when remapping .scale → .weight_scale* for non-expert tensors, use .weight_scale (not .weight_scale_inv) — pasta-paul's compressed-tensors FP8_BLOCK convention.
GPTQ quantization tool (mtp-local-requant/gptq/run_gptq.py in Acti's repro repo) — standalone, CPU/single-GPU, no llmcompressor dependency. Reproducible from upstream shard 46 + the captured calibration dumps.

Provenance

This checkpoint was constructed by:

Starting from pastapaul/DeepSeek-V4-Flash-W4A16-FP8 (4 safetensors shards, ~143 GB).
Pulling shard 46 of deepseek-ai/DeepSeek-V4-Flash (the upstream FP8 master) — that shard contains all 1575 mtp.0.* tensors.
For each of the 768 MTP routed-expert tensors: dequantize NVFP4 → BF16, then run Frantar-style GPTQ using calibration activations captured from a live serving of pasta-paul + v1 RTN MTP. Pack via compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32.
For each of the 5 MTP attention projections: keep upstream's FP8 weight as-is, decode the upstream MX-FP8 scale to BF16 and rename .scale → .weight_scale (pasta-paul convention).
For BF16 / FP32 / shared-expert tensors: dequantize FP8 to BF16 where applicable, otherwise pass through.
Write a single new shard model-mtp-w4a16.safetensors, hardlink the 4 base shards into the new dir, write a union safetensors index and an updated config with the rebuilt ignore list.

The full pipeline is reproducible in ~30 minutes total: ~25 min calibration capture (eager-mode serve + 256 prompts), ~3 min Hessian accumulation + GPTQ on a single GPU, ~2 min splice + manifest writeout.

Files in this repo

model-{00001..00004}-of-00004.safetensors — pasta-paul's base shards (W4A16 routed experts, FP8_BLOCK attention, BF16 shared experts) — redistributed under Apache 2.0 with attribution.
model-mtp-w4a16.safetensors — Acti's MTP layer (3.55 GB; GPTQ-quantized in v2).
model.safetensors.index.json — union index.
config.json — base config + rebuilt quantization_config.ignore for MTP submodules.
MTP_QUANT_MANIFEST.{json,md} — declares is_final_quality_preserving: true (real GPTQ).
recipe.yaml — pasta-paul's GPTQ recipe (for the base layers; included for transparency).
tokenizer.json, tokenizer_config.json, generation_config.json — unchanged from base.

Limitations

Hardware: validated only on 2× RTX PRO 6000 Blackwell Max-Q (sm_120). Should also work on RTX PRO 6000 Server, DGX Spark / GB10, and 8× H200 — same code path as base. Without --disable-custom-all-reduce, Max-Q deadlocks at post-graph eager warmup.
TP: TP=2 only. TP=1 OOMs on a single 96 GB GPU; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
MTP quality: GPTQ pipeline used a simplified MTP forward that skipped the attention layer in calibration (used identity-residual). This is a known approximation chosen to avoid re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch — it captures the post-FFN-norm input distribution to each expert, but the attention's contribution to that distribution is approximated. The output produced by the served model is unaffected (the main model verifies every accepted draft); only the MTP draft accept rate is impacted. Lift on this hardware vs. v1 RTN: ~+2% decode TPS on long-context profile.
num_speculative_tokens: capped at 1 because DSV4-Flash ships exactly one MTP head (num_nextn_predict_layers=1). Higher values would not produce more draft tokens.
Reasoning parser: with --reasoning-parser deepseek_v4, model output is split into content and reasoning_content — applications that read only content will see empty strings for "thinking" responses. Adjust accordingly.

Future work

A higher-fidelity GPTQ pass on the MTP block would re-implement DSV4 hybrid attention (CSA+HCA) in pure PyTorch (or hook it directly out of vLLM's compiled forward) so calibration sees the real post-attention residual distribution rather than the identity-residual approximation we use today. Estimated additional lift: ~+5-10% decode TPS, requiring a ~1-day code change. Documented in the project's MTP_LOCAL_REQUANT_STATUS.md follow-ups.

Citation / attribution

If this model helps you, please cite both:

DeepSeek-AI for deepseek-ai/DeepSeek-V4-Flash — the source model.
pasta-paul for pastapaul/DeepSeek-V4-Flash-W4A16-FP8 — the W4A16 GPTQ quantization and the validated jasl/vllm serving stack (repo).
Acti for the MTP-head requantization (RTN v1, GPTQ v2) and the vLLM-MTP loader patches in this repo.

License: Apache 2.0 (matching upstream).

Downloads last month: 2,723

Safetensors

Model size

44B params

Tensor type

I64

F32

I32

BF16

F8_E4M3

Model tree for LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

Base model

deepseek-ai/DeepSeek-V4-Flash

Quantized

canada-quant/DeepSeek-V4-Flash-W4A16-FP8