Instructions to use mlx-community/DeepSeek-V4-Flash-8bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mlx-community/DeepSeek-V4-Flash-8bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V4-Flash-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- MLX LM
How to use mlx-community/DeepSeek-V4-Flash-8bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "mlx-community/DeepSeek-V4-Flash-8bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server (port set explicitly so it matches the curl call below)
mlx_lm.server --model "mlx-community/DeepSeek-V4-Flash-8bit" --port 8000

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mlx-community/DeepSeek-V4-Flash-8bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
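The same endpoint can also be called from Python with a tools array, which is how an OpenAI-compatible client would request tool calling. The following is a minimal standard-library sketch, not an official client: the `get_weather` tool and the helper names `build_tool_request`, `extract_tool_calls`, and `post_chat` are hypothetical, and `base_url` is an assumption that must match the host and port your server is actually listening on.

```python
import json
import urllib.request

MODEL = "mlx-community/DeepSeek-V4-Flash-8bit"

# Hypothetical tool definition in the OpenAI "function" tool format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(user_text, tools, model=MODEL):
    """Build a chat-completions request body that advertises tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
    }


def extract_tool_calls(response):
    """Return the tool calls from the first choice, or [] if there are none."""
    choices = response.get("choices") or []
    if not choices:
        return []
    message = choices[0].get("message") or {}
    return message.get("tool_calls") or []


def post_chat(base_url, payload, timeout=600):
    """POST the payload to {base_url}/v1/chat/completions and decode the JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

With the server running, `extract_tool_calls(post_chat(base_url, build_tool_request("What's the weather in Paris?", [WEATHER_TOOL])))` returns an empty list whenever the reply carries no tool calls, which makes a plain-text answer easy to tell apart from a tool invocation.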
Bundled chat_template.jinja is chat-only — strips tools silently
Heads up that the chat_template.jinja shipped in this repo (and across the V4-Flash quant variants) only renders system/user/assistant messages — there's no branch for the tool role, no iteration over the tools array, and no <tool_call> markers. So when an OpenAI-compatible client passes tools=[...], the array is silently dropped by apply_chat_template and the model never knows tools were available.
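For anyone who wants to check a template file themselves, here is a rough static check along the lines described above. It is only a heuristic sketch, not a Jinja parser: the `inspect_template` helper is a hypothetical name, and it simply looks inside `{{ ... }}` and `{% ... %}` blocks for a `tools` identifier and a `'tool'` role literal.

```python
import re

# Jinja expression and statement blocks; comments ({# ... #}) are ignored.
_JINJA_BLOCK = re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.DOTALL)


def inspect_template(source):
    """Heuristically report whether a chat template looks tool-aware.

    Only text inside Jinja blocks is searched, so the word "tools" appearing
    in literal prompt text does not count as a reference.
    """
    code = "\n".join(_JINJA_BLOCK.findall(source))
    return {
        # Is the `tools` variable used anywhere (e.g. `{% for t in tools %}`)?
        "references_tools": bool(re.search(r"\btools\b", code)),
        # Is there a comparison against the 'tool' role (e.g. `== 'tool'`)?
        "handles_tool_role": bool(re.search(r"""['"]tool['"]""", code)),
    }
```

Running it on the text of a local copy of chat_template.jinja gives a quick yes/no on both points; a template that reports False for both is very likely chat-only, though a True result does not prove the tool rendering is correct.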
We picked this up while shipping day-0 V4 support in rapid-mlx (Apple Silicon MLX backend, PR #168). Plain chat works perfectly on both 2-bit DQ and 8-bit on a Mac Studio M3 Ultra (56/31 tok/s decode respectively, 7/8 stress scenarios pass), but our 30-scenario tool-calling eval scored 0/30 — every scenario logs tool_detected: False. Same outcome with Hermes and OpenClaude agent profiles.
Not a quant issue (identical 0/30 on 2-bit and 8-bit) and not a parser issue — the model literally never sees the tools list. Verified by inspecting the rendered prompt.
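The rendered-prompt check can be sketched roughly as follows: render the same messages with and without tools and see whether anything changes. This is an illustration rather than the exact code used in the eval; `tools_reach_prompt` and `chat_only_render` are hypothetical names, and `render` is assumed to be any callable taking `(messages, tools)` and returning the prompt as a string.

```python
def tools_reach_prompt(render, messages, tools):
    """Return True if passing tools changes the prompt and every tool is named in it."""
    baseline = render(messages, None)
    with_tools = render(messages, tools)
    if with_tools == baseline:
        # Identical output means the template ignored the tools array.
        return False
    names = [tool["function"]["name"] for tool in tools]
    return all(name in with_tools for name in names)


def chat_only_render(messages, tools):
    """Stand-in for a chat-only template: `tools` is accepted but never used."""
    return "".join(f"<{m['role']}>{m['content']}" for m in messages)


# Example wiring for a Hugging Face tokenizer (assumption: `tokenizer` is already loaded):
# render = lambda msgs, tools: tokenizer.apply_chat_template(
#     msgs, tools=tools, add_generation_prompt=True, tokenize=False
# )
```

With the stand-in renderer, `tools_reach_prompt(chat_only_render, messages, tools)` is False for any tool list, which is the behavior described here for the bundled template.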
There's an active PR #16 upstream on deepseek-ai/DeepSeek-V4-Flash (by @Rocketknight1, HF staff) adding a tool-supporting template, along with a follow-up alternative proposed by @qgallouedec. Would it be possible to pull whichever variant lands into the V4-Flash quant repos so users get tool calling out of the box?
Happy to test + report numbers once an updated template lands.
Thanks for the great quant work — the model itself runs beautifully on Apple Silicon.
I have tested the version shared by @Rocketknight1 in that PR, and it passes the same tool calling tests as the custom encoding code.