NVIDIA-Nemotron-Labs-3-Elastic-12B-A2B-mxfp4-mlx

Brainwaves

Quant    arc    arc/e  boolq  hswag  obkqa  piqa   wino
mxfp8    0.433  0.624  0.807  0.593  0.390  0.752  0.602
qx86-hi  0.454  0.658  0.821  0.636  0.412  0.770  0.639
qx64-hi  0.442  0.643  0.824  0.631  0.410  0.765  0.635
mxfp4    0.444  0.634  0.823  0.622  0.390  0.765  0.628

Quant    Perplexity      Peak Memory   Tokens/sec
mxfp8    6.228 ± 0.048   17.04 GB      2165
qx86-hi  5.269 ± 0.039   15.18 GB      2128
qx64-hi  5.407 ± 0.040   12.08 GB      2326
mxfp4    5.786 ± 0.044   10.83 GB      1788
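
The numbers above can be combined to compare the quants side by side. The following is a minimal sketch, not part of the original evaluation: it copies the figures from the two tables, computes an unweighted mean of the seven benchmark scores (one rough summary among many), and picks the lowest-perplexity quant that fits a memory budget. The helper names `mean_score` and `best_under_budget` are hypothetical, and the sketch assumes the reported peak memory is a fair proxy for what your own workload will need.

```python
# Figures copied from the tables above; nothing here is measured anew.
SCORES = {
    "mxfp8":   [0.433, 0.624, 0.807, 0.593, 0.390, 0.752, 0.602],
    "qx86-hi": [0.454, 0.658, 0.821, 0.636, 0.412, 0.770, 0.639],
    "qx64-hi": [0.442, 0.643, 0.824, 0.631, 0.410, 0.765, 0.635],
    "mxfp4":   [0.444, 0.634, 0.823, 0.622, 0.390, 0.765, 0.628],
}

# quant: (perplexity, peak memory in GB, tokens/sec)
STATS = {
    "mxfp8":   (6.228, 17.04, 2165),
    "qx86-hi": (5.269, 15.18, 2128),
    "qx64-hi": (5.407, 12.08, 2326),
    "mxfp4":   (5.786, 10.83, 1788),
}


def mean_score(quant):
    """Unweighted mean of the seven benchmark scores for one quant."""
    scores = SCORES[quant]
    return sum(scores) / len(scores)


def best_under_budget(budget_gb):
    """Return the lowest-perplexity quant whose peak memory fits, or None."""
    fitting = [q for q, (_, mem, _) in STATS.items() if mem <= budget_gb]
    if not fitting:
        return None
    return min(fitting, key=lambda q: STATS[q][0])


if __name__ == "__main__":
    for quant in SCORES:
        ppl, mem, tps = STATS[quant]
        print(f"{quant:8s} mean={mean_score(quant):.3f} "
              f"ppl={ppl:.3f} mem={mem:.2f} GB tok/s={tps}")
    print("best under 12 GB:", best_under_budget(12.0))
```

With these figures, mxfp4 is the only quant listed that stays under 11 GB of peak memory, which is the main reason to choose it over the qx variants.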

This model NVIDIA-Nemotron-Labs-3-Elastic-12B-A2B-mxfp4-mlx was converted to MLX format from DavidAU/NVIDIA-Nemotron-Labs-3-Elastic-12B-A2B using mlx-lm version 0.31.3.

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("nightmedia/NVIDIA-Nemotron-Labs-3-Elastic-12B-A2B-mxfp4-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)