Instructions to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with HERMES:

# No code snippets available yet for this library.

# To use this model, check the repository files and the library's documentation.

# Want to help? PRs adding snippets are welcome at:
# https://github.com/huggingface/huggingface.js

llama-cpp-python

How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF",
	filename="Carnice-Qwen3.6-MoE-35B-A3B-F16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Ollama:
```
ollama run hf.co/samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
```

Unsloth Studio

How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF to start chatting

How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Docker Model Runner:
```
docker model run hf.co/samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
```

Lemonade

How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Carnice-Qwen3.6-MoE-35B-A3B-GGUF-Q4_K_M

List all available models

lemonade list

Carnice Qwen3.6 MoE 35B-A3B — Hermes-Focused Agentic Model (GGUF)

QLoRA fine-tune of Qwen3.6-35B-A3B (MoE, 3B active parameters) optimized for agentic workflows and Hermes Agent runtime. Two-stage training adapted from kai-os/Carnice-9b.

This is the successor to Carnice-MoE-35B-A3B (based on Qwen3.5), retrained on the newer Qwen3.6 base which brings improved agentic coding, extended context (262K native, up to 1M with RoPE scaling), and native multimodal support.

Credits

Training methodology adapted from kai-os/Carnice-9b — same two-stage approach and datasets, applied to the larger MoE architecture. Key inspiration: training on actual Hermes Agent execution traces for native agentic behavior.

Available Quantizations

Quantization	Size	Min VRAM
F16	65 GB	1x 98GB GPU
Q8_0	35 GB	1x 48GB GPU
Q6_K	27 GB	1x 32GB GPU
Q5_K_M	24 GB	1x 32GB GPU
Q4_K_M	20 GB	1x 24GB GPU
Q4_K_S	19 GB	1x 24GB GPU

For BF16 safetensors, see samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B.

Model Details

Property	Value
Base Model	Qwen/Qwen3.6-35B-A3B
Architecture	Mixture of Experts (MoE)
Total Parameters	~35B
Active Parameters	~3B per token
Native Context Length	262,144 tokens
Thinking Modes	Thinking / Non-thinking (native Qwen3.6)

What Makes This Different

Unlike generic reasoning distillation, this model was trained on actual Hermes Agent execution traces — real conversations where an AI agent:

Executes terminal commands and processes output
Performs file editing operations
Chains multi-step tool calls with results feeding back
Uses browser-assisted workflows
Makes decisions based on environmental feedback

This teaches the model the exact conversation patterns Hermes expects, rather than just generic reasoning.

Training Details

Two-Stage Approach

Stage A — Reasoning Repair (1 epoch)

Strengthens base model reasoning before agent-specific training
Loss: 0.4281

Dataset	Examples
bespokelabs/Bespoke-Stratos-17k	16,710
AI-MO/NuminaMath-CoT	17,000 (capped)

Stage B — Hermes Traces (2 epochs)

Agent-specific behavioral training on real execution traces
Loss: 0.3045

Dataset	Examples
kai-os/carnice-glm5-hermes-traces	1,627 (high quality)
open-thoughts/OpenThoughts-Agent-v1-SFT	15,209

Training Configuration

Parameter	Stage A	Stage B
LoRA Rank	64	64
LoRA Alpha	64	64
LoRA Targets	q, k, v, o projections	q, k, v, o projections
Learning Rate	2e-5 (linear)	1e-5 (cosine)
Epochs	1	2
Effective Batch	12	12
Context Length	4096	4096
Precision	4-bit QLoRA + BF16 adapters	Same
GPU	RTX PRO 6000 Blackwell (98GB)	Same
Total Training Time	~55 hours (both stages)

Trainable Parameters

13,762,560 (0.04% of 35.1B total)

Usage with llama.cpp

# Download a quantization (e.g., Q8_0)
huggingface-cli download samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF \
  Carnice-Qwen3.6-MoE-35B-A3B-Q8_0.gguf --local-dir .

# Run with llama-server
llama-server \
  --model Carnice-Qwen3.6-MoE-35B-A3B-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 262144 \
  --host 0.0.0.0 --port 8000

Acknowledgements

kai-os — Carnice training methodology and Hermes traces dataset
open-thoughts — Agent SFT dataset
bespokelabs — Bespoke-Stratos reasoning dataset
Unsloth — QLoRA training framework
Qwen — Base model

Downloads last month: 3,742

GGUF

Model size

35B params

Architecture

qwen35moe

Hardware compatibility

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Quantized

(418)

this model

samuelcardillo
/

Carnice-Qwen3.6-MoE-35B-A3B-GGUF