Instructions to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- HERMES
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with HERMES:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- llama-cpp-python
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF", filename="Carnice-Qwen3.6-MoE-35B-A3B-F16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Ollama:
ollama run hf.co/samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
- Unsloth Studio
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF to start chatting
- Pi
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Docker Model Runner:
docker model run hf.co/samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
- Lemonade
How to use samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Carnice-Qwen3.6-MoE-35B-A3B-GGUF-Q4_K_M
List all available models
lemonade list
Carnice Qwen3.6 MoE 35B-A3B — Hermes-Focused Agentic Model (GGUF)
QLoRA fine-tune of Qwen3.6-35B-A3B (MoE, 3B active parameters) optimized for agentic workflows and Hermes Agent runtime. Two-stage training adapted from kai-os/Carnice-9b.
This is the successor to Carnice-MoE-35B-A3B (based on Qwen3.5), retrained on the newer Qwen3.6 base which brings improved agentic coding, extended context (262K native, up to 1M with RoPE scaling), and native multimodal support.
Credits
Training methodology adapted from kai-os/Carnice-9b — same two-stage approach and datasets, applied to the larger MoE architecture. Key inspiration: training on actual Hermes Agent execution traces for native agentic behavior.
Available Quantizations
| Quantization | Size | Min VRAM |
|---|---|---|
| F16 | 65 GB | 1x 98GB GPU |
| Q8_0 | 35 GB | 1x 48GB GPU |
| Q6_K | 27 GB | 1x 32GB GPU |
| Q5_K_M | 24 GB | 1x 32GB GPU |
| Q4_K_M | 20 GB | 1x 24GB GPU |
| Q4_K_S | 19 GB | 1x 24GB GPU |
For BF16 safetensors, see samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B.
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
| Native Context Length | 262,144 tokens |
| Thinking Modes | Thinking / Non-thinking (native Qwen3.6) |
What Makes This Different
Unlike generic reasoning distillation, this model was trained on actual Hermes Agent execution traces — real conversations where an AI agent:
- Executes terminal commands and processes output
- Performs file editing operations
- Chains multi-step tool calls with results feeding back
- Uses browser-assisted workflows
- Makes decisions based on environmental feedback
This teaches the model the exact conversation patterns Hermes expects, rather than just generic reasoning.
Training Details
Two-Stage Approach
Stage A — Reasoning Repair (1 epoch)
- Strengthens base model reasoning before agent-specific training
- Loss: 0.4281
| Dataset | Examples |
|---|---|
| bespokelabs/Bespoke-Stratos-17k | 16,710 |
| AI-MO/NuminaMath-CoT | 17,000 (capped) |
Stage B — Hermes Traces (2 epochs)
- Agent-specific behavioral training on real execution traces
- Loss: 0.3045
| Dataset | Examples |
|---|---|
| kai-os/carnice-glm5-hermes-traces | 1,627 (high quality) |
| open-thoughts/OpenThoughts-Agent-v1-SFT | 15,209 |
Training Configuration
| Parameter | Stage A | Stage B |
|---|---|---|
| LoRA Rank | 64 | 64 |
| LoRA Alpha | 64 | 64 |
| LoRA Targets | q, k, v, o projections | q, k, v, o projections |
| Learning Rate | 2e-5 (linear) | 1e-5 (cosine) |
| Epochs | 1 | 2 |
| Effective Batch | 12 | 12 |
| Context Length | 4096 | 4096 |
| Precision | 4-bit QLoRA + BF16 adapters | Same |
| GPU | RTX PRO 6000 Blackwell (98GB) | Same |
| Total Training Time | ~55 hours (both stages) |
Trainable Parameters
13,762,560 (0.04% of 35.1B total)
Usage with llama.cpp
# Download a quantization (e.g., Q8_0)
huggingface-cli download samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF \
Carnice-Qwen3.6-MoE-35B-A3B-Q8_0.gguf --local-dir .
# Run with llama-server
llama-server \
--model Carnice-Qwen3.6-MoE-35B-A3B-Q8_0.gguf \
--n-gpu-layers -1 \
--ctx-size 262144 \
--host 0.0.0.0 --port 8000
Acknowledgements
- kai-os — Carnice training methodology and Hermes traces dataset
- open-thoughts — Agent SFT dataset
- bespokelabs — Bespoke-Stratos reasoning dataset
- Unsloth — QLoRA training framework
- Qwen — Base model
- Downloads last month
- 3,742
4-bit
5-bit
6-bit
8-bit
16-bit
Model tree for samuelcardillo/Carnice-Qwen3.6-MoE-35B-A3B-GGUF
Base model
Qwen/Qwen3.6-35B-A3B