Instructions to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced",
	filename="Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-IQ2_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Use Docker

docker model run hf.co/HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

vLLM

How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with Ollama:
```
ollama run hf.co/HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M
```

Unsloth Studio

How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced to start chatting

Pi

How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with Docker Model Runner:
```
docker model run hf.co/HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M
```

Lemonade

How to use HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced:Q4_K_M

Run and chat with the model

lemonade run user.Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_M

List all available models

lemonade list

Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced

Join the Discord for updates, roadmaps, projects, or just to chat.

Gemma4-26B-A4B uncensored by HauhauCS. 0/465 refusals.* This is a Release Candidate, after more than a month of nonstop work.

HuggingFace's "Hardware Compatibility" widget doesn't recognize K_P quants — it may show fewer files than actually exist. Click "View +X variants" or go to Files and versions to see all available downloads.

About

GenRM Defeated!

No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended — just without the refusals.

These are meant to be the best lossless uncensored models out there.

Balanced — Release Candidate

This legitimately took me over a month of non-stop work. The target is 0 refusals in standard use, and that's what I'm seeing in both automated and manual testing. A handful of edge-case prompts still deflect on the first try but follow through on a re-ask. If you hit one that Balanced won't get past, the Aggressive variant is coming once I figure out how to maintain lossless or near-lossless quality for it.

Balanced: will reason through edgy requests, occasionally attach a short safety framing, then deliver the full answer. Output is complete, with nothing held back, but it can talk itself into it first. This is the recommended default: 99%+ of users will be happy here. Best for creative writing, RP, and emotional intelligence. Normally I'd also say "agentic coding/tool use"; however, in my in-depth testing, Qwen3.6 has been net superior on such tasks. Do be mindful of the few deflection cases I mentioned already.
Aggressive (separate release, WIP): strips the self-reasoning preamble and gives direct answers, even on DEEPLY censored topics.

Balanced also has meaningfully more stable sampling across re-runs, which matters for long-context sessions: no sporadic topic drift deep into a conversation.

Downloads

File	Quant	BPW	Size
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q8_K_P.gguf	Q8_K_P	8.64	27 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q6_K_P.gguf	Q6_K_P	7.21	23 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_P.gguf	Q5_K_P	6.12	19 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf	Q5_K_M	6.06	19 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf	Q4_K_P	5.36	17 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_M.gguf	Q4_K_M	5.32	17 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-IQ4_XS.gguf	IQ4_XS	4.41	14 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q3_K_P.gguf	Q3_K_P	4.25	13 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q3_K_M.gguf	Q3_K_M	4.21	13 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-IQ3_M.gguf	IQ3_M	3.93	12 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q2_K_P.gguf	Q2_K_P	3.39	11 GB
Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-IQ2_M.gguf	IQ2_M	3.29	10 GB
mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf	mmproj (f16)	—	1.2 GB

BPW is slightly higher than nominal across the board because Gemma4 has a lot of per-layer norm/scale tensors kept at F32 (multiple post-ffw norms per layer). All quants generated with importance matrix (imatrix) for optimal quality preservation on uncensored weights.
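
As a rough sanity check on the table, file size can be estimated from BPW and the total parameter count. This is a minimal sketch, assuming the 25.2B total from the Specs section and decimal gigabytes (the table does not say whether it uses GB or GiB); the function name is made up for illustration.

```python
# Rough GGUF size estimate: total parameters x bits-per-weight / 8 bits per byte.
TOTAL_PARAMS = 25.2e9  # total parameter count, taken from the Specs section


def estimated_size_gb(bpw: float, params: float = TOTAL_PARAMS) -> float:
    """Return the approximate file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bpw / 8 / 1e9


# BPW values copied from the Downloads table.
for quant, bpw in [("Q8_K_P", 8.64), ("Q4_K_P", 5.36), ("IQ2_M", 3.29)]:
    print(f"{quant}: ~{estimated_size_gb(bpw):.1f} GB")
```

The estimates should land within about half a gigabyte of the sizes listed above; metadata and rounding account for the rest.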

What are K_P quants?

K_P ("Perfect") quants are HauhauCS custom quantizations that use model-specific analysis to selectively preserve quality where it matters most. Each model gets its own optimized quantization profile — the top 25% most-important tensors (per imatrix calibration) are promoted to a higher quant type.

A K_P quant effectively bumps quality up by 1-2 quant levels at only ~5-15% larger file size than the base quant. Fully compatible with llama.cpp, LM Studio, and any GGUF-compatible runtime — no special builds needed.
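
To make the "top 25% get promoted" idea concrete, here is a toy sketch. It is not the actual K_P tooling: the importance scores, tensor names, and the one-level-up mapping are all hypothetical, and the real profiles are model-specific.

```python
import math

# Hypothetical "one level up" mapping; real K_P profiles are model-specific.
PROMOTE = {"Q4_K": "Q5_K", "Q5_K": "Q6_K", "Q6_K": "Q8_0"}


def promote_top_tensors(importance: dict, base: str, fraction: float = 0.25) -> dict:
    """Assign `base` to every tensor, then promote the top `fraction` by importance."""
    n_promote = math.ceil(len(importance) * fraction)
    ranked = sorted(importance, key=importance.get, reverse=True)
    promoted = set(ranked[:n_promote])
    return {name: (PROMOTE[base] if name in promoted else base) for name in importance}


# Made-up importance scores standing in for imatrix-derived statistics.
scores = {"blk.0.attn_q": 0.9, "blk.0.ffn_up": 0.2, "blk.1.attn_q": 0.5, "blk.1.ffn_up": 0.1}
plan = promote_top_tensors(scores, base="Q4_K")
print(plan)
```

With four tensors and a 25% fraction, only the single highest-scoring tensor is promoted; everything else stays at the base type.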

Note: K_P quants may show as "?" in LM Studio's quant column. This is a display issue only — the model loads and runs fine.

Why this model for agentic work

26B total params with only ~4B active per forward pass (top-8 of 128 experts). You get the reasoning capacity of a 26B model at roughly the inference cost of a ~4B one, which matters when you're chaining 10+ tool calls per task. Sliding-window attention (1024 tokens) plus periodic full attention keeps long contexts cheap without losing global coherence.

Balanced is calibrated for this. It removes refusals on security/ops/research-adjacent topics that block legitimate coding work, without bending the sampling geometry that keeps long chains coherent.

Recommended quant for most coding work: Q4_K_P (17 GB, fits in 24 GB VRAM with room for context) or Q8_K_P (27 GB) if you have more VRAM and want maximum quality with minimal offloading.
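
If you want to apply the same rule of thumb to a different card, a small helper like the one below picks the largest quant that still leaves room for context. It is only a sketch: the sizes come from the Downloads table, while the 6 GB default headroom (KV cache, mmproj, runtime overhead) is my own assumption, not a measured figure.

```python
# (quant, file size in GB) pairs from the Downloads table, largest first.
QUANTS = [
    ("Q8_K_P", 27), ("Q6_K_P", 23), ("Q5_K_P", 19), ("Q5_K_M", 19),
    ("Q4_K_P", 17), ("Q4_K_M", 17), ("IQ4_XS", 14), ("Q3_K_P", 13),
    ("Q3_K_M", 13), ("IQ3_M", 12), ("Q2_K_P", 11), ("IQ2_M", 10),
]


def pick_quant(vram_gb: float, headroom_gb: float = 6.0):
    """Return the largest quant whose file fits in VRAM minus headroom, or None."""
    budget = vram_gb - headroom_gb  # headroom_gb is an assumed allowance, tune it
    for name, size_gb in QUANTS:
        if size_gb <= budget:
            return name
    return None


print(pick_quant(24))  # a 24 GB card with the default headroom
```

Shrinking the headroom trades context length for a bigger quant, so adjust it to the context size you actually run.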

Do note: the main use case for Gemma4 is creative writing, roleplay, and emotional intelligence.

Specs

25.2B total / 3.8B active params (128 routed experts, top-8 + 1 shared expert)
30 layers, hybrid attention: 5× sliding-window (1024 tokens) → 1× full global, repeating. Uses Proportional RoPE (p-RoPE).
Hidden dim 2816, FFN dim 2112, MoE expert FFN 704, vocab 262144
Head dim 256 (SWA) / 512 (full), 16 attention heads, 8 KV heads (2 for full layers)
256K native context
Natively multimodal (text + vision) — ships with mmproj. Variable visual token budgets: 70 / 140 / 280 / 560 / 1120 per image.
Based on google/gemma-4-26B-A4B-it
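
The hybrid attention layout is what keeps long contexts cheap. The sketch below lays out the 5:1 pattern and estimates KV cache size from the numbers in this list. It assumes an unquantized f16 cache and that sliding-window layers only keep the last 1024 tokens; real runtimes may pad or allocate differently, so treat the result as a ballpark.

```python
N_LAYERS = 30
PATTERN = 6    # 5 sliding-window layers followed by 1 full-attention layer
WINDOW = 1024  # sliding-window size in tokens


def layer_kinds() -> list:
    """Return 'swa' or 'full' for each layer, following the repeating 5:1 pattern."""
    return ["full" if (i + 1) % PATTERN == 0 else "swa" for i in range(N_LAYERS)]


def kv_cache_bytes(context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size, assuming an f16 cache (2 bytes per value)."""
    total = 0
    for kind in layer_kinds():
        if kind == "swa":
            kv_heads, head_dim, tokens = 8, 256, min(context_tokens, WINDOW)
        else:
            kv_heads, head_dim, tokens = 2, 512, context_tokens
        total += 2 * kv_heads * head_dim * tokens * bytes_per_value  # 2 = keys + values
    return total


print(layer_kinds().count("full"), "full-attention layers")
print(f"{kv_cache_bytes(32768) / 2**20:.0f} MiB at 32K context")
```

Only the 5 full-attention layers grow with context length; the 25 sliding-window layers are capped at the window size.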

Recommended Settings

From the official Gemma authors:

Inference parameters:

temperature=1.0, top_p=0.95, top_k=64

Important:

Use --jinja with llama.cpp for proper chat template handling
Vision support requires the mmproj file alongside the main GGUF. Place images before text in your prompt for best vision performance.
Keep at least 32K context for serious agentic work; the model can take much more (256K native) if you need it
Sliding window is baked into the architecture — no special flag needed
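
Here is one way to build a request that follows the notes above: image part first, text second, with the recommended sampling parameters. It is a sketch only. The helper names are made up, the image bytes are a placeholder rather than a real picture, and `top_k` is passed as an extra field on the assumption that your server accepts it alongside the standard OpenAI fields.

```python
import base64
import json


def vision_message(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> dict:
    """Build a user message with the image part placed before the text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            {"type": "text", "text": prompt},
        ],
    }


def chat_request(messages: list) -> dict:
    """Wrap messages with the sampling settings recommended above."""
    return {"messages": messages, "temperature": 1.0, "top_p": 0.95, "top_k": 64}


# Placeholder bytes; in practice read them from a real image file.
message = vision_message(b"\xff\xd8\xff", "Describe this image in one sentence.")
body = json.dumps(chat_request([message]))
```

The resulting `body` can be POSTed to the server's chat completions endpoint once the model is running with its mmproj file loaded.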

Turning Thinking On/Off

Gemma4 has thinking mode controlled via enable_thinking in the chat template. It's the same pattern as Qwen3.6 — set false for faster, shorter replies and true (default) when you want chain-of-thought.

LM Studio

Load the model
Right-side settings panel → Model Settings → Prompt Template (or Chat Template Options)
Set enable_thinking to false (or true) in the template kwargs

llama.cpp

llama-server — set as default for all requests:

llama-server -m Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 32768 -ngl 99 \
  --chat-template-kwargs '{"enable_thinking": false}'

Per-request via the OpenAI-compatible API:

{
  "model": "gemma4-26b-a4b",
  "messages": [{"role": "user", "content": "..."}],
  "chat_template_kwargs": {"enable_thinking": false}
}
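
The same per-request toggle from Python, using only the standard library. This is a sketch that assumes llama-server is listening on its default port 8080; `build_request` and `ask` are illustrative names, and the model string simply mirrors the JSON above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumed default; change if you passed --port


def build_request(prompt: str, thinking: bool) -> urllib.request.Request:
    """Create a chat completion request with enable_thinking set per request."""
    payload = {
        "model": "gemma4-26b-a4b",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, thinking: bool = False) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, thinking)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling `ask("Summarize this repo layout", thinking=False)` should give the faster, shorter style of reply, while `thinking=True` restores chain-of-thought for that one request.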

Usage

Works with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF-compatible runtimes.

llama-server:

llama-server -m Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 32768 -ngl 99

llama-cli:

llama-cli -m Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 32768 -ngl 99

Other Models

HauhauCS on HuggingFace

* Tested with both automated and manual refusal benchmarks: no refusals have been found in standard use. A small number of edge-case prompts deflect on the first ask but comply on a re-ask or with strategic framing. If you hit one that's actually obstructive to your use case, join the Discord and flag it so I can work on it in a future revision.

Downloads last month: 168,138

Format: GGUF
Model size: 25B params
Architecture: gemma4

Model tree for HauhauCS/Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced

Base model: google/gemma-4-26B-A4B
Finetuned: google/gemma-4-26B-A4B-it
Quantized (223): this model