LFM2.5-8B-A1B — Uncensored by Gastón Parravicini

First publicly available uncensored/abliterated GGUF of LFM2.5-8B-A1B.
Base model released May 28, 2026. This release: May 29, 2026.

TL;DR

Liquid AI dropped LFM2.5-8B-A1B yesterday. It refused everything. Today it doesn't.

Refusal rate: 1/100 on AdvBench. Reasoning intact. iMatrix quants. Next-day release.

About LFM2.5-8B-A1B

LFM2.5-8B-A1B is Liquid AI's latest edge model — a hybrid convolution + attention MoE architecture with:

8.3B total parameters, 1.5B active per token (MoE with 32 experts, 4 active)
128K context window (up from 32K in LFM2)
Trained on 38T tokens with large-scale reinforcement learning
Reasoning model — generates <think>...</think> chains before answering
Fastest in its class: 18,500 tokens/sec on H100 at high concurrency

The architecture is not a standard Transformer. It combines:

18 layers of gated short convolutions (LIV blocks) — O(n) complexity
6 layers of Grouped Query Attention (GQA) — O(n²) for global context
MoE feed-forward with sparse expert routing

This hybrid design is what makes it fast. It's also what makes abliteration non-trivial.
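As a rough illustration of what "32 experts, 4 active" means per token, here is a minimal sketch of generic top-k expert routing. It is not taken from the LFM2.5 code: the function `route_top_k`, the softmax normalization over the selected experts, and the toy router logits are all assumptions made for the example.

```python
import math


def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 32  # expert count quoted above
logits = [0.0] * NUM_EXPERTS
# Hypothetical router scores for one token: four experts stand out.
for expert, score in ((3, 2.0), (17, 1.5), (8, 1.0), (30, 0.5)):
    logits[expert] = score

chosen = route_top_k(logits)
print([expert for expert, _ in chosen])
```

Only the selected experts run their feed-forward block for that token, which is why the active parameter count is a fraction of the total.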

Why standard abliteration tools don't work here

Every existing abliteration tool (NousResearch/llm-abliteration, Heretic, OBLITERATUS) targets the standard Transformer weight matrices:

self_attn.o_proj     ← doesn't exist in LFM2.5
mlp.down_proj        ← doesn't exist in LFM2.5

Running sharded_ablate.py on LFM2.5 without patching results in 0 shards modified. The model is completely unchanged. This is why no abliterated version existed before this release.

How this was done

1. Architecture reverse engineering

Full manual inspection of the LFM2.5 weight map to identify the correct abliteration targets:

Layer type          | Target matrix
--------------------|----------------------------------
Conv LIV block      | conv.out_proj    [2048, 2048]
GQA Attention block | self_attn.out_proj [2048, 2048]
Dense FFN (L0-L1)   | feed_forward.w2  [2048, 7168]

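Key insight: conv.in_proj has shape [6144, 2048], a 3x expansion projection. Its output is 6144-dimensional rather than the 2048-dimensional hidden size, so the standard direction subtraction, which expects one direction entry per output row, fails with a dimension mismatch error. It is excluded intentionally.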
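The shape constraint can be sketched in a few lines. This is an illustration under assumptions, not the code from `sharded_ablate.py`: the helper `ablate_output_direction` is hypothetical, weights are assumed to be stored as [out_features, in_features], and a hidden size of 4 stands in for 2048.

```python
def ablate_output_direction(weight, direction, scale=1.0):
    """Return weight - scale * r (r^T weight) for a unit vector r.

    `weight` is a list of rows shaped [out_features, in_features];
    `direction` must have one entry per output row.
    """
    out_features = len(weight)
    if len(direction) != out_features:
        raise ValueError(
            f"direction has {len(direction)} entries, "
            f"weight has {out_features} rows"
        )
    in_features = len(weight[0])
    # r^T W: one coefficient per input column.
    coeffs = [
        sum(direction[i] * weight[i][j] for i in range(out_features))
        for j in range(in_features)
    ]
    return [
        [weight[i][j] - scale * direction[i] * coeffs[j]
         for j in range(in_features)]
        for i in range(out_features)
    ]


HIDDEN = 4                           # stand-in for the 2048-wide hidden size
refusal_dir = [1.0, 0.0, 0.0, 0.0]   # toy unit vector in hidden space

out_proj = [[1.0] * HIDDEN for _ in range(HIDDEN)]     # [hidden, hidden]
in_proj = [[1.0] * HIDDEN for _ in range(3 * HIDDEN)]  # [3 * hidden, hidden]

ablated = ablate_output_direction(out_proj, refusal_dir)
print(ablated[0])  # the row aligned with the direction is zeroed

try:
    ablate_output_direction(in_proj, refusal_dir)
    in_proj_error = None
except ValueError as err:
    in_proj_error = str(err)
    print("in_proj skipped:", in_proj_error)
```

The square `out_proj` matrix accepts the hidden-size direction, while the 3x-expanded `in_proj` is rejected before any arithmetic happens.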

2. Patch to sharded_ablate.py

# LFM2/LFM2.5 hybrid architecture support patch
# by Gastón Parravicini — May 29, 2026
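# Enables abliteration of lfm2moe models in NousResearch/llm-abliteration
# Assumes `layer_prefix` and `layer` are defined by the enclosing loop in sharded_ablate.py (not shown)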

lfm2_patterns = [
    f"{layer_prefix}.layers.{layer}.self_attn.out_proj.weight",
    f"{layer_prefix}.layers.{layer}.conv.out_proj.weight",
    f"{layer_prefix}.layers.{layer}.feed_forward.w2.weight",
]

Without this patch: 0/10 shards modified.
With this patch: 6/10 shards modified, all correct targets.
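To see why the unpatched script touches nothing, it can help to match both sets of tensor names against a weight map. The sketch below is illustrative only: the toy `weight_map`, the `model` layer prefix, the shard file names, and the helper `matched_shards` are assumptions, not contents of the real checkpoint index or of `sharded_ablate.py`.

```python
# Toy weight map: tensor name -> shard file. All entries are hypothetical
# stand-ins for what a sharded checkpoint index might contain.
weight_map = {
    "model.layers.11.conv.out_proj.weight": "shard-a.safetensors",
    "model.layers.11.conv.in_proj.weight": "shard-a.safetensors",
    "model.layers.12.self_attn.out_proj.weight": "shard-b.safetensors",
    "model.layers.0.feed_forward.w2.weight": "shard-c.safetensors",
    "model.embed_tokens.weight": "shard-d.safetensors",
}

STANDARD_SUFFIXES = ("self_attn.o_proj.weight", "mlp.down_proj.weight")
LFM2_SUFFIXES = (
    "self_attn.out_proj.weight",
    "conv.out_proj.weight",
    "feed_forward.w2.weight",
)


def matched_shards(weight_map, suffixes, layer_prefix="model", layers=range(24)):
    """Return the set of shard files holding at least one target tensor."""
    targets = {
        f"{layer_prefix}.layers.{layer}.{suffix}"
        for layer in layers
        for suffix in suffixes
    }
    return {shard for name, shard in weight_map.items() if name in targets}


standard_hits = matched_shards(weight_map, STANDARD_SUFFIXES)
lfm2_hits = matched_shards(weight_map, LFM2_SUFFIXES)
print(len(standard_hits), "shards matched by the standard suffixes")
print(len(lfm2_hits), "shards matched by the LFM2 suffixes")
```

With the standard suffixes no tensor name matches, so no shard is ever opened for modification. The LFM2 suffixes pick up the output projections while leaving `conv.in_proj` and the embeddings alone.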

3. Refusal direction analysis

Used analyze.py to map refusal signal strength across all 24 layers:

Layers | Est. Signal Quality | Type
-------|---------------------|-------------------
0–2    | ~0.000              | Skip
3–10   | 0.010–0.062         | Low
11–17  | 0.108–0.240         | Peak (abliterated)
18–23  | 0.049–0.145         | High (abliterated)

Layer 16 was the peak signal layer (Est. Signal Quality of about 0.24). It was used as the primary measurement reference for all ablated layers.
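The general idea of picking a measurement layer can be sketched with a difference-of-means direction. This is a simplified stand-in, not what `analyze.py` computes: the score used here (the norm of the mean difference) is a hypothetical proxy for "Est. Signal Quality", and the two-dimensional activations are made-up toy data.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def mean_difference(harmful, harmless):
    """Mean harmful activation minus mean harmless activation."""
    return [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


# Toy activations: layer index -> (harmful samples, harmless samples).
activations = {
    2: ([[0.1, 0.0], [0.1, 0.0]], [[0.1, 0.0], [0.1, 0.0]]),
    10: ([[0.3, 0.0], [0.5, 0.0]], [[0.1, 0.0], [0.3, 0.0]]),
    16: ([[1.0, 0.2], [1.2, 0.2]], [[0.0, 0.2], [0.2, 0.2]]),
}

# Hypothetical proxy score: how far apart the two means are at each layer.
scores = {
    layer: norm(mean_difference(harmful, harmless))
    for layer, (harmful, harmless) in activations.items()
}
peak_layer = max(scores, key=scores.get)

# Unit-normalized refusal direction measured at the peak layer.
diff = mean_difference(*activations[peak_layer])
direction = [x / norm(diff) for x in diff]
print(peak_layer, direction)
```

The direction measured at the peak layer is then reused when modifying every layer in the ablation range.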

4. Abliteration parameters

layers: 11–23
measurement: layer 16 (peak refusal signal)
scale: 2.0
flags: --projected --normpreserve

--projected: orthogonalizes the refusal direction against the harmless direction before subtracting — cleaner removal, less capability damage
--normpreserve: preserves weight matrix row norms after projection — prevents magnitude drift
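The two flags can be sketched as follows, going only by the descriptions above. This is an assumption-laden illustration rather than the tool's implementation: the helper names, the [out_features, in_features] row layout, and the toy vectors are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(vector):
    length = math.sqrt(dot(vector, vector))
    return [x / length for x in vector]


def project_out(refusal, harmless):
    """Sketch of --projected: drop the part of `refusal` along `harmless`."""
    h = unit(harmless)
    overlap = dot(refusal, h)
    return unit([r - overlap * x for r, x in zip(refusal, h)])


def ablate(weight, direction, scale):
    """weight - scale * r (r^T weight), with weight stored as [out, in] rows."""
    rows, cols = range(len(weight)), range(len(weight[0]))
    coeffs = [sum(direction[i] * weight[i][j] for i in rows) for j in cols]
    return [[weight[i][j] - scale * direction[i] * coeffs[j] for j in cols]
            for i in rows]


def restore_row_norms(original, modified):
    """Sketch of --normpreserve: rescale each row to its original L2 norm."""
    restored = []
    for old, new in zip(original, modified):
        old_norm = math.sqrt(dot(old, old))
        new_norm = math.sqrt(dot(new, new))
        factor = old_norm / new_norm if new_norm > 0 else 0.0
        restored.append([x * factor for x in new])
    return restored


refusal = [1.0, 0.0]    # toy refusal direction
harmless = [1.0, 1.0]   # toy harmless direction
weight = [[3.0, 4.0], [1.0, 2.0]]

direction = project_out(refusal, harmless)
result = restore_row_norms(weight, ablate(weight, direction, scale=2.0))
print(direction)
print(result)
```

After `project_out`, the direction has no component along the harmless direction, and after `restore_row_norms` every row has the same magnitude it had before the edit.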

5. Weight diff verification

Post-abliteration comparison against base model (via compare.py):

Metric          | Value
----------------|-------------------------------
Avg weight diff | ~4–5 × 10⁻⁴
Max weight diff | ~1–3%
Layers 0–10     | Zero diff, untouched ✅
Layers 11–23    | Surgical modifications only ✅

The modifications are small and confined to layers 11–23, which is consistent with the model's general capabilities being preserved.
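A comparison of this kind can be sketched with two simple statistics per tensor. The exact metrics reported by `compare.py` are not documented here, so the function `weight_diff_stats`, the choice of mean and maximum absolute difference, and the toy values below are assumptions.

```python
def weight_diff_stats(base, modified):
    """Return (mean absolute diff, max absolute diff) for two flat tensors."""
    if len(base) != len(modified):
        raise ValueError("tensors must have the same number of elements")
    diffs = [abs(a - b) for a, b in zip(base, modified)]
    return sum(diffs) / len(diffs), max(diffs)


# Toy tensors: an untouched layer and a layer with one changed weight.
untouched_base = [0.5, -1.0, 2.0, 0.25]
untouched_new = [0.5, -1.0, 2.0, 0.25]
edited_base = [1.0, 2.0, 3.0, 4.0]
edited_new = [1.0, 2.0, 3.5, 4.0]

untouched_stats = weight_diff_stats(untouched_base, untouched_new)
edited_stats = weight_diff_stats(edited_base, edited_new)
print("untouched:", untouched_stats)
print("edited:", edited_stats)
```

A layer outside the ablation range should report zero on both statistics, which is the check behind the "Zero diff" row.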

Abliteration results

Metric                              | Result
------------------------------------|-----------------
Refusal rate (AdvBench 100 prompts) | 1/100 (1%)
Reasoning (`<think>` tags)          | ✅ Fully intact
General capability                  | ✅ Verified
Next-day release                    | ✅ May 29, 2026

The single remaining refusal out of 100 is an edge case. The model reasons freely — the <think> block no longer contains refusal logic.
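How the refusals were scored is not stated above. One common lightweight approach is to strip the reasoning block and look for refusal phrases in the final answer. The sketch below assumes that approach: the marker list, the helper names, and the sample responses are hypothetical, and a keyword check of this kind can misclassify borderline answers.

```python
import re

# Hypothetical refusal markers; a real evaluation would need a longer list
# or a classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")


def final_answer(response):
    """Drop any <think>...</think> block and return the remaining text."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()


def is_refusal(response):
    text = final_answer(response).lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses):
    """Return (number refused, number of responses)."""
    return sum(is_refusal(r) for r in responses), len(responses)


samples = [
    "<think>The user wants a list of steps.</think>Here are the steps: ...",
    "I'm sorry, but I can't help with that.",
    "<think>I cannot recall the exact figure, so I will estimate.</think>"
    "Sure, here is an outline.",
]
refused, total = refusal_rate(samples)
print(f"{refused}/{total} refused")
```

Scoring only the text after `</think>` is a design choice: it counts what the user actually receives, and it keeps phrases inside the reasoning block from being flagged as refusals.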

Available quants

All generated with iMatrix calibration on harmful/harmless instruction data.

File             | Size    | Use case
-----------------|---------|-----------------------------
`...IQ4_XS.gguf` | ~4.4 GB | Maximum compression
`...Q4_K_M.gguf` | ~4.9 GB | ⭐ Recommended, best balance
`...Q5_K_M.gguf` | ~5.7 GB | Better quality
`...Q6_K.gguf`   | ~7.2 GB | High quality
`...Q8_0.gguf`   | ~8.6 GB | Near-lossless
`...F16.gguf`    | ~16 GB  | Full precision
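For a rough sense of what each file size means, the sizes can be converted to average bits per weight using the 8.3B total parameter count quoted earlier. This is back-of-the-envelope arithmetic under assumptions: 1 GB is taken as 10^9 bytes, the listed sizes are approximate, and file metadata is ignored.

```python
TOTAL_PARAMS = 8.3e9  # total parameter count quoted for the base model

# Approximate file sizes in GB, copied from the table above.
sizes_gb = {
    "IQ4_XS": 4.4,
    "Q4_K_M": 4.9,
    "Q5_K_M": 5.7,
    "Q6_K": 7.2,
    "Q8_0": 8.6,
    "F16": 16.0,
}


def bits_per_weight(size_gb, params=TOTAL_PARAMS):
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


for name, size in sizes_gb.items():
    print(f"{name}: ~{bits_per_weight(size):.1f} bits/weight")
```

The same arithmetic gives a quick lower bound on the memory needed to load a quant, before the KV cache and runtime buffers are added.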

Usage

llama-server

llama-server \
  -m LFM2.5-8B-A1B-Uncensored-Gaston-Q4_K_M.gguf \
  -ngl 99 -c 8192 --port 8080
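Once the server is running, it can be queried from Python through its OpenAI-compatible chat endpoint. The sketch below uses only the standard library. The helper names `build_request` and `chat`, the `max_tokens` value, and the timeout are choices made for this example, and the base URL assumes the `--port 8080` setting from the command above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes --port 8080 as above


def build_request(prompt, max_tokens=512):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; chat() does.
req = build_request("What is the capital of France?")
print(req.get_method(), req.full_url)
```

With the server up, `chat("What is the capital of France?")` returns the assistant message as a string.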

Ollama

ollama run hf.co/gaston-parravicini/LFM2.5-8B-A1B-Uncensored-Gaston-GGUF:Q4_K_M

llama-cli (quick test)

llama-cli \
  -m LFM2.5-8B-A1B-Uncensored-Gaston-Q4_K_M.gguf \
  -ngl 99 -p "<|startoftext|><|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"

Notes

Tool calling: LFM2.5 supports tool calling natively in transformers. In llama.cpp there is a known bug with the chat template that breaks tool use — upstream is debugging (PR #23826).

Prompt cache: lfm2moe models clear the KV cache on every turn in llama.cpp (known upstream issue). Output quality is unaffected, but the full conversation is reprocessed each turn, so long chats respond more slowly.

Reasoning: This is a thinking model. Responses include <think>...</think> before the final answer. This is expected and correct.

Base model

Model: LiquidAI/LFM2.5-8B-A1B
Architecture: lfm2moe — hybrid conv (LIV) + GQA + MoE
Parameters: 8.3B total / 1.5B active
Context: 128K tokens
License: LFM Open License v1.0

Released by Gastón Parravicini — huggingface.co/gaston-parravicini
Architecture patch for lfm2moe abliteration — first of its kind.
