llm-abliteration

Make abliterated models using Transformers, easy and fast. Now faster with batch inference.

Introduction

Certain directions in an LLM's activation space cause it to refuse users' input. Abliteration is a technique that approximates the most significant refusal direction by contrasting harmful and harmless prompts, and then removes (ablates) that direction from the model. This is a proof-of-concept implementation for exploring the removal of refusals from an LLM without the use of TransformerLens, although some GPU acceleration has been implemented.
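As a rough illustration of the idea described above, the sketch below computes a difference-of-means direction from two sets of activations and projects it out of a weight matrix. It uses NumPy and random stand-in data rather than a real model; the function names, shapes, and the choice of a single weight matrix are assumptions made for the example, not this repository's code.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(weight, direction):
    """Remove `direction` from the output space of `weight` (d_model x d_in)."""
    # W' = W - r (r^T W), so W' x has no component along r for any input x.
    return weight - np.outer(direction, direction @ weight)


rng = np.random.default_rng(0)
d_model = 8

# Stand-in activations: the "harmful" set is shifted along axis 0.
shift = np.zeros(d_model)
shift[0] = 3.0
harmless = rng.normal(size=(32, d_model))
harmful = rng.normal(size=(32, d_model)) + shift

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_model))
W_ablated = ablate_direction(W, r)

# The ablated matrix can no longer write anything along r.
print(np.abs(r @ W_ablated).max())
```

In a real model the activations would come from the residual stream at a chosen layer, and the ablation would be applied to every matrix that writes into that stream.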

The code in various forms has been tested on Llama-3.2, Qwen2.5-Coder, Ministral-8b, Mistral-7B-Instruct-v0.2, gemma-3-27b-it, and Mistral-Nemo-Instruct-2407.

VRAM/RAM requirements: This codebase reflects efforts to reduce VRAM usage. You can abliterate any model provided it fits within VRAM. Loading the model in 4-bit precision using bitsandbytes is possible and recommended for large models when VRAM is limited. It is assumed that there is enough CPU memory to load the bf16 (or full-weight) model; the method for ablating the refusal vector could be enhanced to perform lazy-loading in the future to reduce this requirement.

CUDA is assumed to be available. The original abliteration paper and code used TransformerLens, and measured resid_pre, resid_mid, and resid_post. Failspy's code measured resid_pre and resid_post. Sumandora's code based on Transformers accesses the equivalent of resid_post with hidden_states.

Note

Abliteration does not guarantee full removal of censorship. A properly abliterated model should, in theory, no longer explicitly refuse the kinds of requests captured in the datasets used for abliteration, but that does not make the model completely uncensored.

For an explanation of abliteration, see: https://huggingface.co/blog/mlabonne/abliteration

This repo enables norm-preserving biprojected abliteration. https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration

Removal of the projected contribution during measurement is optional, but the other modifications to this implementation of abliteration are mandatory.
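The sketch below is one possible reading of two ideas named above, not this repository's implementation. It assumes that "removal of the projected contribution" means dropping the part of the refusal direction that lies along the mean harmless direction, and that "norm-preserving" means rescaling the weights back to their original norms after ablation. The function names and the choice to preserve per-column norms (which keeps the ablation exact) are assumptions of this example; the linked post is the authoritative description.

```python
import numpy as np


def remove_projection(refusal, harmless_mean):
    """Drop the part of `refusal` that lies along `harmless_mean` (assumed reading)."""
    h = harmless_mean / np.linalg.norm(harmless_mean)
    orthogonal = refusal - (refusal @ h) * h
    return orthogonal / np.linalg.norm(orthogonal)


def ablate_preserving_norms(weight, direction):
    """Ablate `direction` from `weight` (d_model x d_in), then restore column norms."""
    original_norms = np.linalg.norm(weight, axis=0, keepdims=True)
    ablated = weight - np.outer(direction, direction @ weight)
    new_norms = np.linalg.norm(ablated, axis=0, keepdims=True)
    # Guard against division by zero for columns that were removed entirely.
    return ablated * (original_norms / np.maximum(new_norms, 1e-12))


rng = np.random.default_rng(1)
refusal = rng.normal(size=6)        # stand-in measured refusal direction
harmless_mean = rng.normal(size=6)  # stand-in mean harmless activation
W = rng.normal(size=(6, 4))         # stand-in weight writing into the residual stream

r_perp = remove_projection(refusal, harmless_mean)
W_new = ablate_preserving_norms(W, r_perp)

print(abs(r_perp @ harmless_mean))     # orthogonal to the harmless direction
print(np.linalg.norm(W_new, axis=0))   # matches the column norms of W
```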

Quick Start

Clone the repository

git clone https://github.com/jim-plus/llm-abliteration.git && cd llm-abliteration

Install dependencies

pip install -r requirements.txt

Workflow

Roughly:

Measure directions using measure.py, given harmful and harmless prompt datasets
Analyze directions by layer using analyze.py to determine abliteration strategy
Craft YAML file to drive ablation
Ablate model using sharded_ablate.py
Test resulting model

Measure harmful, harmless, and refusal directions

python measure.py -m <path_to_your_model> -o <output_file>

Carefully curate your prompt datasets to obtain better results. You can explicitly specify prompt dataset files, either as local files or on HuggingFace.

python measure.py -m <path_to_your_model> -o <output_file> --data-harmful DATA_HARMFUL --data-harmless DATA_HARMLESS

For Chinese models, you can also specify --deccp to add certain topics to the "harmful" set to be evaluated.
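When measuring several models, the command lines above can be assembled from a small script. This is a minimal sketch that uses only the flags shown in this section (-m, -o, --data-harmful, --data-harmless, --deccp); the helper names and the example paths are hypothetical.

```python
import subprocess
import sys


def build_measure_command(model, output, harmful=None, harmless=None, deccp=False):
    """Assemble the argument list for measure.py from the flags documented above."""
    cmd = [sys.executable, "measure.py", "-m", model, "-o", output]
    if harmful:
        cmd += ["--data-harmful", harmful]
    if harmless:
        cmd += ["--data-harmless", harmless]
    if deccp:
        cmd.append("--deccp")
    return cmd


def run_measure(**kwargs):
    """Run measure.py from the repository root; raises if it exits with an error."""
    subprocess.run(build_measure_command(**kwargs), check=True)


# Hypothetical paths, shown only to illustrate the resulting command line.
example = build_measure_command(
    "path/to/model",
    "path/to/measurements",
    harmful="path/to/harmful.txt",
    harmless="path/to/harmless.txt",
)
print(" ".join(example))
```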

The measurement script autodetects 4-bit and 8-bit BitsAndBytes models and will attempt to run on them. However, subsequent ablation needs to be performed on full-weight models.

Analyze resulting measurements, with optional charting

python analyze.py <measurement_file> -c

The -c option displays charts of the measurements. Look toward the middle to late-middle layers for good candidate source layers for the refusal direction.

Abliterate model

python sharded_ablate.py <abliteration_yaml_file>

Look at the example YAML file (gemma3-12b-it.yml in this repository) to see how this is structured. YAML was chosen so that more than one source layer can be used for refusal direction measurement, and so that different strategies can be applied per destination layer.

Chat with your abliterated model

python chat.py -m <path_to_your_abliterated_model>

Note: chat.py is inherited code and needs an update to remain useful.

Compare between models

python compare.py -a <model_a> -b <model_b>

Note: compare.py is inherited code and needs an update to remain useful.

Advanced Usage

Use your own prompts

You can use your own prompts to abliterate your model. Supported file formats are .txt, .parquet, .json, and .jsonl. Format explanations are below:

.txt: Each line of the file is a prompt
.parquet: A parquet file with column text
.json: A JSON file with list of strings
.jsonl: A JSON Lines file with a list of strings

Then load your own prompts using the --data-harmful and --data-harmless arguments during measurement.

python measure.py -m <path_to_your_model> -o <output_file> --data-harmful /path/to/my/harmful.txt --data-harmless /path/to/my/harmless.txt

Two scripts, jsonl_to_parquet.py and parquet_to_jsonl.py, have been provided to convert between parquet and jsonl formats to assist in dataset customization. Prompts in this repository are for illustrative purposes only, and have mostly been inherited from the fork.
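To check a prompt file before passing it to measure.py, a small loader along the following lines can read the formats listed above. This is a sketch, not the loader used by this repository: the function name is hypothetical, the .jsonl branch assumes one JSON-encoded string per line, and the .parquet branch assumes pandas with a parquet engine is installed.

```python
import json
from pathlib import Path


def load_prompts(path):
    """Return a list of prompt strings from a .txt, .json, .jsonl, or .parquet file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # One prompt per line; blank lines are skipped.
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    if suffix == ".json":
        prompts = json.loads(path.read_text(encoding="utf-8"))
    elif suffix == ".jsonl":
        # Assumption: each non-empty line is a JSON-encoded string.
        lines = path.read_text(encoding="utf-8").splitlines()
        prompts = [json.loads(line) for line in lines if line.strip()]
    elif suffix == ".parquet":
        import pandas as pd  # optional dependency, only needed for parquet

        prompts = pd.read_parquet(path)["text"].tolist()
    else:
        raise ValueError(f"Unsupported prompt file format: {suffix}")
    if not all(isinstance(p, str) for p in prompts):
        raise ValueError(f"{path} must contain only strings")
    return prompts
```

Calling load_prompts on each file and printing the number of prompts returned is a quick way to catch formatting mistakes before a long measurement run.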

Tips

If you have limited VRAM, try loading the model as a 4-bit or 8-bit BitsAndBytes quant.
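For reference, loading a model in 4-bit with Transformers and bitsandbytes typically looks like the sketch below. The function name is hypothetical and the NF4 and bfloat16 settings are illustrative choices, not values taken from this repository. It requires torch, transformers, and bitsandbytes plus a CUDA GPU, and, as noted earlier, the ablation step itself still needs the full-weight model.

```python
def load_model_4bit(model_path):
    """Load a causal LM quantized to 4-bit with bitsandbytes (needs a CUDA GPU)."""
    # Imports are kept inside the function so the heavy dependencies are only
    # required when a model is actually loaded.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # illustrative choice
        bnb_4bit_compute_dtype=torch.bfloat16,  # illustrative choice
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=quant_config,
        device_map="auto",  # let Accelerate place layers on the available devices
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer
```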
