Make abliterated models using Transformers, easy and fast. Now faster with batch inference.
There exist directions in a model's activation space that cause LLMs to refuse users' input. Abliteration is a technique that approximates the most significant refusal direction by contrasting harmful and harmless prompts, and then removes (ablates) that direction from the model. This is a proof-of-concept implementation exploring the removal of refusals from an LLM without the use of TransformerLens, although some GPU acceleration has been implemented.
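Conceptually, the refusal direction can be approximated as a difference of means between hidden states gathered on harmful versus harmless prompts. A minimal NumPy sketch follows; the array names and shapes are illustrative, and the actual measurement script works on per-layer hidden states captured via Transformers:

```python
import numpy as np

def refusal_direction(harmful_hidden, harmless_hidden):
    """Difference-of-means sketch of a refusal direction.

    harmful_hidden / harmless_hidden: (n_prompts, hidden_dim) arrays of
    hidden states captured at one layer (illustrative stand-ins; the
    real script measures activations from an actual model).
    """
    direction = harmful_hidden.mean(axis=0) - harmless_hidden.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit-normalize

# Toy example with random stand-ins for hidden states.
rng = np.random.default_rng(0)
d = refusal_direction(rng.normal(size=(16, 64)), rng.normal(size=(16, 64)))
```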
The code in various forms has been tested on Llama-3.2, Qwen2.5-Coder, Ministral-8b, Mistral-7B-Instruct-v0.2, gemma-3-27b-it, and Mistral-Nemo-Instruct-2407.
VRAM/RAM requirements: This codebase reflects efforts to reduce VRAM usage. You can abliterate any model provided it fits within VRAM. Loading a model in 4-bit precision using bitsandbytes is possible, and recommended for large models when VRAM is limited. It is assumed that there is enough CPU memory to load the bf16 (or full-weight) model; the method for ablating the refusal vector could be enhanced to perform lazy loading in the future to reduce this requirement.
CUDA is assumed to be available. The original abliteration paper and code used TransformerLens, and measured resid_pre, resid_mid, and resid_post. Failspy's code measured resid_pre and resid_post. Sumandora's code, based on Transformers, accesses the equivalent of resid_post via hidden_states.
Note
Abliteration does not guarantee full removal of censorship. An abliterated model is not necessarily completely uncensored; rather, a properly abliterated model should, in theory, no longer explicitly refuse, to the extent that the datasets used for abliteration capture the nature of its refusals.
For an explanation of abliteration, see: https://huggingface.co/blog/mlabonne/abliteration
This repo enables norm-preserving biprojected abliteration. https://huggingface.co/blog/grimjim/norm-preserving-biprojected-abliteration
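The norm-preserving idea can be sketched as follows: project the refusal direction out of a weight matrix's rows, then rescale each row back to its original norm. This is a simplified illustration, not the repository's exact biprojected procedure; see the linked blog post for the full method:

```python
import numpy as np

def ablate_norm_preserving(W, d):
    """Remove each row's component along unit vector d, then rescale
    rows to their original norms (simplified sketch of norm-preserving
    ablation; the repo's biprojected variant differs in detail)."""
    d = d / np.linalg.norm(d)
    orig_norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_ablated = W - np.outer(W @ d, d)            # zero out the d-component
    new_norms = np.linalg.norm(W_ablated, axis=1, keepdims=True)
    return W_ablated * (orig_norms / np.maximum(new_norms, 1e-12))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
d = rng.normal(size=8)
W_new = ablate_norm_preserving(W, d)
```

After this transformation each row is orthogonal to the refusal direction while keeping its original magnitude, which limits collateral damage to the layer's overall scale.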
Removal of the projected contribution during measurement is optional, but the other modifications to this implementation of abliteration are mandatory.
```shell
git clone https://github.com/jim-plus/llm-abliteration.git && cd llm-abliteration
pip install -r requirements.txt
```
The workflow, roughly:
- Measure directions using measure.py, given harmful and harmless prompt datasets
- Analyze directions by layer using analyze.py to determine abliteration strategy
- Craft YAML file to drive ablation
- Ablate model using sharded_ablation.py
- Test resulting model
```shell
python measure.py -m <path_to_your_model> -o <output_file>
```
Carefully curate your prompt datasets to obtain better results. You can explicitly specify prompt dataset files, either as local files or on HuggingFace.
```shell
python measure.py -m <path_to_your_model> -o <output_file> --data-harmful DATA_HARMFUL --data-harmless DATA_HARMLESS
```
For Chinese models, you can also specify --deccp to add certain topics to the "harmful" set to be evaluated.
The measurement script autodetects 4-bit and 8-bit BitsAndBytes models and will attempt to run on them. However, subsequent ablation needs to be performed on full-weight models.
```shell
python analyze.py <measurement_file> -c
```
The -c option displays some helpful charts. Look toward the middle to late-middle layers for good candidate source layers for the refusal direction.
```shell
python sharded_abliteration.py <abliteration_yaml_file>
```
Look at the example YAML file to see how this is structured. YAML was chosen in order to allow more than one source layer for refusal direction measurement, and to allow different strategies to be applied per destination layer.
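As a purely hypothetical illustration of the kind of structure such a file might have (the key names below are invented, not the repository's actual schema; consult the example YAML in the repo):

```yaml
# Hypothetical sketch only -- key names are illustrative, not the repo's schema.
model: /path/to/full-weight-model
output: /path/to/abliterated-model
measurements: measurements.bin
source_layers: [14, 15, 16]     # layers whose refusal directions are used
ablate:
  - layers: 8-24
    strategy: norm_preserving   # per-destination-layer strategy
```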
```shell
python chat.py -m <path_to_your_abliterated_model>
```
Inherited code that needs an update to remain useful.
```shell
python compare.py -a <model_a> -b <model_b>
```
Inherited code that needs an update to remain useful.
You can use your own prompts to abliterate your model. Supported file formats are .txt, .parquet, .json, and .jsonl. Format explanations are below:
- `.txt`: Each line of the file is a prompt
- `.parquet`: A parquet file with column `text`
- `.json`: A JSON file with a list of strings
- `.jsonl`: A JSON Lines file with a list of strings
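A loader for these formats might look roughly like the following. This is an illustrative sketch, not the repository's actual loader, and the `.parquet` branch is omitted since it would additionally need pandas or pyarrow:

```python
import json
from pathlib import Path

def load_prompts(path):
    """Load a prompt list from .txt, .json, or .jsonl (illustrative
    sketch; a .parquet branch would need pandas or pyarrow)."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    if p.suffix == ".txt":
        # one prompt per line, skipping blanks
        return [line for line in text.splitlines() if line.strip()]
    if p.suffix == ".json":
        return json.loads(text)          # file is a JSON list of strings
    if p.suffix == ".jsonl":
        # one JSON-encoded string per line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported prompt format: {p.suffix}")
```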
Then load your own prompts using --data-harmful and --data-harmless arguments during measurement.
Two scripts have been provided to convert between parquet and jsonl formats to assist in dataset customization. Prompts in this repository are for illustrative purposes only, and have mostly been inherited from the fork.
```shell
python measure.py -m <path_to_your_model> -o <output_file> --data-harmful /path/to/my/harmful.txt --data-harmless /path/to/my/harmless.txt
```
If you have limited VRAM, try loading the model as a 4-bit or 8-bit BitsAndBytes quant.