Uncensor Any LLM with Abliteration

The third generation of Llama models excels at understanding and following instructions, but it is heavily censored: it refuses requests it deems harmful with responses like "As an AI assistant, I cannot help you." A technique called "abliteration" can uncensor any LLM without retraining by removing this refusal mechanism. The refusal behavior is mediated by a specific direction in the model's residual stream; once that direction is identified, it can be ablated either at inference time (intervention on activations) or permanently via weight orthogonalization, eliminating the model's ability to refuse. The process involves collecting activations on harmful and harmless instructions, computing the mean difference between them, selecting the best candidate refusal direction, and ablating it. Applying weight orthogonalization uncensors the model, which then responds to prompts it would previously have refused, and the abliterated model can be evaluated against the source model to measure the quality cost of the intervention.
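As a rough illustration of the core computation, here is a minimal sketch of the mean-difference and weight-orthogonalization steps described above. It assumes the residual-stream activations for harmful and harmless prompts have already been collected at a chosen layer and position; the function names, tensor shapes, and the layer chosen for orthogonalization are illustrative, not the blog's exact implementation.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Candidate refusal direction as the normalized difference of means.

    Both inputs are assumed to have shape (n_samples, d_model), holding
    residual-stream activations captured at the same layer and token position.
    """
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()

def orthogonalize_weight(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix that writes to the
    residual stream (e.g. an attention output or MLP down projection).

    W is assumed to have shape (d_model, d_in), so its output lives in the
    residual stream. Subtracting d (d^T W) removes the component of every
    output along the refusal direction, so the layer can no longer write it.
    """
    d = direction / direction.norm()
    return W - torch.outer(d, d @ W)

# Hypothetical usage: harmful_acts / harmless_acts gathered with hooks,
# then every residual-writing weight in the model is orthogonalized in place.
# for layer in model.layers:
#     layer.o_proj.weight.data = orthogonalize_weight(layer.o_proj.weight.data, d)
```

The inference-time variant applies the same projection to activations on the fly instead of baking it into the weights; orthogonalization has the advantage of producing a standalone model that needs no runtime hooks.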

https://huggingface.co/blog/mlabonne/abliteration
