Company: -
Date Published: -
Author: Jim Lai
Word count: 2218
Language: -
Hacker News points: None

Summary

Abliteration is a technique for removing refusal behaviors from language models by identifying a "refusal direction" in activation space, traditionally computed as a single mean difference direction. The article introduces "projected abliteration," which refines this approach by removing only the mechanistically relevant component of the refusal direction. The method decomposes the refusal direction into components parallel and orthogonal to the harmless-acceptance direction and ablates only the orthogonal component, which captures the refusal-specific mechanism. The study also addresses numerical instability and high-magnitude activation outliers in models such as Gemma 3 12B, recommending techniques like Winsorization to preserve model coherence. The findings indicate that refusal mechanisms are robustly distributed across layers, so effective intervention requires editing many layers at once, and that projected abliteration can bypass refusals while leaving the model's encodings of harmfulness and compliance intact.
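The parallel/orthogonal decomposition described above can be sketched with basic vector algebra. This is a minimal illustration, not the article's actual implementation: the function names and the toy vectors are hypothetical, and it assumes the standard directional-ablation setup in which a refusal direction is estimated from mean activations and then projected out of the residual stream.

```python
import numpy as np

def projected_ablation_direction(refusal_dir, acceptance_dir):
    """Decompose the refusal direction into components parallel and
    orthogonal to the harmless-acceptance direction, and return the
    orthogonal (refusal-specific) component as a unit vector."""
    a_hat = acceptance_dir / np.linalg.norm(acceptance_dir)
    # Component overlapping harmless acceptance: keep this part.
    r_parallel = np.dot(refusal_dir, a_hat) * a_hat
    # Refusal-specific component: this is what gets ablated.
    r_orth = refusal_dir - r_parallel
    return r_orth / np.linalg.norm(r_orth)

def ablate(hidden, direction):
    """Remove the component of an activation along a unit direction."""
    return hidden - np.dot(hidden, direction) * direction
```

By construction the returned direction is orthogonal to the acceptance direction, so ablating along it leaves the acceptance-aligned part of the activation untouched; in practice this projection would be applied at many layers, consistent with the article's finding that refusal is distributed across the model.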
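Winsorization, mentioned as a remedy for high-magnitude outliers, amounts to clipping extreme values to a percentile threshold rather than discarding them. A minimal sketch, assuming a simple symmetric clip on activation magnitudes (the percentile choice here is illustrative, not from the article):

```python
import numpy as np

def winsorize(activations, pct=99.9):
    """Clip activation values whose magnitude exceeds the given
    percentile of absolute values, preserving sign. This tames
    high-magnitude outlier channels that destabilize mean-direction
    estimates without zeroing them out entirely."""
    threshold = np.percentile(np.abs(activations), pct)
    return np.clip(activations, -threshold, threshold)
```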