Company: -
Date Published: -
Author: Jim Lai
Word count: 2218
Language: -
Hacker News points: None

Summary

Abliteration is a technique for removing refusal behaviors from language models by identifying a "refusal direction" in activation space, traditionally computed as a single mean difference direction. The article introduces "projected abliteration," which refines this approach by removing only the mechanistically relevant component of the refusal direction. The method decomposes the refusal direction into components parallel and orthogonal to the harmless-acceptance direction and ablates only the orthogonal component, which captures the refusal-specific mechanism. The study also addresses numerical instability and high-magnitude activation outliers in models such as Gemma 3 12B, recommending techniques like Winsorization to preserve model coherence. The findings indicate that refusal mechanisms are robustly distributed across layers, so effective intervention requires editing many layers at once, and that projected abliteration can bypass refusals while leaving the model's encodings of harmfulness and compliance intact.
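The parallel/orthogonal decomposition described above can be sketched with basic vector algebra. This is a minimal illustration, not the article's actual implementation: the function names and the toy vectors are hypothetical, and it assumes the standard directional-ablation setup in which a refusal direction is estimated from mean activations and then projected out of the residual stream.

```python
import numpy as np

def projected_ablation_direction(refusal_dir, acceptance_dir):
    """Decompose the refusal direction into components parallel and
    orthogonal to the harmless-acceptance direction, and return the
    orthogonal (refusal-specific) component as a unit vector."""
    a_hat = acceptance_dir / np.linalg.norm(acceptance_dir)
    # Component overlapping harmless acceptance: keep this part.
    r_parallel = np.dot(refusal_dir, a_hat) * a_hat
    # Refusal-specific component: this is what gets ablated.
    r_orth = refusal_dir - r_parallel
    return r_orth / np.linalg.norm(r_orth)

def ablate(hidden, direction):
    """Remove the component of an activation along a unit direction."""
    return hidden - np.dot(hidden, direction) * direction
```

By construction the returned direction is orthogonal to the acceptance direction, so ablating along it leaves the acceptance-aligned part of the activation untouched; in practice this projection would be applied at many layers, consistent with the article's finding that refusal is distributed across the model.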
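Winsorization, mentioned as a remedy for high-magnitude outliers, amounts to clipping extreme values to a percentile threshold rather than discarding them. A minimal sketch, assuming a simple symmetric clip on activation magnitudes (the percentile choice here is illustrative, not from the article):

```python
import numpy as np

def winsorize(activations, pct=99.9):
    """Clip activation values whose magnitude exceeds the given
    percentile of absolute values, preserving sign. This tames
    high-magnitude outlier channels that destabilize mean-direction
    estimates without zeroing them out entirely."""
    threshold = np.percentile(np.abs(activations), pct)
    return np.clip(activations, -threshold, threshold)
```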