K-Steering: Controlling Multiple Behaviors in Language Models at Once
Blog post from Martian
K-steering is a method for steering a language model toward several behaviors at once. It addresses a limitation of traditional steering methods such as Contrastive Activation Addition (CAA), which struggle to combine multiple behavioral instructions because they rely on the linearity assumption of the Linear Representation Hypothesis (LRH). Where CAA averages the vectors for the desired behaviors, K-steering uses a trained classifier to compute the direction of the shift dynamically in the model's high-dimensional activation space: the classifier's gradient guides how the activations are adjusted. This lets the method account for interactions between behaviors, and it outperforms CAA when steering models such as Llama-3.2-3B and Qwen2-1.5B toward multiple behaviors simultaneously. Challenges remain. Steering more than two behaviors effectively is still difficult, and the causal mechanism behind the steering vectors is not well understood. Whether K-steering remains effective at larger scales and on complex, real-world datasets is an open question, as is the underlying geometry of model behavior that the method navigates.
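To make the contrast concrete, here is a minimal sketch in plain Python that puts the two ideas side by side. It is a toy illustration under stated assumptions, not the implementation from the post. The function names (`caa_vector`, `average_vectors`, `k_steer`), the one-hidden-layer tanh classifier, its hand-picked weights, the loss (the negative sum of the target-behavior logits), and the step size `alpha` are all hypothetical choices made for the example. In a real setup the activations would come from a model layer and the classifier would be trained on labeled activations.

```python
import math

def caa_vector(pos_acts, neg_acts):
    """CAA-style vector: mean activation with the behavior minus mean without it."""
    dim = len(pos_acts[0])
    mean_pos = [sum(a[i] for a in pos_acts) / len(pos_acts) for i in range(dim)]
    mean_neg = [sum(a[i] for a in neg_acts) / len(neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]

def average_vectors(vectors):
    """Combine per-behavior vectors by averaging (assumes behaviors add linearly)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classifier_logits(x, W1, b1, W2, b2):
    """Toy non-linear classifier: one tanh hidden layer, one logit per behavior."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    logits = [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)]
    return logits, h

def steering_gradient(x, targets, W1, b1, W2, b2):
    """Gradient w.r.t. x of the assumed loss: -(sum of target-behavior logits)."""
    _, h = classifier_logits(x, W1, b1, W2, b2)
    # dLoss/dh_j = -(sum over target behaviors k of W2[k][j])
    grad_h = [-sum(W2[k][j] for k in targets) for j in range(len(h))]
    # Back through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    grad_pre = [g * (1.0 - hj * hj) for g, hj in zip(grad_h, h)]
    # Back through the first linear layer: W1^T applied to grad_pre
    return [sum(W1[j][i] * grad_pre[j] for j in range(len(h)))
            for i in range(len(x))]

def k_steer(x, targets, params, alpha=0.1, steps=1):
    """Shift the activation x down the loss gradient, recomputed at each step."""
    for _ in range(steps):
        grad = steering_gradient(x, targets, *params)
        x = [xi - alpha * gi for xi, gi in zip(x, grad)]
    return x

# Hand-picked toy weights (hypothetical; a real classifier would be trained).
W1 = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.6]]   # hidden x activation_dim
b1 = [0.0, 0.1]
W2 = [[1.0, -0.5], [0.3, 0.8]]              # behaviors x hidden
b2 = [0.0, 0.0]
params = (W1, b1, W2, b2)

x = [0.2, -0.1, 0.4]                        # stand-in for one activation vector
before, _ = classifier_logits(x, *params)
steered = k_steer(x, targets=[0, 1], params=params)
after, _ = classifier_logits(steered, *params)
print("target logit sum before:", sum(before))
print("target logit sum after: ", sum(after))
```

The point of the sketch is the difference in where the direction comes from. `average_vectors` produces one fixed direction regardless of the current activation, while `k_steer` recomputes the direction from the classifier's gradient at the current point, so the shift for two behaviors together need not equal the average of the shifts for each behavior alone.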