From Golden Gate Bridge to Broken JSON: Why Anthropic's SAE Steering Fails for Structured Output
Blog post from Hugging Face
Maziyar Panahi's article chronicles six experiments aimed at getting language models to emit valid JSON through activation steering, a technique that looked promising because it can change a model's semantic behavior, such as safety and bias, without retraining. On a syntactic task like JSON generation it failed badly: steering reduced the valid JSON rate from 86.8% to 24.4%. Panahi found that activation steering handles semantic tasks well because it manipulates continuous features, but falters on binary syntactic constraints, which require discrete state management. What worked instead was constrained decoding: a finite state machine (FSM) enforces JSON syntax during token generation, which yields 100% valid JSON output. The takeaway is to choose the technique by task type: activation steering for semantic behavior, structural enforcement for syntax.
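To make the idea of FSM-constrained decoding concrete, here is a minimal sketch. It is not Panahi's implementation. The vocabulary, the `fake_logits` stand-in for a model, the `TRANSITIONS` table, and the `max_pairs` budget are all assumptions made for illustration. The toy grammar covers only flat JSON objects with string keys and string or number values. At each step the decoder looks up which tokens the current state allows and takes the highest-scoring token among those only, so an invalid token can never be selected.

```python
import json
import random

# Toy vocabulary where each token is a whole lexeme (an assumption for
# brevity; real tokenizers split text into subword pieces).
VOCAB = ["{", "}", '"a"', '"b"', ":", ",", "1", "2"]
KEYS = ['"a"', '"b"']
VALUES = ['"a"', '"b"', "1", "2"]

# Hypothetical FSM for a flat JSON object: state -> {allowed token: next state}.
TRANSITIONS = {
    "start": {"{": "key_or_close"},
    "key_or_close": {**{k: "colon" for k in KEYS}, "}": "done"},
    "key": {k: "colon" for k in KEYS},
    "colon": {":": "value"},
    "value": {v: "comma_or_close" for v in VALUES},
    "comma_or_close": {",": "key", "}": "done"},
}


def fake_logits(rng):
    """Stand-in for a language model: one random score per vocabulary token."""
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]


def constrained_decode(rng, max_pairs=4):
    """Greedy decoding where the FSM masks every token that would break syntax."""
    state, out, pairs = "start", [], 0
    while state != "done":
        allowed = dict(TRANSITIONS[state])
        if state == "comma_or_close" and pairs >= max_pairs:
            # Length budget reached: drop "," so the object is forced to close.
            allowed.pop(",")
        logits = fake_logits(rng)
        # Only tokens the FSM allows in this state compete for the argmax.
        best = max(allowed, key=lambda tok: logits[VOCAB.index(tok)])
        out.append(best)
        if state == "value":
            pairs += 1
        state = allowed[best]
    return "".join(out)


if __name__ == "__main__":
    text = constrained_decode(random.Random(0))
    print(text)
    json.loads(text)  # raises ValueError if the output were not valid JSON
```

Because validity is enforced by the transition table rather than learned by the model, every output parses no matter what scores the model produces. Two caveats apply to this sketch: arbitrarily nested JSON is not a regular language, so a real decoder needs a stack or a bounded nesting depth, and a production decoder has to map the grammar onto subword tokens rather than whole lexemes.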