
From Golden Gate Bridge to Broken JSON: Why Anthropic's SAE Steering Fails for Structured Output

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Maziyar Panahi
Word Count: 5,766
Language: -
Hacker News Points: -
Summary

Maziyar Panahi's article walks through six experiments that tried to use activation steering to make language models emit valid JSON. The technique looked promising because it had already changed semantic behaviors, such as safety and bias, without retraining. For a syntactic task like JSON generation it failed badly: the valid JSON rate fell from 86.8% to 24.4%. Panahi's explanation is that activation steering works by continuously adjusting features, which suits semantic tasks, whereas syntactic constraints are binary and require tracking discrete state. What worked instead was constrained decoding: a finite state machine (FSM) enforces JSON syntax during token generation, which yielded 100% valid JSON output. The takeaway is to choose the technique by task type: activation steering for semantic goals, structural enforcement for syntactic ones.
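
To make the constrained-decoding idea concrete, here is a minimal Python sketch. It is not the article's implementation. It assumes a toy vocabulary, a stand-in "model" that returns random scores, and an FSM that accepts only flat JSON objects (nested JSON would need a stack on top of the state machine). All names (`VOCAB`, `build_fsm`, `fake_logits`, `constrained_decode`, `soft_limit`) are hypothetical and chosen for illustration.

```python
import json
import random

# Toy vocabulary: each entry is one "token" the hypothetical model can emit.
# "Sure!" and "here" stand in for chatty tokens that would break JSON.
VOCAB = ["{", "}", ":", ",", '"name"', '"age"', '"Ada"', "42", "Sure!", "here"]
KEYS = {'"name"', '"age"'}
VALUES = {'"Ada"', "42"}


def build_fsm():
    """Return state -> {allowed token: next state} for a flat JSON object."""
    fsm = {
        "START": {"{": "KEY_OR_END"},
        "KEY_OR_END": {"}": "DONE"},
        "KEY": {},
        "COLON": {":": "VALUE"},
        "VALUE": {},
        "COMMA_OR_END": {",": "KEY", "}": "DONE"},
    }
    for key in KEYS:
        fsm["KEY_OR_END"][key] = "COLON"
        fsm["KEY"][key] = "COLON"
    for value in VALUES:
        fsm["VALUE"][value] = "COMMA_OR_END"
    return fsm


def fake_logits(rng):
    """Stand-in for a language model: one random score per vocabulary token."""
    return [rng.uniform(-1.0, 1.0) for _ in VOCAB]


def constrained_decode(seed, soft_limit=8):
    """Greedy decoding where tokens the FSM disallows are masked out."""
    rng = random.Random(seed)
    fsm = build_fsm()
    state, out = "START", []
    while state != "DONE":
        allowed = fsm[state]
        # Past the soft limit, force the closing brace as soon as it is legal
        # so that generation always terminates.
        if len(out) >= soft_limit and "}" in allowed:
            allowed = {"}": "DONE"}
        logits = fake_logits(rng)
        # Pick the highest-scoring token among those the FSM permits.
        best = max(
            (i for i, tok in enumerate(VOCAB) if tok in allowed),
            key=lambda i: logits[i],
        )
        token = VOCAB[best]
        out.append(token)
        state = allowed[token]
    return "".join(out)


if __name__ == "__main__":
    text = constrained_decode(seed=0)
    print(text, "->", json.loads(text))
```

Because every emitted token must be a legal transition from the current state, the output parses regardless of what the scores prefer. That is the sense in which structural enforcement guarantees validity, whereas nudging activations can only shift probabilities.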