Research Highlight: Guardian Loop
Blog post from Martian
Guardian Loop, a project by Efstathios Siatras and Man Kit Chan showcased at the Apart x Martian Mechanistic Router Interpretability Hackathon, develops mechanistically interpretable classifiers that pre-filter prompts for safety before they reach a model. The team fine-tuned Llama 3.1 8B Instruct on adversarial datasets to build a Safety Judge classifier, which achieved 85% accuracy and an AUC-ROC of 94.6%. The project applies mechanistic interpretability techniques to map neuron and layer activations, offering insight into why a specific prompt is judged unsafe. Planned future work includes a Feasibility Judge, which would assess whether a model is likely to respond accurately to a prompt, and an Open-Ended Adversarial Framework for discovering new adversarial prompts. Martian supports this research because it aligns with the company's goals of improving AI safety, reducing cost, and understanding intelligence, and it is integrating the findings into its model routing products.
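To make the two reported metrics concrete, here is a minimal sketch of how accuracy and AUC-ROC can be computed for a binary safety classifier, and how a score threshold turns such a classifier into a pre-filter. It is an illustration only, not the Guardian Loop implementation: the function names, the 0.5 threshold, the `block`/`forward` labels, and the toy data are all assumptions made for this example.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of prompts whose thresholded score matches the label (1 = unsafe)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random unsafe prompt scores higher
    than a random safe prompt, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def prefilter(score: float, threshold: float = 0.5) -> str:
    """Hypothetical routing decision: block prompts scored as unsafe."""
    return "block" if score >= threshold else "forward"


# Toy data (invented for illustration): 1 = unsafe, 0 = safe.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

print("accuracy:", accuracy(toy_labels, toy_scores))
print("AUC-ROC:", auc_roc(toy_labels, toy_scores))
print("decision for score 0.8:", prefilter(0.8))
```

Accuracy depends on the chosen threshold, while AUC-ROC summarizes ranking quality across all thresholds, which is why a classifier can report a noticeably higher AUC-ROC than accuracy.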