Research Highlight: Guardian Loop

Blog post from Martian

Post Details
Company: Martian
Author: -
Word Count: 1,218
Language: English
Hacker News Points: -
Summary

The Guardian Loop project, led by Efstathios Siatras and Man Kit Chan, was showcased at the Apart x Martian Mechanistic Router Interpretability Hackathon. It develops mechanistically interpretable classifiers that pre-filter prompts for safety. The team's Safety Judge classifier, built on Llama 3.1 8B Instruct and fine-tuned on adversarial datasets, achieved promising metrics, including 85% accuracy and a 94.6% AUC-ROC score. The project uses mechanistic interpretability techniques to map neuron and layer activations, which offers insight into why specific prompts are deemed unsafe. Planned future work includes a Feasibility Judge for assessing the accuracy of a model's responses and an Open-Ended Adversarial Framework for discovering new adversarial prompts. Martian supports this research because it aligns with the company's goals of enhancing AI safety, reducing cost, and understanding intelligence, and it is integrating the findings into its model routing products.
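
To make the reported metrics concrete, the sketch below shows how accuracy and AUC-ROC can be computed from a binary safety classifier's scores, and how such a score could gate a prompt before it is forwarded. It is illustrative only and is not the Guardian Loop implementation: the function names, the 0.5 threshold, the `judge` callable, and the toy scores and labels are all assumptions made for this example.

```python
from typing import Callable, Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """Fraction of prompts whose thresholded score matches the label (1 = unsafe)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random unsafe prompt outscores a random safe one.

    Computed by direct pairwise comparison; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def should_forward(prompt: str, judge: Callable[[str], float], threshold: float = 0.5) -> bool:
    """Pre-filter step: forward the prompt only if the judge's unsafe score is below the threshold.

    `judge` is a hypothetical callable standing in for a fine-tuned safety classifier.
    """
    return judge(prompt) < threshold


# Toy data invented for this example (not results from the project).
TOY_SCORES = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(TOY_SCORES, TOY_LABELS):.3f}")
    print(f"AUC-ROC  = {auc_roc(TOY_SCORES, TOY_LABELS):.3f}")
```

Accuracy depends on the chosen threshold, while AUC-ROC summarizes ranking quality across all thresholds, which is why the two figures are typically reported together.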