Research Highlight: Guardian Loop

Post Details

Company

Martian

Date Published

June 13, 2026

Author

-

Word Count

1,218

Language

English

Hacker News Points

-

Source URL

withmartian.com/post/guardian-loop

Summary

The Guardian Loop project, presented at the Apart x Martian Mechanistic Router Interpretability Hackathon and led by Efstathios Siatras and Man Kit Chan, develops mechanistically interpretable classifiers that pre-filter prompts for safety. Its Safety Judge classifier, built on Llama 3.1 8B Instruct and fine-tuned on adversarial datasets, reached 85% accuracy and a 94.6% AUC-ROC score. The project uses mechanistic interpretability techniques to map neuron and layer activations, offering insight into why specific prompts are flagged as unsafe. Future work includes a Feasibility Judge for assessing a model's response accuracy and an Open-Ended Adversarial Framework for discovering new adversarial prompts. Martian supports this research, which aligns with its goals of improving AI safety, reducing cost, and understanding intelligence, and is integrating the findings into its model routing products.
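
For readers less familiar with the two reported metrics, the following is a minimal sketch of how accuracy and AUC-ROC could be computed for a binary safe/unsafe classifier, together with a toy pre-filter gate. It is not the project's code: the function names, the 0.5 threshold, and the toy data are all assumptions made for illustration, and the numbers it produces are unrelated to the reported 85% and 94.6%.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of prompts whose thresholded score matches the label (1 = unsafe)."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random unsafe prompt outscores a random safe one.

    Uses the pairwise (Mann-Whitney) definition; ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def prefilter(unsafe_score: float, threshold: float = 0.5) -> str:
    """Hypothetical gate: block a prompt before it reaches the main model."""
    return "block" if unsafe_score >= threshold else "forward"


# Made-up toy data for illustration only, not results from the project.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = auc_roc(toy_labels, toy_scores)
```

Accuracy depends on the chosen threshold, while AUC-ROC summarizes ranking quality across all thresholds, which is why a classifier can report a noticeably higher AUC-ROC than accuracy.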