AI Safety Grant Update: Purging Corrupted Capabilities across Language Models
Blog post from Martian
A team funded by Martian's AI safety grant has developed a technique for transferring safety behaviors from one language model to another. Because an intervention developed on one model can be reused on others, the approach reduces the need to analyze each model individually, which could save considerable computational resources. The research focuses on scaling mechanistic interpretability techniques and uses steering vectors to mitigate undesirable behaviors in large language models (LLMs). The work is detailed in a report on LessWrong. Martian is also looking for people interested in contributing to projects like this one.
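For readers unfamiliar with steering vectors, here is a minimal sketch of the general idea. The post does not say how the team computes or applies its vectors, so this uses a common difference-of-means construction as an assumption. The function names (`mean_vector`, `steering_vector`, `apply_steering`) and the toy two-dimensional activations are hypothetical, and the sketch does not attempt the cross-model transfer step.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(desired: List[Vector], undesired: List[Vector]) -> Vector:
    """Difference of means: points from undesired toward desired behavior.

    Assumption: activations come from the same layer of the same model.
    """
    mu_desired = mean_vector(desired)
    mu_undesired = mean_vector(undesired)
    return [d - u for d, u in zip(mu_desired, mu_undesired)]


def apply_steering(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift one hidden state along the steering direction by strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


# Toy activations (hypothetical values, not taken from any real model).
desired_acts = [[1.0, 0.0], [3.0, 0.0]]
undesired_acts = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(desired_acts, undesired_acts)
steered = apply_steering([0.0, 2.0], direction, alpha=0.5)
```

In a real model, the hidden state would be a layer's residual-stream activation and the shift would be applied during the forward pass. The strength `alpha` is a tuning knob: too small has little effect, while too large can degrade unrelated behavior.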