Content Deep Dive

AI Red Teams and Adversarial Data Labeling with Redwood Research

Blog post from Surge AI

Post Details
Company: Surge AI
Date Published: -
Author: -
Word Count: 1,484
Language: English
Hacker News Points: -
Summary

Surge AI, in collaboration with Redwood Research, is developing adversarial evaluation methodologies to ensure AI models align with human values and do not pose existential threats. Their first project is a classifier that detects violent text with very high reliability, built with the help of an AI "red team" of creative human labelers who devise new strategies to trick the model and thereby improve its robustness. The labelers are trained to understand the nuances of what constitutes violence and use tactics such as logical misdirection and metaphorical language to slip past the model's detection. The insights gained feed back into iterative improvements to the model, with the broader aim of contributing to the machine learning community's work on safety and alignment challenges. Surge AI is also exploring how violence filters affect text-generation models and is involved in other projects evaluating the capabilities of frontier models in real-world applications.
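
To make the described red-team workflow concrete, here is a minimal, hypothetical sketch in Python: human labelers submit candidate snippets with gold labels, and any attempt the classifier misjudges is collected as a successful attack for the next training round. The `classify_violence` function, the `RedTeamAttempt` structure, and the example texts are illustrative placeholders, not Surge AI's or Redwood Research's actual model, API, or data.

```python
# Sketch of an adversarial red-team loop for a violence classifier.
# Everything here is a hypothetical stand-in for illustration only.
from dataclasses import dataclass


@dataclass
class RedTeamAttempt:
    text: str                    # snippet written by a human labeler
    labeler_says_violent: bool   # gold label from the trained labeler


def classify_violence(text: str) -> float:
    """Stand-in for the model under test; returns an estimated P(violent)."""
    # Toy keyword heuristic purely for illustration.
    keywords = ("stab", "shoot", "kill")
    return 0.9 if any(k in text.lower() for k in keywords) else 0.1


def collect_adversarial_examples(attempts, threshold=0.5):
    """Keep attempts where the model disagrees with the human label.

    These successful attacks (e.g. metaphorical or misdirected phrasing
    the model misses) become training data for the next iteration.
    """
    successes = []
    for attempt in attempts:
        model_says_violent = classify_violence(attempt.text) >= threshold
        if model_says_violent != attempt.labeler_says_violent:
            successes.append(attempt)
    return successes


if __name__ == "__main__":
    attempts = [
        RedTeamAttempt("He raised the knife and the story ended there.", True),
        RedTeamAttempt("The chess player destroyed her opponent.", False),
    ]
    for hit in collect_adversarial_examples(attempts):
        print("Model fooled by:", hit.text)
```

In the workflow the post describes, the collected misclassifications would be labeled, folded back into the training set, and the red team would then probe the retrained model again; the sketch above only shows the filtering step of that loop.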