Content Deep Dive

AI Red Teams and Adversarial Data Labeling with Redwood Research

Blog post from Surge AI

Post Details
Company: Surge AI
Date Published: -
Author: -
Word Count: 1,484
Language: English
Hacker News Points: -
Summary

Surge AI, in collaboration with Redwood Research, is developing adversarial evaluation methodologies to ensure AI models align with human values and do not pose existential threats. Their first project is a classifier that detects violent text with very high reliability, built with the help of an AI "red team" of creative human labelers who devise new strategies to trick the model and thereby improve its robustness. The labelers are trained to understand the nuances of what constitutes violence and use tactics such as logical misdirection and metaphorical language to slip past the model's detection. The insights gained feed back into iterative improvements to the model, with the broader aim of contributing to the machine learning community's work on safety and alignment challenges. Surge AI is also exploring how violence filters affect text-generation models and is involved in other projects evaluating the capabilities of frontier models in real-world applications.
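
To make the described red-team workflow concrete, here is a minimal, hypothetical sketch in Python: human labelers submit candidate snippets with gold labels, and any attempt the classifier misjudges is collected as a successful attack for the next training round. The `classify_violence` function, the `RedTeamAttempt` structure, and the example texts are illustrative placeholders, not Surge AI's or Redwood Research's actual model, API, or data.

```python
# Sketch of an adversarial red-team loop for a violence classifier.
# Everything here is a hypothetical stand-in for illustration only.
from dataclasses import dataclass


@dataclass
class RedTeamAttempt:
    text: str                    # snippet written by a human labeler
    labeler_says_violent: bool   # gold label from the trained labeler


def classify_violence(text: str) -> float:
    """Stand-in for the model under test; returns an estimated P(violent)."""
    # Toy keyword heuristic purely for illustration.
    keywords = ("stab", "shoot", "kill")
    return 0.9 if any(k in text.lower() for k in keywords) else 0.1


def collect_adversarial_examples(attempts, threshold=0.5):
    """Keep attempts where the model disagrees with the human label.

    These successful attacks (e.g. metaphorical or misdirected phrasing
    the model misses) become training data for the next iteration.
    """
    successes = []
    for attempt in attempts:
        model_says_violent = classify_violence(attempt.text) >= threshold
        if model_says_violent != attempt.labeler_says_violent:
            successes.append(attempt)
    return successes


if __name__ == "__main__":
    attempts = [
        RedTeamAttempt("He raised the knife and the story ended there.", True),
        RedTeamAttempt("The chess player destroyed her opponent.", False),
    ]
    for hit in collect_adversarial_examples(attempts):
        print("Model fooled by:", hit.text)
```

In the workflow the post describes, the collected misclassifications would be labeled, folded back into the training set, and the red team would then probe the retrained model again; the sketch above only shows the filtering step of that loop.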