Company
Date Published
Author
Ian Webster
Word count
1196
Language
English
Hacker News points
None

Summary

The text discusses the automation of discovering jailbreaks in image models like OpenAI's Dall-E, enabling the generation of violent and disturbing images despite built-in safety measures. Using a process adapted from TAP, an Attacker-Judge reasoning loop modifies prompts to bypass the system's filters. The post provides examples of such jailbreaks across categories like violence, crime, harm, abuse, terrorism, massacres, accidents, and disasters, illustrating the potential for creating graphic content. It outlines a method using the promptfoo CLI tool to replicate these jailbreaks, which involves initializing a project, setting an OpenAI API key, and running evaluations to view jailbreaks through a web interface. The text notes the current method is simplified for speed and cost efficiency, with improvements anticipated in future OpenAI models to better prevent such jailbreaks.