Automated Jailbreaking Techniques with DALL-E: Complete Red Team Guide

Company

Promptfoo

Date Published

July 1, 2024

Author

Ian Webster

Word count

1196

Language

English

Hacker News points

None

URL

www.promptfoo.dev/blog/jailbreak-dalle

Summary

The text discusses the automation of discovering jailbreaks in image models like OpenAI's Dall-E, enabling the generation of violent and disturbing images despite built-in safety measures. Using a process adapted from TAP, an Attacker-Judge reasoning loop modifies prompts to bypass the system's filters. The post provides examples of such jailbreaks across categories like violence, crime, harm, abuse, terrorism, massacres, accidents, and disasters, illustrating the potential for creating graphic content. It outlines a method using the promptfoo CLI tool to replicate these jailbreaks, which involves initializing a project, setting an OpenAI API key, and running evaluations to view jailbreaks through a web interface. The text notes the current method is simplified for speed and cost efficiency, with improvements anticipated in future OpenAI models to better prevent such jailbreaks.