
Adversarial ML: Extortion via LLM Manipulation Tactics

Blog post from Sublime Security

Post Details
Author: Threat Detection Team
Word Count: 572
Language: English
Summary

Sublime's Attack Spotlight series highlights real-world email threats, and this post focuses on an extortion attempt designed to bypass language-model-based phishing detectors. The attack used a novel text injection technique: the attacker embedded prompt injections such as "IGNORE EVERYTHING ELSE" in the email body, attempting to manipulate large language models (LLMs) into disregarding the malicious content, focusing on innocuous details, and classifying the message as legitimate. This tactic reflects a sophisticated understanding of LLMs' instruction-following tendencies, in line with other prompt injection techniques aimed at subverting security systems. Sublime detected the attack through a combination of signals, including extortion language, cryptocurrency references, and Cyrillic characters, and blocked it via a defense-in-depth approach anchored by a BERT-based Natural Language Understanding (NLU) model, which, unlike an instruction-following LLM, is not susceptible to this kind of instruction manipulation.
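The defense-in-depth idea described above can be sketched in a few lines: instead of trusting a single instruction-following model's verdict, combine several independent signals (injection phrases, extortion language, cryptocurrency references, Cyrillic characters) and flag the message when enough of them fire. This is a minimal illustrative sketch, not Sublime's actual detection logic; all pattern lists, function names, and the threshold are assumptions.

```python
import re

# Hypothetical signal lists for illustration only -- real detectors would be
# far more comprehensive and tuned against false positives.
INJECTION_PATTERNS = [
    r"\bIGNORE (EVERYTHING|ALL) (ELSE|PREVIOUS)\b",
    r"\bDISREGARD (PRIOR|PREVIOUS) INSTRUCTIONS\b",
]
EXTORTION_TERMS = ["ransom", "payment", "expose", "compromising"]
CRYPTO_TERMS = ["bitcoin", "btc", "wallet"]


def has_cyrillic(text: str) -> bool:
    # Any character in the Cyrillic Unicode block U+0400..U+04FF.
    return any("\u0400" <= ch <= "\u04ff" for ch in text)


def score_email(body: str) -> dict:
    """Return independent boolean signals plus an overall verdict."""
    lower = body.lower()
    signals = {
        "prompt_injection": any(
            re.search(p, body, re.IGNORECASE) for p in INJECTION_PATTERNS
        ),
        "extortion_language": any(t in lower for t in EXTORTION_TERMS),
        "crypto_reference": any(t in lower for t in CRYPTO_TERMS),
        "cyrillic_chars": has_cyrillic(body),
    }
    # Defense in depth: require multiple independent signals, so a single
    # manipulated signal (e.g. an LLM tricked by injected instructions)
    # cannot flip the verdict on its own. Threshold of 2 is arbitrary here.
    signals["flagged"] = sum(signals.values()) >= 2
    return signals
```

A signal-combination approach like this degrades gracefully: even if the injection text successfully steers an LLM judge, the extortion wording, wallet address, and character-set anomalies remain visible to simpler, non-instruction-following detectors.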