In this paper, the authors emphasize the importance of safety in large language models (LLMs) and highlight the vulnerabilities that arise when current safety techniques rely solely on the semantics of text for alignment. They introduce ArtPrompt, a novel ASCII art-based jailbreak attack, to demonstrate that LLMs struggle to recognize prompts whose meaning is conveyed through ASCII art rather than ordinary text. The attack exploits this blind spot to bypass safety measures and elicit undesired behaviors, requiring only black-box access to the victim model. Surprisingly, even state-of-the-art LLMs such as GPT-3.5 and GPT-4 are susceptible to this type of attack, making it a significant threat to their security.
https://arxiv.org/abs/2402.11753
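
The core mechanism is to mask a word that would trigger safety filters and re-encode it as ASCII art that the model must visually decode. Below is a minimal sketch of that cloaking step, not the paper's actual implementation: it assumes the `pyfiglet` library for rendering, and the template, preamble wording, and the benign placeholder word "cake" are illustrative choices, not taken from the paper.

```python
# Minimal sketch of an ArtPrompt-style cloaking step (illustrative only).
# Assumes pyfiglet is installed; the paper uses its own ASCII-art fonts
# and masking procedure, which differ from this simplified version.
import pyfiglet


def cloak_prompt(instruction_template: str, masked_word: str, font: str = "standard") -> str:
    """Replace the [MASK] slot in a template with an ASCII-art rendering of a word."""
    ascii_art = pyfiglet.figlet_format(masked_word, font=font)
    preamble = (
        "The following ASCII art spells a single word. "
        "Decode it, substitute it for [MASK], and then follow the instruction.\n\n"
    )
    return preamble + ascii_art + "\n" + instruction_template


if __name__ == "__main__":
    # Benign example: the masked word is "cake".
    print(cloak_prompt("Write a recipe for a [MASK].", "cake"))
```

Because the sensitive word never appears as plain text, a safety filter that keys on token semantics alone may not flag the prompt, which is the weakness the paper highlights.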