In this paper, the authors highlight the susceptibility of widely used large language models (LLMs) such as GPT, Llama, and Claude to jailbreaking attacks, in which adversaries manipulate the model into producing objectionable content. To combat this issue, they introduce SmoothLLM, the first algorithm specifically designed to defend LLMs against jailbreaking attacks. Exploiting the observation that adversarially generated prompts are brittle to character-level changes, SmoothLLM randomly perturbs multiple copies of each input prompt and aggregates the corresponding predictions, outperforming other defenses against a range of jailbreak techniques. The research also reveals a trade-off between robustness and nominal performance, but overall SmoothLLM proves effective and adaptable across different LLMs. The public release of the code makes the defense easy to adopt and apply.
https://arxiv.org/abs/2310.03684
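
To make the perturb-and-aggregate idea concrete, here is a minimal Python sketch of the defense's structure. It is not the authors' implementation: `query_llm` is an assumed callable mapping a prompt to a model response, the refusal-phrase list is illustrative rather than the paper's exact list, only character swaps are shown (the paper also considers random insertions and patches), and the defaults for the number of copies and the perturbation rate are placeholders.

```python
import random
import string

# Illustrative stand-in for a jailbreak detector: a response is treated as
# "jailbroken" if it contains no standard refusal phrase. The phrase list
# below is an assumption, not the paper's exact list.
REFUSAL_PHRASES = ["I'm sorry", "I cannot", "I apologize", "As an AI"]

def is_jailbroken(response: str) -> bool:
    return not any(phrase in response for phrase in REFUSAL_PHRASES)

def perturb(prompt: str, q: float) -> str:
    """Randomly swap a fraction q of the prompt's characters.

    This is one of the perturbation types discussed in the paper; the key
    point is that character-level noise tends to break adversarial suffixes
    while leaving benign prompts largely intelligible.
    """
    chars = list(prompt)
    n_swap = int(len(chars) * q)
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smooth_llm(prompt: str, query_llm, n_copies: int = 10, q: float = 0.1) -> str:
    """Query the model on n_copies perturbed prompts and majority-vote.

    query_llm is a hypothetical callable (prompt -> response). Parameter
    defaults are placeholders, not the paper's recommended settings.
    """
    responses = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority_jailbroken = sum(flags) > len(flags) / 2
    # Return one response whose jailbreak flag agrees with the majority vote,
    # so the final output reflects the aggregated decision.
    for response, flag in zip(responses, flags):
        if flag == majority_jailbroken:
            return response
```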