In their recent work, Carlini et al. introduce a model-stealing attack that extracts precise internal details from production black-box language models such as OpenAI’s ChatGPT or Google’s PaLM-2 through their public APIs. They demonstrate that for under $20 in queries, they were able to extract the entire embedding projection matrix of OpenAI’s Ada and Babbage models, confirming hidden dimensions of 1024 and 2048, respectively. Notably, they also recovered the exact hidden dimension size of the gpt-3.5-turbo model and estimated that extracting its entire projection matrix would cost under $2,000 in queries. The authors conclude with potential defenses and mitigations against such attacks and highlight the need for further research in this area.
https://arxiv.org/abs/2403.06634
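
The key observation behind the attack is that every logit vector a transformer returns is the product of its final projection matrix with a hidden state, so all observed logit vectors lie in a subspace whose dimension equals the model’s hidden size. The toy sketch below (my own illustration, not the authors’ code; the matrix sizes and threshold are arbitrary assumptions) simulates this: it stacks many synthetic logit vectors and reads the hidden dimension off their numerical rank via an SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 5000, 64  # toy sizes, far smaller than real models
W = rng.standard_normal((vocab_size, hidden_dim))  # final projection matrix

# Simulate API queries: each prompt yields a hidden state, and the
# observed logits are W @ h, so every column of `logits` lies in the
# column space of W (dimension hidden_dim).
n_queries = 128  # must exceed hidden_dim for the rank to saturate
H = rng.standard_normal((hidden_dim, n_queries))
logits = W @ H

# The numerical rank of the stacked logit vectors reveals hidden_dim:
# singular values beyond the first hidden_dim collapse to noise level.
s = np.linalg.svd(logits, compute_uv=False)
recovered_dim = int((s > 1e-6 * s[0]).sum())
print(recovered_dim)  # 64
```

In the real attack the hidden states are not observable, but that does not matter: only the span of the returned logit vectors is needed, which the attacker collects by querying the model on many different prompts.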