In this paper, the authors examine how the behavior of two popular language models, GPT-3.5 and GPT-4, changed between their March 2023 and June 2023 versions across a range of tasks. They found that performance can vary substantially over time: GPT-4 identified prime numbers with high accuracy in March 2023 but performed poorly on the same task in June 2023, while GPT-3.5 improved on math problems over the same period. Both models also became less willing to answer sensitive questions in June and made more formatting errors in code generation. These findings highlight the importance of continuously monitoring the quality of deployed language models, since their behavior can drift even without an announced update.
https://arxiv.org/abs/2307.09009
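The monitoring the paper calls for can be approximated with a fixed benchmark that is re-run against each dated model snapshot and scored against ground truth. The sketch below uses the primality task as an example; `query_model` is a hypothetical stand-in for a real API call to a model snapshot (here stubbed with a model that answers "Yes" for every odd number, so the harness is self-contained and runnable).

```python
# Minimal sketch of longitudinal model monitoring on a fixed benchmark.
# Assumption: `query_model` is hypothetical; swap in a real API request
# to a dated model snapshot to monitor an actual deployment.

def is_prime(n: int) -> bool:
    """Ground-truth primality check used to score model answers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def query_model(snapshot: str, n: int) -> str:
    """Hypothetical model call (stub): answers "Yes" for every odd number,
    mimicking a degenerate heuristic a drifted snapshot might adopt."""
    return "Yes" if n % 2 == 1 else "No"

def evaluate(snapshot: str, benchmark: list[int]) -> float:
    """Accuracy of one snapshot on a fixed set of primality questions."""
    correct = 0
    for n in benchmark:
        answer = query_model(snapshot, n)
        if (answer == "Yes") == is_prime(n):
            correct += 1
    return correct / len(benchmark)

if __name__ == "__main__":
    # Re-running the same benchmark against each snapshot makes drift visible
    # as a change in the accuracy number over time.
    benchmark = [7, 9, 11, 15, 17, 21, 23, 25]
    for snapshot in ("model-2023-03", "model-2023-06"):
        print(f"{snapshot}: accuracy = {evaluate(snapshot, benchmark):.2f}")
```

Because the benchmark and scoring function are frozen, any change in the reported accuracy between snapshots reflects model drift rather than evaluation noise.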