Are large language models a threat to digital public goods?

Large language models like ChatGPT have become popular tools for obtaining information and solving problems, potentially replacing the need to search the web or ask others for help. This convenience comes at a cost, however: by interacting privately with these models, users reduce the supply of publicly accessible, human-generated data and knowledge, which poses a significant challenge for securing training data for future models.

To understand ChatGPT's impact on human-generated open data, researchers analyzed activity on Stack Overflow, a leading Q&A platform for programming. The findings show a significant decrease in activity on Stack Overflow relative to comparable forums for Russian- and Chinese-language programming and for mathematics, domains where ChatGPT is less accessible or less capable. The decrease grows more pronounced over time and is especially visible in posts about popular programming languages. Notably, posts made after ChatGPT's release receive voting scores similar to those made before, indicating that the model is not simply displacing low-quality or duplicate content. These results suggest that more users are turning to large language models as a substitute for Stack Overflow, especially for languages on which ChatGPT has more training data. While this may offer more efficient solutions to programming problems, it also marks a shift away from public exchange on the web.
