AI Has Created a Battle over Web Crawling

Generative AI relies on massive datasets scraped from public sources such as blogs, videos, and Reddit to train models. A recent report, however, documents a growing crisis: more and more websites are using robots.txt to shut out crawlers, often because their owners fear generative AI will undercut their livelihoods. The shrinking pool of freely crawlable data poses a real challenge for AI companies. Exclusive licensing deals and synthetic data may fill part of the gap, but training on synthetic data raises persistent concerns about model collapse. "Peak data" may be a looming issue, with much of the remaining untapped material locked away in PDFs and other hard-to-parse formats. The trend points toward ever more websites imposing restrictions, prompting calls for an industry standard that makes it easier for sites to express how their data may be used.
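
For context, robots.txt is a plain-text file at a site's root that tells crawlers which paths they may fetch, keyed by user-agent string. Below is a minimal sketch in Python using the standard library's urllib.robotparser to evaluate rules of the kind the report describes; the rules and URLs are hypothetical, though GPTBot is OpenAI's published crawler user agent.

    import urllib.robotparser

    # Hypothetical robots.txt rules of the kind many sites now publish:
    # block a known AI crawler (GPTBot) site-wide, allow everyone else.
    rules = [
        "User-agent: GPTBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Allow: /",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules)

    # A compliant crawler checks before fetching. Note that robots.txt is
    # advisory, not access control, which is part of the tension here.
    print(parser.can_fetch("GPTBot", "https://example.com/post/1"))      # False
    print(parser.can_fetch("ExampleBot", "https://example.com/post/1"))  # True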

https://spectrum.ieee.org/web-crawling
