This study reconstructs a comprehensive archive of 2.7 million unique U.S. newswire articles from 1878 to 1977 by applying deep learning to raw image scans of local newspapers. The dataset includes georeferenced locations, custom topic tags, named entities, and disambiguated individuals linked to Wikipedia. A neural bi-encoder model de-duplicates reproduced articles, while a text classifier ensures that only public domain newswire articles are included. This Newswire dataset offers valuable insights into historical news consumption and can support language modeling as well as research in computational linguistics, social science, and digital humanities.
https://arxiv.org/abs/2406.09490
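The de-duplication step above rests on a simple idea: embed each article as a vector, then treat pairs whose similarity exceeds a threshold as reproductions of the same wire story and keep one representative per group. The sketch below illustrates that thresholded-similarity idea using TF-IDF vectors as a lightweight stand-in for the paper's neural bi-encoder; the function name, threshold value, and sample articles are illustrative assumptions, not details from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(articles, threshold=0.8):
    """Keep one representative per cluster of near-duplicate articles.

    TF-IDF embeddings stand in here for the paper's neural bi-encoder;
    the greedy first-seen-wins grouping is an illustrative simplification.
    """
    vecs = TfidfVectorizer().fit_transform(articles)
    sims = cosine_similarity(vecs)  # pairwise cosine similarity matrix
    keep, dropped = [], set()
    for i in range(len(articles)):
        if i in dropped:
            continue
        keep.append(i)  # first occurrence represents its duplicate cluster
        for j in range(i + 1, len(articles)):
            if sims[i, j] >= threshold:
                dropped.add(j)  # mark near-duplicate for removal
    return [articles[i] for i in keep]

docs = [
    "Senate passes the new tariff bill after long debate",
    "Senate passes the new tariff bill after long debate today",
    "Local fair draws record crowds this weekend",
]
unique = deduplicate(docs)  # collapses the two near-identical wire stories
```

A neural bi-encoder would replace the TF-IDF step with learned dense embeddings, which is what lets the real pipeline match reprinted articles despite OCR noise and local edits that defeat purely lexical similarity.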