The Marginalia crawler has been reworked to address a long-standing design issue: if the crawler shut down mid-crawl, it had to restart from scratch, which was time-consuming and inconvenient for the server admin. To fix this, the author modified the crawler to keep a journal of its HTTP traffic in the WARC format, which is well suited to recording web traffic and makes recovery straightforward. The implementation uses the jwarc library. Because web archiving and search-engine crawling have somewhat different needs, minor extensions to the format were necessary, but overall the change left the crawler more efficient and better integrated.

The author also considered adopting WARC as the long-term storage format for crawl data, but the files turned out to be too large compared to the existing JSON format. Instead, the crawl data is converted to Parquet, which improved access speeds. Writing WARC first and then converting to Parquet increases disk I/O, but the author expects this to be manageable.

Some challenges remain, such as supporting multiple data formats during the migration. The next crawl, scheduled for mid-January, will serve as a test of these changes, with the plan being to release an update afterward.
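To make the recovery idea concrete, the sketch below hand-rolls the on-disk framing of a single WARC "response" record per the WARC/1.0 specification. This is illustration only: the Marginalia crawler itself uses jwarc (a Java library), not this code, and the URI and payload here are made up.

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri: str, http_bytes: bytes) -> bytes:
    """Frame one captured HTTP response as a WARC/1.0 record.

    Records are plain appendable byte blocks: a header section, a blank
    line, the raw HTTP payload, and a trailing double CRLF.
    """
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http;msgtype=response",
        f"Content-Length: {len(http_bytes)}",  # length of the HTTP payload only
    ]
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8")
    return head + http_bytes + b"\r\n\r\n"  # each record ends with two CRLFs

# Hypothetical captured response for illustration.
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("https://www.example.com/", http)
```

Because records like this are simply appended to the log as the crawl proceeds, a crashed crawler can recover by scanning the file and skipping URIs it has already fetched, which is the property the redesign relies on.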
https://www.marginalia.nu/log/94_warc_warc/