Crawling a quarter billion webpages in 40 hours (2012)

The author of the post describes his experience crawling a small but non-trivial fraction of the web: 250,113,669 pages in 39 hours and 25 minutes, for under $580, using 20 Amazon EC2 machine instances. The post details the architecture of the crawler, its use of threads, and its use of a Bloom filter to track URLs already seen. The author also discusses the challenges of managing the URL frontier and dealing with both anticipated and unanticipated errors. The post raises questions about who should be allowed to crawl the web, and points to services like Common Crawl as a possible answer. The author wrote the code in Python but decided not to release it, citing concerns over the burden the crawler could impose on websites.
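The Bloom filter mentioned above is typically used in crawlers to check whether a URL has already been seen, using a fixed amount of memory at the cost of occasional false positives. The sketch below is not the author's unreleased code; it is a minimal illustration in Python, with illustrative sizing parameters, using a single SHA-256 digest to derive multiple hash positions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for tracking URLs a crawler has seen.

    Parameters are illustrative only; the original post does not
    specify the filter's size or hash count.
    """

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
url = "http://example.com/page"
if url not in seen:   # membership test before crawling
    seen.add(url)     # mark as crawled
```

Because a Bloom filter never stores the URLs themselves, a 250-million-page crawl can keep its seen-set check in RAM, accepting that a small fraction of fresh URLs will be wrongly skipped as duplicates.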
