Crawl Order and Disorder

The search engine’s crawler has become more efficient after migrating to slop crawl data, which reduced its memory requirements by 80%. It now completes 99.9% of a crawl in just 4 days, but the remaining 0.1% takes a further week because the number of concurrent crawl tasks per domain name is limited. Academic domains are the main source of this long tail: they tend to have numerous subdomains, and these are often large websites served by underpowered machines. To shorten the tail, crawl tasks are now ordered by subdomain count, so that domains with many subdomains start early, and jitter is added to request timing so that no single site is overwhelmed. In effect, the change prioritizes crawling academic websites over blog hosts, improving overall performance and efficiency.
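The ordering and jitter described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the function names, the "last two labels" rule for grouping hosts under a domain, and the jitter range are all assumptions made for the example.

```python
import random
from collections import Counter


def registered_domain(host: str) -> str:
    """Group a host under its parent domain.

    Assumption: the last two labels identify the domain. A real crawler
    would more likely consult the Public Suffix List.
    """
    return ".".join(host.split(".")[-2:])


def order_crawl_tasks(hosts: list[str]) -> list[str]:
    """Order hosts so that domains with the most subdomains come first."""
    counts = Counter(registered_domain(h) for h in hosts)
    # sorted() is stable, so hosts of equally sized domains keep their input order.
    return sorted(hosts, key=lambda h: -counts[registered_domain(h)])


def jittered_delay(base_seconds: float, jitter_fraction: float = 0.5) -> float:
    """Return the base delay stretched by a random amount (hypothetical range).

    Randomizing the delay keeps concurrent tasks for sibling subdomains from
    hitting the same servers in lockstep.
    """
    return base_seconds * (1.0 + random.uniform(0.0, jitter_fraction))


if __name__ == "__main__":
    hosts = [
        "blog.example.com",
        "a.uni.edu",
        "b.uni.edu",
        "c.uni.edu",
        "www.example.com",
    ]
    for host in order_crawl_tasks(hosts):
        print(f"{host}: wait {jittered_delay(1.0):.2f}s between requests")
```

Starting the largest groups first matters because the per-domain concurrency limit makes them the slowest to drain; if they were scheduled last, they would be the only work left at the end of the crawl.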

https://www.marginalia.nu/log/a_117_crawl_order/
