The post discusses the Midjourney dataset on Hugging Face, a large collection of image records with their associated prompts and URLs. The dataset covers 55,082,563 images totaling 148.07 TB, a volume that represents significant CDN costs for Discord. The author explains how they used DuckDB to calculate the size of the dataset without downloading all of the data, and shows the query results. They also mention using the nettop command to track network usage, and provide a more concise alternative query using read_parquet().
https://til.simonwillison.net/duckdb/remote-parquet
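
As a rough illustration of the approach summarized above, here is a minimal Python sketch of summing a size column across remote Parquet files with DuckDB's read_parquet(). The helper names (build_size_query, bytes_to_tb, run_remote), the column name "size", and the use of binary terabytes (1024^4 bytes) are assumptions made for this example, not details taken from the post; the file URLs must be supplied by the caller. It relies on the third-party duckdb package and its httpfs extension. Because Parquet is a columnar format, DuckDB can generally answer a query like this by fetching only the file metadata and the one column it needs via HTTP range requests, rather than downloading whole files.

```python
def build_size_query(urls, size_column="size"):
    """Build a DuckDB query that counts rows and sums one column
    across a list of remote Parquet files.

    `size_column` is an assumed column name; adjust it to the real schema.
    """
    if not urls:
        raise ValueError("at least one Parquet URL is required")
    # Quote each URL as a SQL string literal, doubling any single quotes.
    quoted = ", ".join("'" + u.replace("'", "''") + "'" for u in urls)
    # Quote the column name as a SQL identifier, doubling any double quotes.
    column = '"' + size_column.replace('"', '""') + '"'
    return (
        f"SELECT count(*) AS row_count, sum({column}) AS total_bytes "
        f"FROM read_parquet([{quoted}])"
    )


def bytes_to_tb(n_bytes):
    """Convert bytes to terabytes, assuming binary units (1024**4 bytes)."""
    return n_bytes / 1024**4


def run_remote(urls, size_column="size"):
    """Run the query against remote files and return (row_count, total_bytes).

    Requires network access and the third-party `duckdb` package.
    """
    import duckdb  # imported lazily so the helpers above work without it

    con = duckdb.connect()
    con.execute("INSTALL httpfs")  # extension for reading over HTTP(S)
    con.execute("LOAD httpfs")
    return con.execute(build_size_query(urls, size_column)).fetchone()
```

A call might look like `rows, total = run_remote(parquet_urls)` followed by `bytes_to_tb(total)`, where `parquet_urls` is a hypothetical list of the dataset's Parquet file URLs.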