A multimodal dataset with one trillion tokens

MINT-1T is an open-source Multimodal INTerleaved dataset containing one trillion text tokens and 3.4 billion images, roughly ten times larger than existing open-source datasets of this kind. What sets MINT-1T apart is its inclusion of previously untapped sources such as PDFs and arXiv papers. The dataset is released in several subsets, including HTML and PDF data, with shards for each CommonCrawl snapshot; a July 24th update added arXiv data. The dataset was open-sourced on June 17th alongside a detailed technical report, and the authors encourage citing their work if it proves useful.

https://github.com/mlfoundations/MINT-1T
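A minimal sketch of how one might stream a few samples from one of the subsets using the Hugging Face `datasets` library. The repository identifier ("mlfoundations/MINT-1T-HTML"), split name, and record schema below are assumptions for illustration only; consult the MINT-1T repository and dataset card for the exact identifiers and fields.

```python
from itertools import islice

from datasets import load_dataset

# Hypothetical subset identifier: MINT-1T is described as being split into
# HTML, PDF, and ArXiv subsets, sharded per CommonCrawl snapshot.
ds = load_dataset("mlfoundations/MINT-1T-HTML", split="train", streaming=True)

# Stream a handful of interleaved documents without downloading full shards.
for sample in islice(ds, 3):
    # Interleaved documents typically pair text with image references;
    # the actual keys may differ from what this prints.
    print(list(sample.keys()))
```

Streaming mode is used here because the full dataset is far too large to download casually; it lets you inspect the interleaved document structure from a single shard before committing to a larger pull.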
