RedPajama v2 Open Dataset with 30T Tokens for Training LLMs

Today, we are excited to release RedPajama-Data-v2, a massive web dataset for language model training. It contains 30 trillion filtered and deduplicated tokens drawn from 84 CommonCrawl dumps, covering 5 languages, along with over 40 pre-computed quality annotations that can be used for further filtering and weighting. RedPajama-Data-v2 is the largest public dataset specifically designed for LLM training; it aims to provide comprehensive coverage of CommonCrawl together with a wide range of quality annotations. The dataset is open source and available on GitHub and HuggingFace. We hope this release will benefit the LLM developer community and encourage further research into pre-training datasets.
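To illustrate how pre-computed quality annotations can drive downstream filtering, here is a minimal sketch in Python. The signal names used below (`word_count`, `dup_fraction`) and the record layout are illustrative assumptions for this example, not the dataset's actual schema; the idea is simply that each document carries numeric signals you threshold against.

```python
# Hypothetical sketch: filtering web documents using pre-computed quality
# signals, in the spirit of RedPajama-Data-v2's annotations.
# NOTE: the field names below are assumptions for illustration,
# not the dataset's real schema.

def passes_quality_filter(doc, min_words=50, max_dup_fraction=0.3):
    """Keep a document only if its quality signals clear simple thresholds."""
    signals = doc["quality_signals"]
    if signals["word_count"] < min_words:
        return False  # too short to be useful training text
    if signals["dup_fraction"] > max_dup_fraction:
        return False  # too much duplicated content
    return True

docs = [
    {"text": "short spam",
     "quality_signals": {"word_count": 2, "dup_fraction": 0.9}},
    {"text": "a long, clean article about language models ...",
     "quality_signals": {"word_count": 800, "dup_fraction": 0.05}},
]

kept = [d for d in docs if passes_quality_filter(d)]
```

Because the annotations ship with the data, different teams can apply different thresholds, or combine signals into weights, without re-running expensive per-document analysis.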

https://together.ai/blog/redpajama-data-v2