Improving Parquet Dedupe on Hugging Face Hub

The Xet team at Hugging Face works on improving the efficiency of the Hub’s storage architecture so that users can store and update data and models more cheaply and quickly. Because Parquet files account for a substantial share of the data hosted on the Hub, optimizing how they are stored is a priority, and deduplication is key to supporting efficient dataset updates, especially regular bulk data exports. Experiments on Parquet files showed that deduplication holds up well for appends but degrades under row modifications and deletions, since an edit shifts the byte layout of the row groups that follow it. One proposed remedy is content-defined row groups, which bound the impact of an edit and yield noticeably better deduplication, though broader improvements to Parquet’s dedupe-ability are still being explored, possibly in collaboration with the Apache Arrow project.
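The content-defined idea behind this can be sketched with a toy chunker: boundaries are chosen from the data itself (a hash over a small sliding window), so an insertion or deletion only disturbs the chunks near the edit, and chunking resynchronizes afterward. The window size, boundary mask, and minimum chunk size below are illustrative assumptions, not the parameters of the Xet stack or of Parquet row groups:

```python
import hashlib

def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x1F,
               min_size: int = 64) -> list[bytes]:
    """Split data at content-defined boundaries.

    A boundary is declared where a hash of the last `window` bytes has its
    low bits zero (and the chunk has reached `min_size`). Real systems use
    fast rolling hashes (e.g. gear/buzhash); a plain sum keeps the demo simple.
    """
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        h = sum(data[i + 1 - window:i + 1])  # O(window) per byte; demo only
        if i + 1 - start >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedupe_ratio(old: bytes, new: bytes) -> float:
    """Fraction of the new file's chunks already present in the old file."""
    old_hashes = {hashlib.sha256(c).digest() for c in cdc_chunks(old)}
    new_chunks = cdc_chunks(new)
    reused = sum(1 for c in new_chunks
                 if hashlib.sha256(c).digest() in old_hashes)
    return reused / len(new_chunks)
```

With fixed-size splitting, inserting 50 bytes in the middle of a file would shift every subsequent block and defeat deduplication; with content-defined boundaries, only the chunks around the splice change and the dedupe ratio stays high. Content-defined row groups apply the same principle at the row-group level instead of the byte level.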

https://huggingface.co/blog/improve_parquet_dedupe