I mirrored all the code from PyPI to GitHub and analysed it

PyPI (the Python Package Index) holds a very large amount of code: over 321 billion lines of text, with a total uncompressed size of 55.0 TiB, or roughly 41 million floppy disks. Some of the statistics count only unique projects, not every version of the same project. A breakdown of language features shows which constructs appear most often in projects, such as list comprehensions, f-strings, and annotations. PyPI also contains a significant number of leaked secrets, including API keys and access tokens. Its growth has been exponential, and extrapolating the trend suggests the number of packages would eventually surpass the global population. Binary files account for about 75% of PyPI's size, and TensorFlow is the largest single project, occupying 16% of all data on PyPI. The statistics also include a breakdown of file extensions and the reasons why some files are not committed to the GitHub mirror.
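The floppy-disk comparison can be checked with a little arithmetic. The sketch below assumes a 3.5-inch high-density floppy holding 1,474,560 bytes (the summary does not say which capacity was used) and takes the 55.0 TiB figure at face value. The function name is hypothetical and chosen only for illustration.

```python
# Rough check of the "55.0 TiB is about 41 million floppy disks" comparison.
# Assumption: a 3.5" high-density floppy holds 2880 sectors of 512 bytes.

TIB = 2**40                   # bytes in one tebibyte
FLOPPY_BYTES = 2880 * 512     # 1,474,560 bytes (assumed floppy capacity)


def floppies_needed(size_tib: float, floppy_bytes: int = FLOPPY_BYTES) -> float:
    """Return how many floppies of `floppy_bytes` would hold `size_tib` TiB."""
    return size_tib * TIB / floppy_bytes


disks = floppies_needed(55.0)
print(f"{disks / 1e6:.1f} million floppy disks")
```

Under these assumptions the result lands at roughly 41 million, which matches the figure quoted above. A different floppy capacity would shift the number proportionally.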

