Getting 10TB of GitHub Logs and Extracting Details of All Users and Repositories

The article describes how Trickest’s workflow methodology was used to parse more than 10TB of GitHub logs and extract public information about every user and repository that appears in them. Downloading, parsing, and merging the data produced a large CSV file covering more than 45 million users and 220 million repositories. As a final step, the data was enriched through the GitHub API, adding detail about users and repositories that were created, deleted, or modified from 2015 to the present day. The article closes by inviting readers to get access to the data and explore it themselves.
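
The parse-and-merge step described above can be sketched roughly as follows. This is a minimal illustration, not the article's actual workflow: it assumes the logs are newline-delimited JSON events carrying an `actor.login` and a `repo.name` field, which the summary does not confirm, and the function names `extract_users_and_repos` and `write_csv` are hypothetical.

```python
import csv
import io
import json


def _nested(event, key, field):
    """Return event[key][field] if both levels exist and are well-formed, else None."""
    inner = event.get(key)
    if isinstance(inner, dict):
        return inner.get(field)
    return None


def extract_users_and_repos(lines):
    """Collect unique user logins and repository names from JSON-lines events.

    Assumption: each line is one JSON object with `actor.login` and `repo.name`.
    """
    users, repos = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of aborting a long run
        if not isinstance(event, dict):
            continue
        login = _nested(event, "actor", "login")
        name = _nested(event, "repo", "name")
        if login:
            users.add(login)
        if name:
            repos.add(name)
    return users, repos


def write_csv(values, header, out):
    """Write one sorted, de-duplicated column to an open text stream."""
    writer = csv.writer(out)
    writer.writerow([header])
    for value in sorted(values):
        writer.writerow([value])


if __name__ == "__main__":
    sample = [
        '{"actor": {"login": "alice"}, "repo": {"name": "alice/demo"}}',
        '{"actor": {"login": "bob"}, "repo": {"name": "alice/demo"}}',
    ]
    users, repos = extract_users_and_repos(sample)
    buffer = io.StringIO()
    write_csv(users, "login", buffer)
    print(buffer.getvalue())
```

Using sets keeps the merge step trivial, since results from separate log files can be combined with a set union before writing. At the scale the article mentions, in-memory sets would likely need to be replaced by an on-disk or streaming de-duplication step.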

https://trickest.com/blog/parsing-github-logs-with-trickest/
