The Public Git Archive story

Public Git Archive (PGA) is the result of months of effort curating a dataset suitable for training Machine Learning on Source Code (aka MLonCode) models. It contains 182,000 top-starred repositories from GitHub and takes 3 TB on disk. The repositories were cloned in February-March 2018. Check out the announcement post for more information. You should also check out Engine, which lets you run SQL queries on top of PGA and do other cool things.


This is a companion discussion topic for the original entry at https://blog.sourced.tech/post/pga_history/