Scaling Git’s garbage collection
Blog post from GitHub
GitHub manages over 18.6 petabytes of Git data, a significant portion of which consists of unreachable objects, such as old versions of files and commits that belonged to deleted branches. Removing these objects used to cause problems, particularly in large repositories, so GitHub developed "cruft packs." A cruft pack stores a repository's unreachable objects together in a single pack and records each object's last modification time alongside it, without changing which objects are reachable. Git previously tracked those times by writing each unreachable object out as a separate loose file, which scales poorly in large repositories. With the modification times kept next to the pack, garbage collection can delete only the objects that have stayed unreachable past a grace period, which lowers the risk of corrupting the repository by removing something still in use. Cruft packs have significantly reduced the size and complexity of repositories. For example, some repositories have seen their storage requirements shrink from 186 gigabytes to just 2 gigabytes. To guard against corruption during pruning, GitHub also introduced "limbo" repositories, which temporarily store expired objects so that any missing data can be recovered. The entire solution, including cruft packs and limbo repositories, has been contributed to the open-source Git project, improving Git's ability to manage large volumes of data.
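The pruning rule described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Git's or GitHub's implementation: the `CruftPack` class, its `prune` method, and the `grace_period` parameter are hypothetical names invented for illustration, and the object IDs and timestamps are made up.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CruftPack:
    """Toy model (hypothetical): unreachable object IDs mapped to the
    time, in seconds, at which each object was last modified."""
    mtimes: Dict[str, int] = field(default_factory=dict)

    def prune(self, now: int, grace_period: int) -> "CruftPack":
        """Remove objects last modified before the cutoff and return them
        as a separate pack, loosely mimicking a hand-off to a limbo store
        instead of deleting them outright."""
        cutoff = now - grace_period
        expired = {oid: t for oid, t in self.mtimes.items() if t < cutoff}
        # Keep only objects still inside the grace period.
        self.mtimes = {oid: t for oid, t in self.mtimes.items() if t >= cutoff}
        return CruftPack(expired)


# Made-up example: two unreachable objects, one old and one recent.
cruft = CruftPack({"aaa": 100, "bbb": 900})
limbo = cruft.prune(now=1000, grace_period=500)
print(sorted(cruft.mtimes))  # objects still kept in the cruft pack
print(sorted(limbo.mtimes))  # expired objects held for possible recovery
```

Returning the expired objects rather than discarding them is the point of the sketch: if a pruned object later turns out to be needed, it can still be found in the separate holding area.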