Scaling Git’s garbage collection

Post Details

Company

GitHub

Date Published

Sept. 13, 2022

Author

Taylor Blau

Word Count

5,089

Language

English

Hacker News Points

-

Source URL

github.blog/engineering/architecture-optimization/scaling-gits-garbage-collection

Summary

GitHub manages over 18.6 petabytes of Git data, a significant portion of which consists of unreachable objects, such as old file versions and commits left behind by deleted branches. Storing and deleting these objects used to cause problems, particularly in large repositories, so GitHub developed "cruft packs." A cruft pack stores a repository's unreachable objects together in a single pack, kept separate from the reachable objects, and records each object's modification time alongside it. Garbage collection can then expire objects by age efficiently and with less risk of corrupting the repository. Cruft packs have significantly reduced the size and complexity of repositories: in one example cited, storage shrank from 186 gigabytes to just 2 gigabytes. To guard against data corruption during pruning, GitHub also introduced "limbo" repositories, which temporarily hold expired objects so that any object later found to be missing can be recovered. The entire solution, including cruft packs and limbo repositories, has been contributed to the open-source Git project, improving Git's ability to manage large volumes of data.
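
The age-based expiry idea summarized above can be illustrated with a small Python sketch. This is not Git's implementation: `CruftEntry`, `expire_cruft`, `collect`, the 14-day grace period, and the dictionary standing in for a limbo repository are all hypothetical names and values chosen for illustration. It only shows the general shape: each unreachable object carries a modification time, objects older than a cutoff are expired, and expired objects are parked rather than discarded outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CruftEntry:
    """One unreachable object and its last modification time (hypothetical type)."""
    oid: str
    mtime: int  # seconds since the Unix epoch


def expire_cruft(entries, now, grace_seconds):
    """Split entries into (kept, expired) using a grace-period cutoff."""
    cutoff = now - grace_seconds
    kept = [e for e in entries if e.mtime >= cutoff]
    expired = [e for e in entries if e.mtime < cutoff]
    return kept, expired


def collect(entries, limbo, now, grace_seconds):
    """Expire old entries, parking them in `limbo` instead of discarding them."""
    kept, expired = expire_cruft(entries, now, grace_seconds)
    for entry in expired:
        # Assumption: a plain dict stands in for a "limbo" repository.
        limbo[entry.oid] = entry
    return kept


if __name__ == "__main__":
    DAY = 86_400
    now = 100 * DAY
    cruft = [
        CruftEntry("a1", now - 1 * DAY),   # recently modified, kept
        CruftEntry("b2", now - 30 * DAY),  # older than the grace period, expired
        CruftEntry("c3", now - 14 * DAY),  # exactly at the cutoff, kept
    ]
    limbo = {}
    kept = collect(cruft, limbo, now, grace_seconds=14 * DAY)
    print([e.oid for e in kept])
    print(sorted(limbo))
```

Keeping the modification time per object, rather than per pack, is what lets a single pack hold objects of different ages while still expiring each one individually.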