
Scaling Git’s garbage collection

Blog post from GitHub

Post Details
Company: GitHub
Author: Taylor Blau
Word Count: 5,089
Language: English
Summary

GitHub stores more than 18.6 petabytes of Git data, a significant portion of which consists of unreachable objects: objects that no branch or tag refers to any longer, such as old file versions and commits left behind by deleted branches. Removing these objects used to cause problems, particularly in large repositories, so GitHub developed the concept of "cruft packs." A cruft pack stores a repository's unreachable objects together in one pack, separate from the packs that hold reachable objects, which addresses the earlier problems with storing and deleting them. Each object in a cruft pack is recorded with its last modification time, so garbage collection can efficiently decide which objects are old enough to delete while reducing the risk of corrupting the repository. Cruft packs have significantly reduced the size and complexity of repositories. For example, some repositories have seen their storage requirements shrink from 186 gigabytes to just 2 gigabytes. To guard against corruption that pruning can still cause, GitHub also introduced "limbo" repositories, which temporarily hold expired objects so that any object later found to be missing can be recovered. The entire solution, including cruft packs and limbo repositories, has been contributed to the open-source Git project, improving Git's ability to manage large volumes of data.
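
The expiry step described above can be sketched as a small Python model. This is an illustration under stated assumptions, not Git's or GitHub's implementation: the function name expire_cruft, the object IDs, and the 14-day grace period are all hypothetical. It also leaves out a detail of real Git, which keeps an old object if a newer unreachable object still refers to it.

```python
from typing import Dict, Tuple

DAY = 24 * 60 * 60  # seconds in one day


def expire_cruft(
    cruft_mtimes: Dict[str, int], now: int, grace_seconds: int
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split a cruft pack's objects into those kept and those expired.

    cruft_mtimes maps an object ID to its last modification time (seconds).
    Expired objects are returned as a "limbo" set instead of being dropped,
    mirroring the idea of holding them temporarily for recovery.
    Hypothetical helper: it models the idea only and touches no repository.
    """
    cutoff = now - grace_seconds
    kept: Dict[str, int] = {}
    limbo: Dict[str, int] = {}
    for oid, mtime in cruft_mtimes.items():
        if mtime < cutoff:
            limbo[oid] = mtime  # older than the grace period: expire to limbo
        else:
            kept[oid] = mtime   # still within the grace period: stays in the cruft pack
    return kept, limbo


if __name__ == "__main__":
    now = 100 * DAY
    cruft = {
        "aaa111": now - 30 * DAY,  # made-up object IDs
        "bbb222": now - 1 * DAY,
        "ccc333": now - 15 * DAY,
    }
    kept, limbo = expire_cruft(cruft, now, grace_seconds=14 * DAY)
    print(sorted(kept))   # ['bbb222']
    print(sorted(limbo))  # ['aaa111', 'ccc333']
```

Returning the expired objects rather than discarding them is the point of the sketch: deletion becomes a two-step process, so an object that turns out to be needed can still be restored from limbo.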