Git's database internals V: scalability

Blog post from GitHub

Post Details
Author: Derrick Stolee
Word Count: 3,375
Language: English
Summary

This post treats Git as a distributed database and discusses strategies for managing large repositories as they approach scale limits, drawing on the database practice of sharding. It examines several sharding methods: splitting a repository into multiple smaller ones (multi-repo sharding), using Git submodules to tie repositories together under a super-repository, adopting a monorepo to centralize all code, and time-based sharding to manage growth. Each method has its own trade-offs: multi-repo setups demand coordination, monorepos add build complexity, and time-based sharding is disruptive. The post also considers data offloading through partial clone, which manages storage by moving infrequently accessed data to secondary storage. It emphasizes careful planning and the use of advanced Git features to keep large codebases performant and manageable, and it invites further exploration and innovation in these practices.
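To make the partial clone mention concrete, here is a minimal sketch of how a client might request one. It is an illustration, not something taken from the post: the helper names `partial_clone_command` and `run_partial_clone` and the example URL are hypothetical. The only Git feature it relies on is the `--filter` option of `git clone` (for example `--filter=blob:none`), which omits file contents at clone time and lets Git fetch them later on demand. It shows only the client-side mechanism, not how a server would move data to secondary storage.

```python
import subprocess
from typing import List


def partial_clone_command(url: str, dest: str, blob_filter: str = "blob:none") -> List[str]:
    """Build a `git clone` command line that requests a partial clone.

    `blob_filter` is passed to git's `--filter` option. "blob:none" omits
    all file contents up front; "blob:limit=1m" would omit only blobs
    larger than one mebibyte.
    """
    return ["git", "clone", f"--filter={blob_filter}", url, dest]


def run_partial_clone(url: str, dest: str, blob_filter: str = "blob:none") -> None:
    """Run the partial clone (hypothetical helper; requires git and network access)."""
    subprocess.run(partial_clone_command(url, dest, blob_filter), check=True)


if __name__ == "__main__":
    # Placeholder URL for illustration only; print the command instead of running it.
    print(" ".join(partial_clone_command("https://example.com/big-repo.git", "big-repo")))
```

Building the command separately from running it keeps the sketch testable without a network connection or a real repository.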