Git's database internals V: scalability
Blog post from GitHub
This post treats Git as a distributed database and surveys strategies for managing repositories that are approaching their scale limits, much as a database is sharded when it grows too large. It examines several sharding methods: splitting a repository into multiple smaller ones (multi-repo sharding), using Git submodules to tie repositories together in a super-repository, adopting a monorepo to centralize all code, and applying time-based sharding to manage growth. Each method has its own trade-offs: multi-repo setups demand coordination across repositories, monorepos add build complexity, and time-based sharding is disruptive. The post also considers data offloading through partial clone, which manages storage more efficiently by moving infrequently accessed data to secondary storage. It stresses careful planning and the potential use of advanced Git features to keep large codebases performant and manageable, and it invites further exploration and innovation in these practices.
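To make the partial-clone idea concrete, here is a minimal sketch of how a client might request a clone that leaves file contents on the server until they are needed. The helper name `partial_clone_command`, the example URL, and the destination directory are hypothetical and not taken from the post; the `--filter=blob:none` and `--filter=blob:limit=<size>` options are standard `git clone` filters.

```python
from typing import List, Optional


def partial_clone_command(url: str, dest: str,
                          blob_limit: Optional[str] = None) -> List[str]:
    """Build the argv for a partial clone (hypothetical helper).

    With no limit, request a "blobless" clone: commits and trees are
    downloaded up front, and file contents are fetched on demand.
    With a limit such as "1m", only blobs at or above that size are
    left on the server.
    """
    if blob_limit is None:
        filter_spec = "blob:none"
    else:
        filter_spec = f"blob:limit={blob_limit}"
    return ["git", "clone", f"--filter={filter_spec}", url, dest]


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.run(...) to perform the clone against a real remote.
    print(" ".join(partial_clone_command("https://example.com/big.git", "big")))
```

Whether this saves space on the server side depends on how the host stores the omitted objects; the sketch only covers the client's request.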