
Building a Distributed Knowledge Graph Pipeline

Blog post from Potpie

Post Details
Company: Potpie
Date Published:
Author: Dhiren Mathur
Word Count: 1,956
Language: English
Hacker News Points: -
Summary

The post describes a system that makes large codebases easier to understand by constructing a queryable knowledge graph from source code, enabling semantic search, call graph traversal, and impact analysis. Traditional tools like grep struggle with massive repositories; the system addresses that limitation with a distributed parsing architecture. The knowledge graph is built with Tree-sitter for parsing and Neo4j for storage: code elements are represented as nodes and their interactions as edges. An inference pipeline enriches these nodes with LLM-generated docstrings and vector embeddings, which enables semantic similarity search. The distributed architecture overcomes memory exhaustion and task-coordination problems by distributing work with a bin-packing algorithm, coordinating workers through Redis, and storing large payloads in a database. With the graph in place, agents can answer complex questions about the codebase by converting natural language queries into precise code locations and can perform change impact analysis through graph traversal. Because the pipeline is distributed, it scales, tolerates failures, and turns multi-day parsing tasks into operations that complete in a few hours.
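
The summary does not say which bin-packing heuristic the pipeline uses, so the following is only a minimal sketch of one common choice, first-fit decreasing. It assumes each file has a known size (bytes or any other cost estimate) and that each work unit has a fixed capacity; `WorkUnit`, `pack_first_fit_decreasing`, and the sample file names are hypothetical and do not come from the post.

```python
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    """One batch of files handed to a single parsing worker (hypothetical type)."""
    capacity: int
    files: list[str] = field(default_factory=list)
    load: int = 0

    def fits(self, size: int) -> bool:
        return self.load + size <= self.capacity

    def add(self, path: str, size: int) -> None:
        self.files.append(path)
        self.load += size


def pack_first_fit_decreasing(file_sizes: dict[str, int], capacity: int) -> list[WorkUnit]:
    """Group files into work units using the first-fit decreasing heuristic.

    Files are visited from largest to smallest and placed into the first
    unit that still has room. A file larger than `capacity` ends up alone
    in its own, overfull unit instead of being dropped.
    """
    units: list[WorkUnit] = []
    # Sort by descending size; break ties by path so the result is deterministic.
    for path, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = next((u for u in units if u.fits(size)), None)
        if target is None:
            target = WorkUnit(capacity)
            units.append(target)
        target.add(path, size)
    return units


if __name__ == "__main__":
    # Illustrative sizes only; not measurements from the post.
    sizes = {"a.py": 70, "b.py": 50, "c.py": 40, "d.py": 30, "e.py": 10}
    for i, unit in enumerate(pack_first_fit_decreasing(sizes, capacity=100)):
        print(i, unit.load, unit.files)
```

Bounding the size of each work unit is one way to keep a single worker from exhausting memory on an oversized batch, which is the failure mode the summary mentions. How the real pipeline estimates cost and sets capacity is not stated in the summary.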