Codebases are uniquely hard to search semantically

Blog post from Greptile

Post Details
Company: Greptile
Date Published:
Author: Daksh Gupta
Word Count: 1,001
Language: English
Hacker News Points: 56
Summary

Semantically searching a codebase is harder than semantically searching a book, because code and natural language are different kinds of text. The standard approach is to index a corpus by splitting it into units, generating a semantic vector embedding for each unit, and comparing those vectors with the query's embedding to find similar text. This works well for book search but noticeably worse for codebase search. The main issue is that a natural-language query and the code that answers it are not semantically similar, so vector embeddings of raw code capture its meaning poorly. Even simple queries returned unsatisfactory results, and a query was more similar to a natural-language description of the code than to the code itself. Chunking the codebase at the function level rather than the file level improves retrieval quality, but adding noise to a chunk significantly reduces its semantic similarity to the query.
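The effects described in the summary can be sketched with a small, self-contained example. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model (it is not the model the post evaluated), the `SOURCE` snippet, the query, and the description are all hypothetical, and the absolute scores mean nothing beyond their ordering. It chunks a file per function with Python's `ast` module, then compares the query against each function, against the whole file, and against a plain-English description.

```python
import ast
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_by_function(source):
    """Split Python source into one chunk per top-level function."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


# Hypothetical file: one relevant function plus unrelated "noise".
# It is only parsed, never executed, so undefined names are fine.
SOURCE = '''
def hash_password(password, salt):
    return sha256((salt + password).encode()).hexdigest()

def render_invoice(order, tax_rate):
    total = order.subtotal * (1 + tax_rate)
    return template.format(total=total)
'''

query = embed("where do we hash a user password")

# Per-function chunks: each function is scored on its own.
chunks = chunk_by_function(SOURCE)
per_function = {name: cosine(query, embed(code)) for name, code in chunks.items()}

# Per-file chunk: the relevant function diluted by unrelated code.
per_file = cosine(query, embed(SOURCE))

# Natural-language description of the relevant function.
description = "Hashes the user password together with a salt."
desc_score = cosine(query, embed(description))

for name, score in per_function.items():
    print(f"function {name}: {score:.3f}")
print(f"whole file: {per_file:.3f}")
print(f"description: {desc_score:.3f}")
```

In this toy setup the relevant function scores higher on its own than the file that contains it, and the description scores higher than the code. A real embedding model would produce different numbers, so treat the ordering here as a demonstration of the mechanism rather than a measurement.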