Codebases are uniquely hard to search semantically

Company

Greptile

Date Published

Aug. 15, 2024

Author

Daksh Gupta

Word count

1001

Language

English

Hacker News points

URL

www.greptile.com/blog/semantic

Summary

Semantically searching a codebase is harder than semantically searching a book because code and natural language differ. The usual approach of indexing a corpus by splitting it into units, generating a semantic vector embedding for each unit, and comparing those vectors to find similar text works well for book search but less well for codebase search. The main issue is that code and natural language are not semantically similar, so vector embeddings of raw code capture its meaning poorly. Even for simple queries the results are unsatisfactory: the similarity between a query and a description of the code is higher than the similarity between the query and the code itself. Chunking the codebase into smaller units, per function rather than per file, can improve retrieval quality, while adding noise to a chunk significantly reduces its semantic similarity with the query.
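
The split, embed, and compare pipeline described in the summary can be sketched roughly as follows. This is a minimal illustration, not the method used in the post: `chunk_by_function`, `toy_embed`, `cosine`, `search`, and the sample `SOURCE` are all hypothetical names, and `toy_embed` is a bag-of-words stand-in for a real embedding model. Because it only rewards literal token overlap, it shows the shape of the pipeline rather than true semantic similarity.

```python
import ast
import math
import re
from collections import Counter


def chunk_by_function(source: str) -> dict:
    """Split Python source into one chunk per top-level function."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def toy_embed(text: str) -> Counter:
    """Assumption: stand-in for an embedding model (lowercase word counts)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: dict) -> list:
    """Rank chunks by similarity to the query, most similar first."""
    query_vec = toy_embed(query)
    scored = [(name, cosine(query_vec, toy_embed(body))) for name, body in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-function "codebase" used only for illustration.
SOURCE = '''
def parse_config(path):
    with open(path) as handle:
        return handle.read()

def send_email(recipient, subject):
    print("sending", subject, "to", recipient)
'''

chunks = chunk_by_function(SOURCE)
ranking = search("send an email to a recipient", chunks)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Chunking per function keeps each vector focused on one unit of behavior. Embedding the whole file as a single chunk would mix the tokens of both functions into one vector, which is one way noise can dilute similarity with the query.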