Full Text Search in SmithDB: Constructing and Querying our Inverted Index (Pt. 2)
Blog post from LangChain
SmithDB's inverted index supports fast full-text search by constructing, compacting, and querying indexes as data is ingested, so newly ingested runs become searchable within seconds. Index construction is integrated with ingestion: payloads are indexed through a JSON tape based on Apache Arrow's arrow-json crate, and string interning is used to speed up sorting. The service uses finite state transducers for term layout and a multi-tiered storage approach, with local SSDs providing immediate visibility and object storage providing durability. At query time, predicates pass through a unified pipeline that distinguishes indexed from non-indexed segments without altering the SQL interface, combining immediate local reads with comprehensive object-storage reads. The design keeps query execution efficient by coalescing GET requests and optimizing memory use during index merging, and it achieves sub-second query freshness by treating the local storage tier as an integral part of the index rather than a separate entity.
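The summary mentions coalescing GET requests but does not say how SmithDB does it. As a rough illustration of the general technique, the sketch below merges byte ranges that overlap or sit close together so that fewer object-storage requests are issued. The names `coalesce_ranges`, `to_range_header`, and the `max_gap` threshold are hypothetical and are not taken from the post.

```python
from typing import List, Tuple

# Half-open byte range [start, end) within a single stored object.
ByteRange = Tuple[int, int]


def coalesce_ranges(ranges: List[ByteRange], max_gap: int = 0) -> List[ByteRange]:
    """Merge ranges that overlap or are separated by at most max_gap bytes.

    Assumption: reading a few unneeded bytes is cheaper than issuing
    another request, so small gaps are absorbed into one larger read.
    """
    merged: List[ByteRange] = []
    for start, end in sorted(ranges):
        if start >= end:
            continue  # skip empty or inverted ranges
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def to_range_header(byte_range: ByteRange) -> str:
    """Format a half-open range as an HTTP Range header value (inclusive end)."""
    start, end = byte_range
    return f"bytes={start}-{end - 1}"


if __name__ == "__main__":
    wanted = [(0, 100), (120, 200), (5000, 5100)]
    for r in coalesce_ranges(wanted, max_gap=64):
        print(to_range_header(r))
```

With `max_gap=64`, the first two ranges are fetched in one request and the distant third range gets its own. A larger `max_gap` trades extra bytes transferred for fewer round trips; the right value depends on request latency and throughput, which the summary does not specify.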
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| Observability | 4 | 3,430 | 674 | 183 | +0% |
| Real-time | 1 | 5,457 | 1,338 | 238 | -5% |