Vectara-ingest: Data Ingestion made easy

Company

Vectara

Date Published

May 16, 2023

Author

Ofer Mendelevitch

Word count

1361

Language

English

Hacker News points

None

URL

vectara.com/blog/vectara-ingest-llm-data-ingestion-made-easy

Summary

Vectara-ingest is an open-source project that provides reusable code for crawling data sources and indexing the extracted content into Vectara corpora. Users run "crawl" jobs to ingest their data, which reduces the effort of building LLM-powered conversational search applications on top of that data. Developers can extract content from websites, from APIs such as Jira or Notion, and from local files, then index it into a Vectara corpus for search and retrieval. The project ships with crawlers for RSS, Mediawiki, Notion, Jira, Docusaurus, Discourse, S3, Folder, PMC, GitHub, Hacker News, and Edgar, and the community can extend these or contribute new ones. By handling ingestion, vectara-ingest lets Vectara users concentrate on the application itself.
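
The crawl-then-index pattern described in the summary can be illustrated with a minimal sketch. This is not the vectara-ingest code or its API: the names `Document`, `Crawler`, `FolderCrawler`, `InMemoryIndexer`, `CRAWLERS`, and `run_crawl` are all hypothetical, and the indexer stores documents in a dictionary instead of calling Vectara. It only shows how a set of interchangeable crawlers might feed a single indexing step.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterator, Type


@dataclass
class Document:
    """One unit of extracted content (hypothetical shape, for illustration)."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


class InMemoryIndexer:
    """Hypothetical stand-in for a client that indexes into a corpus.

    A real indexer would send each document to the search service;
    this one only keeps documents in a dict so the sketch runs offline.
    """

    def __init__(self) -> None:
        self.docs: Dict[str, Document] = {}

    def index(self, doc: Document) -> None:
        self.docs[doc.doc_id] = doc


class Crawler(ABC):
    """Base class: each data source implements crawl() and yields documents."""

    @abstractmethod
    def crawl(self) -> Iterator[Document]:
        ...


class FolderCrawler(Crawler):
    """Yields one document per matching file under a local folder."""

    def __init__(self, root: str, suffix: str = ".txt") -> None:
        self.root = Path(root)
        self.suffix = suffix

    def crawl(self) -> Iterator[Document]:
        for path in sorted(self.root.rglob(f"*{self.suffix}")):
            if path.is_file():
                yield Document(
                    doc_id=str(path.relative_to(self.root)),
                    text=path.read_text(encoding="utf-8"),
                    metadata={"source": "folder"},
                )


# Registry mapping a crawler name to its class; new sources are added here.
CRAWLERS: Dict[str, Type[Crawler]] = {"folder": FolderCrawler}


def run_crawl(crawler_type: str, indexer: InMemoryIndexer, **options) -> int:
    """Run one crawl job and return the number of documents indexed."""
    crawler = CRAWLERS[crawler_type](**options)
    count = 0
    for doc in crawler.crawl():
        indexer.index(doc)
        count += 1
    return count
```

Under these assumptions, supporting a new source means writing one more `Crawler` subclass and registering it in `CRAWLERS`; the indexing step stays unchanged.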