Vectara's web crawler sample application is designed to handle challenging data-ingestion scenarios in which no upstream semi-structured data or rendered tags are available. The crawler offers four modes of link discovery: single URL, sitemap, RSS feed, and recursive crawl. Each mode has its strengths and limitations; the recursive mode in particular demands care around rendering timeouts, link uniqueness, memory usage, and the discovery of hidden content. Once a link is found, the crawler renders it to PDF using either Chrome or Qt WebKit, selected via the `--pdf-driver` parameter, a choice that affects both rendering accuracy and security. Finally, the rendered PDFs are submitted to Vectara's file upload API for processing, yielding content that indexes well for search.
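The link-uniqueness issue in recursive mode can be illustrated with a small sketch: trivially different URLs (mixed-case hosts, default ports, `#fragment` anchors) should collapse to one canonical form before being added to the crawl frontier, or the crawler revisits the same page and memory use grows. This is an illustrative approach, not the sample crawler's actual normalization logic; the `normalize` and `should_crawl` helpers below are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different links dedupe to one entry.

    Illustrative only; the real crawler's normalization may differ.
    """
    parts = urlsplit(url)
    host = parts.hostname.lower() if parts.hostname else ""
    # Drop default ports and fragments, which don't change the fetched page.
    netloc = host if parts.port in (None, 80, 443) else f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))

seen = set()

def should_crawl(url: str) -> bool:
    """Return True only the first time a canonical URL is encountered."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this in place, `should_crawl("http://example.com/page")` returns `True` the first time, while a later `should_crawl("http://EXAMPLE.com:80/page#section")` returns `False`, since both normalize to the same canonical URL.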