How to crawl websites for search

Post Details

Company

Vectara

Date Published

Feb. 9, 2023

Author

Shane Connelly

Word Count

1,085

Language

English

Hacker News Points

-

Source URL

www.vectara.com/blog/how-to-crawl-websites-for-search

Summary

Part one of a two-part blog series on web crawling covers when crawling is appropriate and how to do it effectively. Crawling can make web content searchable, but it is often not the best approach when machine-readable data such as raw JSON is available, because web pages are designed for human readers rather than for programs. The post recommends bespoke crawlers and real browsers for handling dynamic web content, and suggests rendering pages as PDFs when access to structured data is limited. It also introduces Vectara, a search engine that improves user interaction by providing relevant, language-independent search results, and previews the second post, which explores Vectara's web crawler application.
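
To illustrate the point about machine-readable data, here is a minimal sketch using only the Python standard library. It is not taken from the post: the sample record, the sample page, the field names `title` and `body`, and the helpers `text_from_json` and `text_from_html` are all hypothetical. It shows how text pulled from a human-oriented HTML page tends to carry navigation noise that a structured JSON record does not.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def text_from_html(html):
    """Return all visible text from an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def text_from_json(raw, fields=("title", "body")):
    """Return only the named fields from a JSON record (hypothetical helper)."""
    record = json.loads(raw)
    return " ".join(str(record[f]) for f in fields if f in record)


# Hypothetical sample data, invented for this illustration.
SAMPLE_JSON = '{"title": "Crawling 101", "body": "Prefer structured data.", "id": 7}'
SAMPLE_HTML = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<nav>Home | Blog</nav><h1>Crawling 101</h1>"
    "<p>Prefer structured data.</p><script>track()</script></body></html>"
)

json_text = text_from_json(SAMPLE_JSON)
html_text = text_from_html(SAMPLE_HTML)
print(json_text)
print(html_text)
```

The JSON path returns just the chosen fields, while the HTML path also picks up the navigation text, which a crawler would then need extra, site-specific rules to filter out.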