
How to crawl websites for search

Blog post from Vectara

Post Details
Company: Vectara
Date Published:
Author: Shane Connelly
Word Count: 1,085
Language: English
Hacker News Points: -
Summary

This is part one of a two-part blog series on web crawling. It covers when crawling is appropriate, which methods are available, and how to crawl websites effectively. The post stresses that although crawling can make web content searchable, it is often not the best approach when machine-readable data such as raw JSON is available, because web pages are designed for human readers rather than for programs. Where crawling is necessary, it recommends bespoke crawlers and real browsers to handle dynamic web content, and suggests rendering pages as PDFs when access to structured data is limited. The post also introduces Vectara, a search engine that returns relevant, language-independent search results, and previews the second post, which explores Vectara's web crawler application.
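
The point about preferring machine-readable data over crawled pages can be illustrated with a small sketch. It is not taken from the original post: the sample JSON record, the sample HTML page, and the `TextExtractor` helper are all hypothetical, and only the Python standard library is used. It shows that the same article arrives as clean fields when read from JSON, but mixed with navigation and footer text when extracted from a page built for human readers.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def text_from_html(html):
    """Return the list of non-empty text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample data: one article, once as a JSON record, once as a rendered page.
RAW_JSON = '{"title": "Crawling 101", "body": "Prefer structured data when you can get it."}'
PAGE_HTML = """
<html><head><title>Crawling 101 | Example Site</title><style>nav {color: red}</style></head>
<body>
<nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
<h1>Crawling 101</h1>
<p>Prefer structured data when you can get it.</p>
<footer>Subscribe to our newsletter</footer>
<script>trackPageView();</script>
</body></html>
"""

record = json.loads(RAW_JSON)          # fields are already separated and labeled
html_chunks = text_from_html(PAGE_HTML)  # content is mixed with page chrome

print(record["title"], "->", record["body"])
print(html_chunks)
```

In the JSON case the title and body are directly addressable. In the HTML case the crawler still has to decide which fragments are content and which are menus, footers, or other page furniture, and this sketch does not attempt to handle JavaScript-rendered content at all, which is where the post's suggestion of a real browser would come in.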