Company:
Date Published:
Author: Shane Connelly
Word count: 1015
Language: English
Hacker News points: None

Summary

Before deciding to crawl a website for search, it's worth asking whether a better alternative exists, such as a machine-readable source of semi-structured documents (for example, raw JSON). Websites are designed for humans, not machines, and traditional HTML parsers can struggle with pages that rely heavily on JavaScript and CSS. If crawling is the right choice, a bespoke crawler built on a headless browser such as Selenium, Playwright, or Puppeteer can drive a real browser to render each page and extract specific content. Alternatively, rendering the page as a PDF can produce a more stable output that "gets what the user gets," though headless PDF generation is still in its early stages.
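A minimal sketch of the problem the summary describes: a traditional HTML parser sees only the static markup, so content injected by JavaScript at load time never reaches it. The page and extractor below are hypothetical illustrations using only Python's standard-library `html.parser`, not code from the original article.

```python
from html.parser import HTMLParser

# Hypothetical page whose visible content is injected by JavaScript
# at load time -- a common pattern that defeats static HTML parsers.
PAGE = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById('content').innerText = 'Hello from JavaScript';
  </script>
</body></html>
"""

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.in_skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_skip = False

    def handle_data(self, data):
        if not self.in_skip and data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(PAGE)
print(parser.chunks)  # [] -- the JavaScript-injected text is invisible here
```

A headless browser (Selenium, Playwright, or Puppeteer) sidesteps this by executing the JavaScript first and exposing the rendered DOM, which is why the article recommends it for dynamic sites.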