Company:
Date Published:
Author: Shane Connelly
Word count: 1015
Language: English
Hacker News points: None

Summary

Before deciding to crawl a website for search, it's worth asking whether a better alternative exists, such as a machine-readable source of semi-structured documents (for example, raw JSON). Websites are designed for humans, not machines, and traditional HTML parsers can struggle with pages that rely heavily on JavaScript and CSS. If crawling is the right choice, a bespoke crawler built on a headless browser such as Selenium, Playwright, or Puppeteer can drive a real browser to render each page and extract specific content. Alternatively, rendering the page as a PDF can produce a more stable output that "gets what the user gets," though headless PDF generation is still in its early stages.
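A minimal sketch of the problem the summary describes: a traditional HTML parser sees only the static markup, so content injected by JavaScript at load time never reaches it. The page and extractor below are hypothetical illustrations using only Python's standard-library `html.parser`, not code from the original article.

```python
from html.parser import HTMLParser

# Hypothetical page whose visible content is injected by JavaScript
# at load time -- a common pattern that defeats static HTML parsers.
PAGE = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById('content').innerText = 'Hello from JavaScript';
  </script>
</body></html>
"""

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.in_skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_skip = False

    def handle_data(self, data):
        if not self.in_skip and data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(PAGE)
print(parser.chunks)  # [] -- the JavaScript-injected text is invisible here
```

A headless browser (Selenium, Playwright, or Puppeteer) sidesteps this by executing the JavaScript first and exposing the rendered DOM, which is why the article recommends it for dynamic sites.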