Building a scalable, easy-to-use web crawler for Elastic Enterprise Search

Company

Elastic

Date Published

Jan. 12, 2021

Author

Word count

1381

Language

Hacker News points

None

URL

www.elastic.co/blog/building-a-scalable-easy-to-use-web-crawler-for-elastic-enterprise-search

Summary

Building a scalable, user-friendly web crawler for Elastic Enterprise Search means coping with misbehaving sites, duplicate content, and pages that do not comply with web standards. The development team draws on experience dating back to the first iteration of its web crawler for Swiftype Site Search, which now processes over a billion web pages monthly, to build a content ingestion mechanism. Key problems include deduplicating content, identifying content uniquely even when URLs are unreliable, and managing crawl lifecycles that become unpredictable because of site-specific issues. The new crawler combines URL and content hashing, defensive parsing of diverse content types, and built-in heuristics for complex scenarios. For observability, every crawler action is logged to Elasticsearch and can be analyzed in Kibana, so users can see how and why the crawler made each decision. The crawler is expected to gain further features as it evolves, and users are encouraged to try it through Elastic's free trials and resources.
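
The URL and content hashing mentioned in the summary can be illustrated with a minimal Python sketch. This is not Elastic's implementation: the normalization rules, the choice of SHA-256, and the names normalize_url, url_hash, content_hash, and SeenPages are all assumptions made for illustration. The idea is that equivalent URLs collapse to one key, while a separate hash of the page content catches duplicates served under different URLs.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical helpers for illustration only; not the Elastic crawler's API.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Reduce equivalent URLs to one canonical string (assumed rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the port when it is the default for the scheme.
    if parts.port is None or DEFAULT_PORTS.get(scheme) == parts.port:
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Discard the fragment; it is not sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))


def url_hash(url: str) -> str:
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def content_hash(title: str, body: str) -> str:
    """Hash the content after collapsing whitespace and case."""
    canonical = " ".join(f"{title}\n{body}".split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SeenPages:
    """Tracks which URLs and which page contents have been seen."""

    def __init__(self) -> None:
        self._url_hashes: set[str] = set()
        self._content_to_url: dict[str, str] = {}

    def classify(self, url: str, title: str, body: str) -> str:
        u = url_hash(url)
        if u in self._url_hashes:
            return "duplicate_url"
        self._url_hashes.add(u)
        c = content_hash(title, body)
        if c in self._content_to_url:
            return "duplicate_content"
        self._content_to_url[c] = normalize_url(url)
        return "new"
```

Under these assumptions, a page reached through a reordered query string is recognized by its URL hash, and the same text published at a second path is recognized by its content hash. A production crawler would need more careful rules than this sketch, for example deciding which content fields take part in the hash.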