Mastering Firecrawl's Crawl Endpoint: A Complete Web Scraping Guide
Blog post from Firecrawl
Firecrawl's /v2/crawl endpoint discovers and scrapes every page on a site and returns clean markdown, which makes it well suited to building training datasets and knowledge bases. It combines crawling and scraping in one service: it analyzes the starting URL, traverses links recursively, and extracts content from each page, with output available in formats such as markdown, HTML, and screenshots. The key configuration parameters are limit, include_paths, exclude_paths, crawl_entire_domain, sitemap, and scrape_options. For small jobs, the crawl() method can be used; for larger ones, start_crawl() is recommended, with results delivered by polling get_crawl_status(), by WebSocket streaming through watcher(), or by events pushed to a webhook URL. The service is available through a REST API, Python and Node SDKs, an MCP server, and a CLI. Pricing is in credits per page, with additional charges for JSON extraction, enhanced proxy, and PDF parsing. Firecrawl is also useful for handling JavaScript-rendered content and pages with complex requirements, for filtering scraped content, and for feeding RAG pipelines. Asynchronous operation, webhook configuration, and performance management are supported as well, which keeps large-scale crawls efficient.
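
To make the parameters and the polling delivery mode described above more concrete, here is a minimal Python sketch. It does not call Firecrawl and does not import the SDK. `build_crawl_options` and `wait_for_crawl` are hypothetical helpers, the `"include"` sitemap value and the `formats` key are assumptions, and so are the status strings. How the resulting dict maps onto the real SDK call or REST body should be checked against the Firecrawl documentation.

```python
import time
from typing import Any, Callable, Dict, List, Optional

# Assumed terminal states; verify the real status values in the Firecrawl docs.
TERMINAL_STATES = ("completed", "failed", "cancelled")


def build_crawl_options(
    limit: int = 100,
    include_paths: Optional[List[str]] = None,
    exclude_paths: Optional[List[str]] = None,
    crawl_entire_domain: bool = False,
    sitemap: str = "include",  # assumed value, not confirmed by the post
    formats: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Collect the crawl parameters named in the post into one plain dict."""
    options: Dict[str, Any] = {
        "limit": limit,
        "crawl_entire_domain": crawl_entire_domain,
        "sitemap": sitemap,
        # Assumed shape; the SDK may expect a typed options object instead.
        "scrape_options": {"formats": formats or ["markdown"]},
    }
    # Only include path filters when the caller actually supplied them.
    if include_paths:
        options["include_paths"] = include_paths
    if exclude_paths:
        options["exclude_paths"] = exclude_paths
    return options


def wait_for_crawl(
    get_status: Callable[[str], Any],
    job_id: str,
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_status(job_id) until a terminal state or until timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        # Accept either a dict or an object with a .status attribute.
        if isinstance(status, dict):
            state = status.get("status")
        else:
            state = getattr(status, "status", None)
        if state in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl {job_id} not finished after {timeout}s")
        sleep(poll_interval)
```

In real use, `get_status` would presumably be the SDK client's get_crawl_status() method and `job_id` the identifier returned by start_crawl(). Injecting both keeps the polling loop testable without network access or credits.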