Why Scrapinghub’s AutoExtract Chose Confluent Cloud for Their Apache Kafka Needs

Post Details

Company

Confluent

Date Published

Oct. 3, 2019

Author

Matt Mangia, Gil Friedlis, Ian Duffy

Word Count

1,036

Company Posts That Month

11

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.confluent.io/blog/why-scrapinghub-chose-confluent-cloud-kafka-service

Summary

The Scrapinghub team uses Apache Kafka, Flink, and MongoDB to build a RAG-enabled GenAI data extraction API called AutoExtract, which extracts structured data from web pages without requiring custom code. The system receives a URL as input, fetches and renders the page, and then processes the content using an AI-powered data extraction engine. Confluent Cloud is used to scale and distribute requests, providing on-demand instances and eliminating management overhead. The team chose Confluent Cloud over alternatives due to its vendor-independent pricing model, ease of use, and reliability. After migrating to Confluent Cloud, the team saw no latency or throughput problems during load testing and reported only minor tradeoffs, such as limited access to ZooKeeper.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	4	415	71	26	-16%
RAG	1	6	6	2	-45%
Real-time	1	354	133	58	-28%
