
Why Scrapinghub’s AutoExtract Chose Confluent Cloud for Their Apache Kafka Needs

Blog post from Confluent

Post Details
Company: Confluent
Date Published:
Author: Matt Mangia, Gil Friedlis, Ian Duffy
Word Count: 1,036
Company Posts That Month: 11
Language: English
Hacker News Points: -
Post removed?: No
Summary

The Scrapinghub team uses Apache Kafka, Flink, and MongoDB to build AutoExtract, a RAG-enabled GenAI data extraction API that extracts structured data from web pages without requiring custom code. The system receives a URL as input, fetches and renders the page, and then processes the content with an AI-powered data extraction engine. Confluent Cloud is used to scale and distribute requests, providing on-demand instances and eliminating management overhead. The team chose Confluent Cloud over alternatives for its vendor-independent pricing model, ease of use, and reliability. After migrating to Confluent Cloud, the team saw no latency or throughput problems during load testing and reported only minor tradeoffs, such as limited access to ZooKeeper.
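
The request flow described in the summary (URL in, fetch and render, extract, with a broker distributing the work) can be sketched roughly as follows. This is only an illustration under stated assumptions, not AutoExtract's actual code: in-memory queues stand in for Kafka topics, and `fetch_and_render`, `extract`, and `run_pipeline` are hypothetical names with stubbed bodies.

```python
import queue
import threading

# In-memory queues stand in for Kafka topics (assumption: the summary does
# not name the topics or describe the message format).
requests_topic = queue.Queue()
results_topic = queue.Queue()


def fetch_and_render(url):
    """Hypothetical stub: a real service would download and render the page."""
    return f"<html><title>Page at {url}</title></html>"


def extract(html):
    """Hypothetical stub standing in for the AI-powered extraction engine."""
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>")
    return {"title": html[start:end]}


def worker():
    """Consume URLs until a None sentinel arrives, publishing one result each."""
    while True:
        url = requests_topic.get()
        if url is None:
            break
        results_topic.put({"url": url, "data": extract(fetch_and_render(url))})


def run_pipeline(urls, num_workers=2):
    """Distribute URLs across workers and return the results keyed by URL."""
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for url in urls:
        requests_topic.put(url)
    for _ in workers:
        requests_topic.put(None)  # one stop signal per worker
    for t in workers:
        t.join()
    results = {}
    while not results_topic.empty():
        item = results_topic.get()
        results[item["url"]] = item["data"]
    return results
```

Calling `run_pipeline(["https://example.com/a"])` returns a dictionary mapping each URL to its extracted fields. In a deployment like the one the summary describes, the queues would presumably be replaced by broker topics so that workers can be added or removed on demand.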

Trends Found in this Post
Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM
Kubernetes | 4 | 415 | 71 | 26 | -16%
RAG | 1 | 6 | 6 | 2 | -45%
Real-time | 1 | 354 | 133 | 58 | -28%