Autoscaling Spark Structured Streaming Applications in AWS
Blog post from Snowplow
Scaling Spark Structured Streaming applications in AWS is harder than scaling batch workloads because Spark's built-in dynamic allocation is a poor fit for continuous processing: it releases executors based on idle time, and in a long-running streaming job executors are rarely idle long enough to be reclaimed. To work around this, the post proposes custom autoscaling logic built on Spark's Developer APIs: monitor the processing duration of each micro-batch and grow or shrink the cluster through the requestExecutors and killExecutor methods of the SparkContext class. A sketch of this approach follows below.

For teams that would rather not manage this themselves, AWS offers two alternatives: Kinesis Data Analytics, which runs Apache Flink jobs as a serverless stream-processing service and handles scaling on its own, and EMR with Auto Scaling, which adds and removes capacity for Spark workloads based on CPU and memory metrics. Snowplow takes the first route, using Kinesis Data Analytics with the Flink libraries for real-time aggregation jobs that autoscale out of the box.

Whichever path you choose, the same best practices apply to Spark streaming jobs: tune the trigger interval so each micro-batch completes within it, adjust the number of data partitions to match the cluster's parallelism, and wire the job into AWS CloudWatch for monitoring and alerting.

In short, while Spark's dynamic allocation is suboptimal for streaming, custom autoscaling logic driven by batch duration, combined with these AWS services, keeps data processing performance efficient as load changes. The sketches below illustrate each of these pieces in turn.
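First, the custom autoscaler. The sketch below is a minimal illustration, not a production implementation: requestExecutors and killExecutor are the real SparkContext Developer APIs, but the listener class name, the thresholds, and the one-executor-at-a-time policy are illustrative assumptions, and the calls only take effect on cluster managers (such as YARN) that support external executor allocation.

```scala
import scala.collection.mutable

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical autoscaler: scale out when a micro-batch takes longer than the
// trigger interval, scale in when it finishes with comfortable headroom.
class BatchDurationAutoscaler(
    spark: SparkSession,
    triggerIntervalMs: Long,
    minExecutors: Int
) extends StreamingQueryListener {

  private val sc = spark.sparkContext
  private val executorIds = mutable.Set.empty[String]

  // Track live executor IDs so we know which executor to release on scale-in.
  sc.addSparkListener(new SparkListener {
    override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
      executorIds.synchronized { executorIds += e.executorId }
    override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
      executorIds.synchronized { executorIds -= e.executorId }
  })

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // "triggerExecution" measures the whole micro-batch, end to end.
    val batchMs = Option(event.progress.durationMs.get("triggerExecution"))
    batchMs.map(_.longValue).foreach { ms =>
      if (ms > triggerIntervalMs) {
        // Falling behind: ask the cluster manager for one more executor.
        sc.requestExecutors(1)
      } else if (ms < triggerIntervalMs / 2) {
        // Plenty of headroom: release one executor, respecting the floor.
        executorIds.synchronized {
          if (executorIds.size > minExecutors) {
            executorIds.headOption.foreach(sc.killExecutor)
          }
        }
      }
    }
  }
}
```

Register it once, before starting the query, with spark.streams.addListener(new BatchDurationAutoscaler(spark, 30000L, 2)). Scaling by one executor per batch is deliberately conservative; a production version would also want a cooldown period and an upper bound.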
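On the managed side, a Kinesis Data Analytics application is just an ordinary Flink job; the service owns the scaling decision, so there is no allocation code at all. The sketch below shows an assumed minimal shape for such an aggregation job: the stream name, region, and comma-separated key extraction are placeholders rather than Snowplow's actual pipeline.

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer
import org.apache.flink.streaming.connectors.kinesis.config.{AWSConfigConstants, ConsumerConfigConstants}

object KinesisAggregationJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val consumerConfig = new Properties()
    consumerConfig.put(AWSConfigConstants.AWS_REGION, "eu-west-1") // placeholder region
    consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")

    // Read raw events from a Kinesis stream (stream name is a placeholder).
    val events: DataStream[String] = env.addSource(
      new FlinkKinesisConsumer[String]("enriched-events", new SimpleStringSchema(), consumerConfig))

    // Count events per key over one-minute tumbling windows.
    events
      .map(line => (line.takeWhile(_ != ','), 1L)) // hypothetical: key is the first CSV field
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
      .sum(1)
      .print()

    env.execute("kinesis-aggregation")
  }
}
```

When this runs on Kinesis Data Analytics, parallelism and autoscaling are configured on the application itself, not in the code.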
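The batch-duration and partitioning advice translates into two small knobs in any Structured Streaming job. The snippet below is illustrative: the "kinesis" format and its streamName option come from a separate connector rather than core Spark, and the trigger interval and partition count are assumptions to tune against your own workload.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("tuned-streaming-job")
  // Far fewer shuffle partitions than the default 200: small micro-batches
  // should not pay for hundreds of near-empty tasks.
  .config("spark.sql.shuffle.partitions", "32")
  .getOrCreate()

val events = spark.readStream
  .format("kinesis")                    // assumes a third-party Kinesis connector
  .option("streamName", "enriched-events")
  .load()

events.writeStream
  .format("console")
  // The trigger interval is the budget each micro-batch must fit into;
  // it is the same target the autoscaler sketch above compares against.
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
```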
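Finally, CloudWatch integration can hang off the same listener hook. The sketch below pushes each micro-batch's duration as a custom metric using the AWS SDK for Java (v1), so a CloudWatch alarm can fire when the job falls behind. The namespace, metric name, and dimension are my own choices, and it assumes the streaming query has been given a name.

```scala
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{Dimension, MetricDatum, PutMetricDataRequest, StandardUnit}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Publishes each micro-batch's duration as a custom CloudWatch metric.
// Namespace and metric names below are illustrative, not a Snowplow convention.
class CloudWatchReporter(namespace: String) extends StreamingQueryListener {
  private val cw = AmazonCloudWatchClientBuilder.defaultClient()

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    Option(event.progress.durationMs.get("triggerExecution")).foreach { ms =>
      val datum = new MetricDatum()
        .withMetricName("BatchDurationMs")
        .withUnit(StandardUnit.Milliseconds)
        .withValue(ms.doubleValue)
        .withDimensions(new Dimension().withName("QueryName").withValue(event.progress.name))
      cw.putMetricData(new PutMetricDataRequest().withNamespace(namespace).withMetricData(datum))
    }
  }
}
```

With the metric in place, a standard CloudWatch alarm on BatchDurationMs exceeding the trigger interval closes the monitoring loop.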