Spot Instances and Spark: How to Run Reliably Without Paying On-Demand Prices

Blog post from Acceldata

Post Details
Company: Acceldata
Author: Shreya Bose
Word Count: 1,633
Company Posts That Month: 28
Language: English
Hacker News Points: -
Summary

In January 2026, AWS added Spot interruption metrics to EC2 Capacity Manager, giving teams the data they need to measure interruptions and tune configuration when running Spot instances at scale, particularly for Spark on EKS. Spot instances cost up to 90% less than on-demand pricing, but AWS can reclaim them with minimal notice, so capturing the savings without sacrificing job reliability depends on deliberate configuration and observability. Spark suits this environment because it tolerates executor loss and recomputes lost tasks. The difficulty arises when an interruption takes shuffle output with it: the lost output has to be regenerated, which can lead to costly retries and put SLAs at risk. To reduce that risk, AWS recommends enabling Spark's decommissioning settings and diversifying instance types so that capacity is less likely to be lost all at once. Visibility into interruptions matters just as much. It requires a control plane that correlates EC2 events, Kubernetes pod lifecycles, and Spark executor context, which solutions such as Acceldata xLake provide. With that combined view of infrastructure events and their effect on job-level outcomes, teams can make informed FinOps decisions and run Spark on EC2 Spot instances in a way that maximizes savings while maintaining reliability.
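
The summary mentions Spark decommissioning settings without listing them. As a minimal sketch, the helper below assembles the decommissioning properties available in Spark 3.1 and later and renders them as spark-submit arguments. Which of these AWS actually recommends is an assumption, the function names are hypothetical, and the bucket in the usage note is a placeholder.

```python
# Sketch only: the property names are real Spark 3.1+ settings, but this
# particular selection is an assumption; the summary does not list them.

def decommission_conf(fallback_path=None):
    """Return Spark properties that let a draining executor migrate its blocks."""
    conf = {
        # Let executors enter a decommissioning state instead of dying abruptly.
        "spark.decommission.enabled": "true",
        # Migrate stored blocks off the executor while it is decommissioning.
        "spark.storage.decommission.enabled": "true",
        "spark.storage.decommission.shuffleBlocks.enabled": "true",
        "spark.storage.decommission.rddBlocks.enabled": "true",
    }
    if fallback_path:
        # Optional external location for blocks that cannot reach a peer in time.
        conf["spark.storage.decommission.fallbackStorage.path"] = fallback_path
    return conf


def to_submit_args(conf):
    """Render properties as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(decommission_conf())))
```

Calling `decommission_conf("s3a://example-bucket/spark-fallback/")` adds the fallback storage path, where `example-bucket` stands in for a real bucket. Keeping the settings in one function makes it easier to apply the same Spot-related configuration across jobs.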

Trends Found in this Post
Trend          Post Mentions   Total Month Mentions   Posts   Companies   MoM
Kubernetes     15              1,993                  294     100         +1%
Observability  3               3,430                  674     183         +0%
Real-time      1               5,457                  1,338   238         -5%