Spot Instances and Spark: How to Run Reliably Without Paying On-Demand Prices

Post Details

Company

Acceldata

Date Published

June 25, 2026

Author

Shreya Bose

Word Count

1,633

Company Posts That Month

28

Language

English

Hacker News Points

-

Source URL

www.acceldata.io/blog/spot-instances-and-spark-how-to-run-reliably-without-paying-on-demand-prices

Summary

In January 2026, AWS added Spot interruption metrics to EC2 Capacity Manager, making it easier to measure interruptions and tune the configuration needed to run Spot instances at scale, particularly for Spark on EKS. Spot instances can cost up to 90% less than On-Demand instances, but AWS can reclaim them with minimal notice, so capturing the savings without sacrificing job reliability takes deliberate configuration and observability. Spark suits these environments because it tolerates executor loss and recomputes lost tasks. The difficulty comes when an interruption destroys shuffle output, which can trigger costly retries and put SLAs at risk. To mitigate this, AWS recommends specific configurations, such as enabling Spark's decommissioning settings and diversifying instance types to reduce the risk of losing capacity all at once. Visibility into interruptions is also crucial: it requires a control plane that correlates EC2 events, Kubernetes pod lifecycles, and Spark executor context, which solutions like Acceldata xLake provide. That combined view of infrastructure events and their job-level impact supports informed FinOps decisions, so teams can run Spark on EC2 Spot to maximize savings while maintaining reliability.
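
The summary mentions Spark decommissioning settings without listing them. As a minimal sketch, the helper below assembles the decommissioning properties documented for Apache Spark 3.1 and later into spark-submit arguments. The function names and the S3 fallback path are hypothetical, and the exact set of properties the original post recommends is an assumption here, so check the values against the Spark version and storage layer in use.

```python
from typing import Dict, List, Optional


def decommission_conf(fallback_path: Optional[str] = None) -> Dict[str, str]:
    """Build Spark properties for graceful executor decommissioning.

    The keys are standard Spark 3.1+ properties. Which of them the
    original post recommends is an assumption of this sketch.
    """
    conf = {
        # Let executors shut down gracefully when their node is reclaimed.
        "spark.decommission.enabled": "true",
        # Migrate blocks off an executor that is being decommissioned.
        "spark.storage.decommission.enabled": "true",
        # Move cached RDD blocks to surviving executors.
        "spark.storage.decommission.rddBlocks.enabled": "true",
        # Move shuffle files so downstream stages avoid recomputation.
        "spark.storage.decommission.shuffleBlocks.enabled": "true",
    }
    if fallback_path is not None:
        # Optional location for blocks that cannot be moved to a peer.
        conf["spark.storage.decommission.fallbackStorage.path"] = fallback_path
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten properties into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    example = decommission_conf("s3a://example-bucket/spark-fallback/")
    print(" ".join(to_submit_args(example)))
```

Shuffle block migration is the setting most directly tied to the retry cost described above, since it aims to preserve shuffle output that would otherwise be lost with the reclaimed node.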

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	15	1,993	294	100	+1%
Observability	3	3,430	674	183	+0%
Real-time	1	5,457	1,338	238	-5%