Spot Instances and Spark: How to Run Reliably Without Paying On-Demand Prices
Blog post from Acceldata
AWS added Spot interruption metrics to EC2 Capacity Manager in January 2026, making it easier to measure and tune Spot usage at scale, particularly for Spark on EKS. Spot instances cost up to 90% less than on-demand capacity, but AWS can reclaim them with minimal notice, so capturing the savings without sacrificing job reliability takes deliberate configuration and observability. Spark suits this environment because it tolerates executor loss and recomputes lost tasks. The trouble starts when an interruption destroys shuffle output: the work that produced it has to be redone, which leads to costly retries and SLA risk. To mitigate this, AWS recommends enabling Spark's decommissioning settings and diversifying instance types to reduce the risk of losing capacity simultaneously. Visibility into interruptions matters just as much, and it requires a control plane that correlates EC2 events, Kubernetes pod lifecycles, and Spark executor context, which solutions such as Acceldata xLake aim to provide. With infrastructure events tied to job-level outcomes, teams can make informed FinOps decisions and run Spark on EC2 Spot instances for maximum savings while maintaining reliability.
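As an illustration of the decommissioning settings mentioned above, here is a minimal Python sketch that assembles Spark's graceful-decommissioning properties and renders them as `spark-submit` arguments. The property names are standard Spark configuration keys (available since Spark 3.1), but the summary does not say which ones the post or AWS lists, so treat this selection as an assumption. The helper names `spot_decommission_conf` and `to_submit_args` and the bucket path are hypothetical. Instance-type diversification is configured on the cluster side and is not covered here.

```python
from typing import Dict, List


def spot_decommission_conf(fallback_path: str = "") -> Dict[str, str]:
    """Build Spark properties for graceful executor decommissioning.

    Assumption: this set of keys is illustrative, not the exact list
    recommended in the original post.
    """
    conf = {
        # Let executors shut down gracefully when their node is reclaimed.
        "spark.decommission.enabled": "true",
        # Migrate blocks off a decommissioning executor before it exits.
        "spark.storage.decommission.enabled": "true",
        # Move shuffle blocks so downstream stages need not recompute them.
        "spark.storage.decommission.shuffleBlocks.enabled": "true",
        # Move cached RDD blocks as well.
        "spark.storage.decommission.rddBlocks.enabled": "true",
    }
    if fallback_path:
        # Optional: storage location used when no peer executor can take
        # the blocks (hypothetical path supplied by the caller).
        conf["spark.storage.decommission.fallbackStorage.path"] = fallback_path
    return conf


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render properties as `--conf key=value` pairs in a stable order."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    example = spot_decommission_conf("s3a://example-bucket/spark-fallback/")
    print(" ".join(to_submit_args(example)))
```

The printed flags can be appended to a `spark-submit` command or translated into the `sparkConf` section of a Spark-on-Kubernetes job spec. Whether block migration completes before the node disappears depends on the length of the interruption notice and the volume of shuffle data, so these settings reduce recomputation rather than eliminate it.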
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| Kubernetes | 15 | 1,993 | 294 | 100 | +1% |
| Observability | 3 | 3,430 | 674 | 183 | +0% |
| Real-time | 1 | 5,457 | 1,338 | 238 | -5% |