Apache Spark at ScyllaDB Summit, Part 1: Best Practices
Blog post from ScyllaDB
Apache Spark, a unified analytics engine for large-scale data processing, was a key topic at ScyllaDB Summit 2018, where Eyal Gutkind of ScyllaDB shared best practices for integrating Spark with ScyllaDB in heterogeneous data environments. Gutkind emphasized understanding how the Spark and ScyllaDB cluster configurations interact in order to make long-running analytics jobs more resilient, and he addressed common challenges such as deployment in diverse big data ecosystems and optimization of analytics workloads.

He offered concrete strategies for deploying Spark with ScyllaDB efficiently: sizing nodes appropriately, tuning partition sizes, and configuring concurrency settings and retry policies to improve data processing and reduce latency.

Gutkind also discussed the architectural differences between Spark and ScyllaDB, particularly in data sharding and batch processing, noting that Spark consumes data lazily in coarse-grained partitions, whereas ScyllaDB distributes data evenly across the nodes of its cluster. The session underscored the importance of resource management and system tuning for optimal performance, suggesting that running Spark and ScyllaDB on separate clusters can improve efficiency and performance.
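To make the tuning knobs above more concrete, the sketch below shows how partition (split) size, write concurrency, batch size, and retry count might be configured through the open-source Spark Cassandra Connector, which is commonly used to connect Spark to ScyllaDB. This is a minimal illustration, not a configuration taken from the talk: the host name, keyspace, table, and all numeric values are hypothetical placeholders to be replaced with values appropriate to your own cluster.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of connector-level tuning, assuming the Spark Cassandra
// Connector is on the classpath. All hosts, names, and values are illustrative.
object ScyllaSparkJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scylla-analytics-sketch")
      // Contact point for the ScyllaDB cluster (hypothetical host).
      .config("spark.cassandra.connection.host", "scylla-node1.example.com")
      // Partition sizing: smaller splits yield more, smaller Spark partitions.
      .config("spark.cassandra.input.split.sizeInMB", "64")
      // Concurrency: how many writes each executor keeps in flight.
      .config("spark.cassandra.output.concurrent.writes", "5")
      // Keep unlogged batches small so they stay within a single partition.
      .config("spark.cassandra.output.batch.size.rows", "100")
      // Retry policy: how many times a failed read/write is retried.
      .config("spark.cassandra.query.retry.count", "10")
      .getOrCreate()

    // Read a ScyllaDB table through the connector and run a simple aggregation.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "events"))
      .load()

    df.groupBy("event_type").count().show()

    spark.stop()
  }
}
```

The same settings can also be passed on the spark-submit command line with `--conf`, which keeps cluster-specific tuning out of application code.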