
Apache Spark at ScyllaDB Summit, Part 1: Best Practices

Blog post from ScyllaDB

Post Details

Company: ScyllaDB
Author: Peter Corless
Word Count: 2,273
Language: English
Summary

Apache Spark, a unified analytics engine for large-scale data processing, was a key topic at ScyllaDB Summit 2018, where Eyal Gutkind of ScyllaDB shared best practices for integrating Spark with ScyllaDB in heterogeneous data environments. Gutkind emphasized understanding the interplay between Spark and ScyllaDB cluster configurations to make long-running analytics jobs more resilient, and he addressed common challenges such as deploying in diverse big data ecosystems and optimizing analytics workloads. He offered concrete strategies for deploying Spark with ScyllaDB efficiently: sizing nodes appropriately, tuning partition sizes, and configuring concurrency settings and retry policies to improve throughput and reduce latency. Gutkind also examined the architectural differences between Spark and ScyllaDB, particularly in data sharding and batch processing, noting that Spark's lazy data consumption contrasts with ScyllaDB's even distribution of data across nodes. The session underscored the importance of resource management and system tuning, suggesting that running Spark and ScyllaDB on separate clusters can improve efficiency and performance.
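To make the kind of tuning described above concrete, here is a minimal sketch of a Spark session configured to talk to ScyllaDB through the DataStax Spark Cassandra Connector, which ScyllaDB supports. The host address, keyspace, and table names are hypothetical, and the setting values are illustrative starting points, not the talk's specific recommendations.

```scala
import org.apache.spark.sql.SparkSession

// A Spark session wired to a ScyllaDB cluster via the Spark Cassandra
// Connector. The connection settings below are the knobs the talk's
// themes map onto: concurrency, batch sizing, split sizing, and retries.
val spark = SparkSession.builder()
  .appName("scylla-analytics")
  .config("spark.cassandra.connection.host", "10.0.0.1")      // hypothetical Scylla node address
  .config("spark.cassandra.output.concurrent.writes", "5")    // cap on in-flight write batches per task
  .config("spark.cassandra.output.batch.size.rows", "auto")   // let the connector size write batches
  .config("spark.cassandra.input.split.size_in_mb", "64")     // approximate data per Spark partition
  .config("spark.cassandra.query.retry.count", "10")          // retries before a failed query surfaces
  .getOrCreate()

// Read a ScyllaDB table as a DataFrame (keyspace/table are illustrative)
val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "analytics", "table" -> "events"))
  .load()

events.show(5)
```

Lowering `output.concurrent.writes` and raising `query.retry.count` trades peak throughput for resilience in long-running jobs, while the input split size controls how finely Scylla's token ranges are carved into Spark partitions; which direction to move each knob depends on the workload and cluster sizing the talk discusses.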