
Optimizing Data Retrieval: A Deep Dive Into Sigma's Use Of Arrow Batches

Blog post from Sigma

Post Details

Company: Sigma
Date Published:
Author: Yifeng Wu
Word Count: 1,523
Language: English
Hacker News Points: -
Summary

Sigma has integrated Apache Arrow into its architecture to enhance data movement across services, significantly improving performance and efficiency by eliminating unnecessary data conversions. Apache Arrow's in-memory columnar format supports zero-copy read operations, enabling fast handling of large datasets, especially with cloud data warehouses like Snowflake and Databricks. By using Arrow batches, Sigma has optimized its connection manager to bypass redundant steps, allowing for more efficient resource management and improved system reliability. The transition to Arrow batches required addressing several challenges, such as the Year 2262 timestamp issue and precision loss in numerical data, with targeted solutions like microsecond-precision timestamps and high-precision flags. As a result, Sigma has achieved substantial reductions in memory and CPU usage, enhancing system stability and enabling larger, more complex workloads. The company is also working on integrating Arrow batches with other platforms like BigQuery and Databricks, and invites engineers to join its team to further push the boundaries of data analytics.
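The Year 2262 issue mentioned in the summary stems from nanosecond-precision timestamps being stored as a signed 64-bit integer count since the Unix epoch, which tops out in April 2262; switching to microsecond precision (as the post describes) extends the representable range by a factor of 1,000. A minimal sketch in plain Python (no Arrow dependency; illustrative only, not Sigma's actual code) shows the arithmetic:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT64_MAX = 2**63 - 1  # largest signed 64-bit tick count

# Nanosecond precision: the latest instant representable after the epoch.
# (Converted to whole microseconds for datetime arithmetic.)
ns_limit = EPOCH + timedelta(microseconds=INT64_MAX // 1_000)
print(ns_limit.year)  # 2262 -- the "Year 2262" ceiling

# Microsecond precision: the same 64-bit budget covers roughly 292,000
# years, so overflow is no longer a practical concern.
us_range_years = INT64_MAX / 1_000_000 / 86_400 / 365.25
print(round(us_range_years))
```

This is why a nanosecond-unit Arrow timestamp column can silently overflow on far-future dates, while a microsecond-unit column cannot in practice.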