Company
Mapbox
Date Published
-
Author
-
Word count
989
Language
English
Hacker News points
None

Summary

Mapbox processes over 300 million miles of anonymized location data daily from its mobile SDKs, generating roughly 30 billion speed estimates per week for roads worldwide, using Apache Spark for large-scale distributed computing. The data arrives as telemetry events, which are anonymized, privacy-filtered, and chained by coordinate into speed probes. These probes are map-matched against the global road network, aggregated into speed histograms, and used to estimate the expected speed on a road at a given time. The pipeline partitions its datasets along temporal and spatial dimensions to make querying efficient and the jobs scalable, with Airflow orchestrating the workflows. Addressing data skew is crucial to keeping Spark's distributed processing efficient; strategies include increasing the number of partitions, adding unique IDs, and salting skewed keys. The PySpark implementation enables rapid iteration and model improvement, but tuning it requires a solid understanding of both Spark internals and the characteristics of the data.
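
Two of the techniques mentioned above lend themselves to a short illustration: chaining consecutive telemetry coordinates into speed probes, and salting a skewed aggregation key before building per-segment speed histograms. The PySpark block below is a minimal sketch, not the actual Mapbox pipeline: the column names (trace_id, ts, lat, lon, road_segment_id), the input path, the 5 km/h histogram buckets, and the salt count are all assumptions, and map matching is skipped by pretending each event already carries a road_segment_id.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("speed-pipeline-sketch").getOrCreate()

# Hypothetical telemetry schema: one row per anonymized location event,
# with columns trace_id, ts (timestamp), lat, lon, road_segment_id.
events = spark.read.parquet("s3://example-bucket/telemetry-events/")

# 1. Chain consecutive coordinates within a trace into speed probes:
#    pair each event with the previous one and derive a speed estimate.
w = Window.partitionBy("trace_id").orderBy("ts")
probes = (
    events
    .withColumn("prev_ts", F.lag("ts").over(w))
    .withColumn("prev_lat", F.lag("lat").over(w))
    .withColumn("prev_lon", F.lag("lon").over(w))
    .where(F.col("prev_ts").isNotNull())
    # Crude planar distance; a real pipeline would use a haversine formula.
    .withColumn("dist_km", F.sqrt(
        ((F.col("lat") - F.col("prev_lat")) * 110.574) ** 2 +
        ((F.col("lon") - F.col("prev_lon")) * 111.320) ** 2))
    .withColumn("hours",
                (F.col("ts").cast("long") - F.col("prev_ts").cast("long")) / 3600.0)
    .where(F.col("hours") > 0)
    .withColumn("speed_kph", F.col("dist_km") / F.col("hours"))
)

# 2. Salt the aggregation key so a handful of very busy road segments
#    do not funnel all of their records through a single Spark task.
NUM_SALTS = 32
partial = (
    probes
    .withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
    .withColumn("hour_of_week",
                (F.dayofweek("ts") - 1) * 24 + F.hour("ts"))
    .withColumn("speed_bucket", (F.col("speed_kph") / 5).cast("int") * 5)
    .groupBy("road_segment_id", "hour_of_week", "speed_bucket", "salt")
    .agg(F.count("*").alias("partial_count"))
)

# 3. Drop the salt and merge partial counts into one histogram row
#    per (segment, hour-of-week, speed bucket).
histograms = (
    partial
    .groupBy("road_segment_id", "hour_of_week", "speed_bucket")
    .agg(F.sum("partial_count").alias("observations"))
)
```

The two-stage groupBy is what makes the salt worthwhile: the first aggregation spreads records for a hot road segment across up to NUM_SALTS tasks, and the second stage is cheap because it only merges already-aggregated counts.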