Company
Mapbox
Date Published
-
Author
-
Word count
989
Language
English
Hacker News points
None

Summary

Mapbox processes over 300 million miles of anonymized location data daily from its mobile SDKs, generating roughly 30 billion speed estimates per week for roads worldwide, using Apache Spark for large-scale distributed computing. The data arrives as telemetry events, which are anonymized, privacy-filtered, and chained by coordinate into speed probes. These probes are map-matched against the global road network, aggregated into speed histograms, and used to estimate the expected speed on a road at a given time. The pipeline partitions its datasets along temporal and spatial dimensions to make querying efficient and the jobs scalable, with Airflow orchestrating the workflows. Addressing data skew is crucial to keeping Spark's distributed processing efficient; strategies include increasing the number of partitions, adding unique IDs, and salting skewed keys. The PySpark implementation enables rapid iteration and model improvement, but tuning it requires a solid understanding of both Spark internals and the characteristics of the data.
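
Two of the techniques mentioned above lend themselves to a short illustration: chaining consecutive telemetry coordinates into speed probes, and salting a skewed aggregation key before building per-segment speed histograms. The PySpark block below is a minimal sketch, not the actual Mapbox pipeline: the column names (trace_id, ts, lat, lon, road_segment_id), the input path, the 5 km/h histogram buckets, and the salt count are all assumptions, and map matching is skipped by pretending each event already carries a road_segment_id.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("speed-pipeline-sketch").getOrCreate()

# Hypothetical telemetry schema: one row per anonymized location event,
# with columns trace_id, ts (timestamp), lat, lon, road_segment_id.
events = spark.read.parquet("s3://example-bucket/telemetry-events/")

# 1. Chain consecutive coordinates within a trace into speed probes:
#    pair each event with the previous one and derive a speed estimate.
w = Window.partitionBy("trace_id").orderBy("ts")
probes = (
    events
    .withColumn("prev_ts", F.lag("ts").over(w))
    .withColumn("prev_lat", F.lag("lat").over(w))
    .withColumn("prev_lon", F.lag("lon").over(w))
    .where(F.col("prev_ts").isNotNull())
    # Crude planar distance; a real pipeline would use a haversine formula.
    .withColumn("dist_km", F.sqrt(
        ((F.col("lat") - F.col("prev_lat")) * 110.574) ** 2 +
        ((F.col("lon") - F.col("prev_lon")) * 111.320) ** 2))
    .withColumn("hours",
                (F.col("ts").cast("long") - F.col("prev_ts").cast("long")) / 3600.0)
    .where(F.col("hours") > 0)
    .withColumn("speed_kph", F.col("dist_km") / F.col("hours"))
)

# 2. Salt the aggregation key so a handful of very busy road segments
#    do not funnel all of their records through a single Spark task.
NUM_SALTS = 32
partial = (
    probes
    .withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
    .withColumn("hour_of_week",
                (F.dayofweek("ts") - 1) * 24 + F.hour("ts"))
    .withColumn("speed_bucket", (F.col("speed_kph") / 5).cast("int") * 5)
    .groupBy("road_segment_id", "hour_of_week", "speed_bucket", "salt")
    .agg(F.count("*").alias("partial_count"))
)

# 3. Drop the salt and merge partial counts into one histogram row
#    per (segment, hour-of-week, speed bucket).
histograms = (
    partial
    .groupBy("road_segment_id", "hour_of_week", "speed_bucket")
    .agg(F.sum("partial_count").alias("observations"))
)
```

The two-stage groupBy is what makes the salt worthwhile: the first aggregation spreads records for a hot road segment across up to NUM_SALTS tasks, and the second stage is cheap because it only merges already-aggregated counts.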