In the Loop with Hadoop - Take Advantage of Data Locality with mongo-hadoop
Data locality is central to efficient Spark and Hadoop jobs: each worker node computes over a slice of the input data that already resides physically on that node, which avoids network transfers and speeds up the job. MongoDB, however, is not HDFS. When a job reads from MongoDB, each Hadoop data node must open a connection to the database and pull its input over the network, adding latency.

A sharded MongoDB cluster can restore much of this locality. A sharded collection is partitioned into chunks, and the chunks are distributed across the shards, each chunk residing on a single shard (with copies on the members of that shard's replica set). The mongo-hadoop connector reads this chunk metadata and creates InputSplits that map directly onto shard chunks, so each worker can compute over data held by a nearby shard. By co-locating the MongoDB shards with the Hadoop/Spark worker nodes and giving the connector the addresses of mongos instances running alongside the workers, you can read a sharded collection with far less network traffic and latency.
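As a minimal sketch of what this looks like in practice, the PySpark snippet below reads a sharded collection through the mongo-hadoop connector. The configuration keys are the connector's documented options, but the host names, database, and collection are placeholders, and the exact list format expected by `mongo.input.mongos_hosts` should be verified against the connector version you run. It also assumes the mongo-hadoop jar is already on the Spark classpath.

```python
from pyspark import SparkContext

sc = SparkContext(appName="mongo-hadoop-locality")

# Hypothetical topology: one mongos co-located with each Spark worker.
conf = {
    # Sharded collection to read; db/collection names are placeholders.
    "mongo.input.uri": "mongodb://mongos0.example.com:27017/demo.events",
    # Create one InputSplit per shard chunk so splits line up with shards.
    "mongo.input.split.read_shard_chunks": "true",
    # mongos instances co-located with the workers, so each split can be
    # read through a mongos near its chunk (list format is an assumption).
    "mongo.input.mongos_hosts": "mongodb://worker1.example.com:27017,"
                                "mongodb://worker2.example.com:27017",
}

# Standard mongo-hadoop InputFormat usage from PySpark: each RDD
# partition corresponds to one InputSplit, i.e. one shard chunk.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=conf,
)

print(rdd.count())  # partitions are read from (near-)local chunks
```

With `mongo.input.split.read_shard_chunks` enabled, the number of RDD partitions tracks the number of chunks in the collection, which is what lets the scheduler place each task next to its data.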