In the Loop with Hadoop - Take Advantage of Data Locality with mongo-hadoop
Data locality is central to efficient Spark and Hadoop jobs: each worker node computes over a slice of the input data that already resides physically on that node, which avoids network transfers and speeds up the job. MongoDB, however, is not HDFS. When a job reads from MongoDB, each Hadoop data node must open a connection to the database and pull its input over the network, adding latency.

A sharded MongoDB cluster can restore much of this locality. A sharded collection is partitioned into chunks, and the chunks are distributed across the shards, each chunk residing on a single shard (with copies on the members of that shard's replica set). The mongo-hadoop connector reads this chunk metadata and creates InputSplits that map directly onto shard chunks, so each worker can compute over data held by a nearby shard. By co-locating the MongoDB shards with the Hadoop/Spark worker nodes and giving the connector the addresses of mongos instances running alongside the workers, you can read a sharded collection with far less network traffic and latency.
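As a minimal sketch of what this looks like in practice, the PySpark snippet below reads a sharded collection through the mongo-hadoop connector. The configuration keys are the connector's documented options, but the host names, database, and collection are placeholders, and the exact list format expected by `mongo.input.mongos_hosts` should be verified against the connector version you run. It also assumes the mongo-hadoop jar is already on the Spark classpath.

```python
from pyspark import SparkContext

sc = SparkContext(appName="mongo-hadoop-locality")

# Hypothetical topology: one mongos co-located with each Spark worker.
conf = {
    # Sharded collection to read; db/collection names are placeholders.
    "mongo.input.uri": "mongodb://mongos0.example.com:27017/demo.events",
    # Create one InputSplit per shard chunk so splits line up with shards.
    "mongo.input.split.read_shard_chunks": "true",
    # mongos instances co-located with the workers, so each split can be
    # read through a mongos near its chunk (list format is an assumption).
    "mongo.input.mongos_hosts": "mongodb://worker1.example.com:27017,"
                                "mongodb://worker2.example.com:27017",
}

# Standard mongo-hadoop InputFormat usage from PySpark: each RDD
# partition corresponds to one InputSplit, i.e. one shard chunk.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=conf,
)

print(rdd.count())  # partitions are read from (near-)local chunks
```

With `mongo.input.split.read_shard_chunks` enabled, the number of RDD partitions tracks the number of chunks in the collection, which is what lets the scheduler place each task next to its data.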