Dealing with Hadoop's small files problem

Post Details

Company

Snowplow

Date Published

May 30, 2013

Author

Alex Dean

Word Count

1,626

Language

English

Hacker News Points

-

Source URL

snowplow.io/blog/dealing-with-hadoops-small-files-problem

Summary

Hadoop's small files problem can severely degrade the performance of MapReduce jobs. Snowplow ran into it when thousands of small CloudFront log files led to prolonged processing times. By aggregating these small files before processing, Snowplow cut the job's running time from nearly three hours to just nine minutes, a 1,867% improvement (roughly a twentyfold speed-up). Aggregation not only sped up the Enrichment process but also made loading into Redshift faster, because the job wrote fewer part-output files. Snowplow evaluated several solutions and ultimately chose Amazon's S3DistCp, which works directly with files in S3 and whose --groupBy option aggregates small files efficiently. The Elasticity Ruby library was used to add S3DistCp to the jobflow as a step that compresses the aggregated files into LZO format, improving Hadoop's performance without altering the primary ETL job. The case study shows how much solving the small files problem can matter for Hadoop job efficiency.
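To make the --groupBy idea concrete, here is a minimal Python sketch of regex-based grouping: file keys whose capture groups are identical end up in the same aggregated output. It only simulates the grouping logic and is not S3DistCp itself. The function name `group_small_files`, the sample file names, and the pattern are all hypothetical and chosen for illustration; the pattern Snowplow actually used is not given in this summary.

```python
import re
from collections import defaultdict


def group_small_files(keys, pattern):
    """Bucket file keys by the text captured by the pattern's groups.

    Keys that capture the same text would be concatenated into one
    larger file; keys that do not match are skipped in this sketch.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.fullmatch(key)
        if match is None:
            continue  # assumption: non-matching keys are left out
        groups["".join(match.groups())].append(key)
    return dict(groups)


# Hypothetical log file names, one small file per delivery.
keys = [
    "logs/EXAMPLE.2013-05-30-01.aaaa.gz",
    "logs/EXAMPLE.2013-05-30-01.bbbb.gz",
    "logs/EXAMPLE.2013-05-30-02.cccc.gz",
    "logs/readme.txt",
]

# Hypothetical pattern: group on the date-and-hour part of the name.
grouped = group_small_files(keys, r".*\.([0-9]+-[0-9]+-[0-9]+-[0-9]+)\..*")

for name, members in sorted(grouped.items()):
    print(name, len(members))
```

In this sketch, three small files collapse into two hourly groups and the unrelated file is ignored. Fewer, larger inputs mean fewer map tasks, which is where the reported speed-up comes from.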
