Dealing with Hadoop's small files problem

Blog post from Snowplow

Post Details
Company
Date Published
Author
Alex Dean
Word Count
1,626
Language
English
Summary

Hadoop's small files problem can severely degrade the performance of MapReduce jobs. Snowplow ran into it when its job had to process thousands of small CloudFront log files, which led to prolonged processing times. By aggregating these small files before processing, Snowplow cut the job's processing time from nearly three hours to just nine minutes, a 1,867% improvement. The aggregation sped up the Enrichment process and also made loading into Redshift faster, because the job produced fewer part-output files. Snowplow evaluated several solutions and ultimately chose Amazon's S3DistCp, which works directly with files in S3 and whose --groupBy option concatenates small files into larger ones. Snowplow used the Elasticity Ruby library to add S3DistCp as a step in the jobflow, where it aggregates the files and compresses them with LZO, improving Hadoop's performance without altering the primary ETL job. The case study shows how much solving the small files problem can matter for Hadoop job efficiency.
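
To make the --groupBy idea concrete, here is a minimal Python sketch of the grouping rule: file names that match a regular expression are bucketed by the text captured in its parentheses, and each bucket would become one aggregated output file. This only imitates the grouping logic; it is not S3DistCp itself and moves no data. The pattern, the function name group_files, and the sample file names are illustrative assumptions, not values taken from Snowplow's post.

```python
import re
from collections import defaultdict

# Hypothetical pattern (an assumption for this sketch): capture the date
# portion of a CloudFront-style log name so that all hourly files for one
# day fall into the same group.
GROUP_BY = re.compile(r".*\.([0-9]{4}-[0-9]{2}-[0-9]{2})-[0-9]{2}\..*")


def group_files(names, pattern=GROUP_BY):
    """Bucket file names by the text captured in the pattern's parentheses.

    Names that do not match the pattern are left out, mirroring the way a
    groupBy-style aggregation only picks up matching files.
    """
    groups = defaultdict(list)
    for name in names:
        match = pattern.fullmatch(name)
        if match is None:
            continue  # non-matching files are not aggregated
        key = "".join(match.groups())  # concatenated capture groups
        groups[key].append(name)
    return dict(groups)


# Made-up sample names in the CloudFront log style (distribution, date-hour, id).
sample = [
    "EXAMPLEDIST.2013-05-29-10.aaaa1111.gz",
    "EXAMPLEDIST.2013-05-29-11.bbbb2222.gz",
    "EXAMPLEDIST.2013-05-30-00.cccc3333.gz",
    "notes.txt",
]

grouped = group_files(sample)
for key, members in sorted(grouped.items()):
    print(key, len(members))
```

With these sample names, the two files from 29 May land in one group, the single file from 30 May in another, and notes.txt is ignored. Grouping by day rather than by hour is a design choice in this sketch: a coarser capture group yields fewer, larger files, which is the property that helps Hadoop.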