Company
Date Published
Author
Kelley Robinson
Word count
1334
Language
English
Hacker News points
None

Summary

Apache Spark is a distributed data processing engine built for working with large-scale data spread across many machines. In this tutorial, Kelley Robinson shows how to get started with Apache Spark by analyzing Pwned Passwords, a collection of over 500 million leaked passwords. The author uses Apache Zeppelin, a web-based interactive notebook, to create and run Spark applications: reading in CSV data, viewing schema information, and analyzing the data with SQL. The tutorial also demonstrates how to work with Datasets, a relatively recent abstraction in the Spark project, and explores common password lengths and patterns using Spark SQL functions. It serves as an introduction to Spark and its tools for working with distributed data, aimed at developers and data analysts who want to learn more about Apache Spark.
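The workflow summarized above (read a CSV, inspect the schema, query with SQL, convert to a typed Dataset, group by password length) might look roughly like the Scala sketch below. It is not taken from the tutorial: the file path, the `password` and `occurrences` column names, the `PasswordCount` case class, and the `lengthCounts` helper are all assumptions made for illustration, and the real data set may be laid out differently. In a Zeppelin notebook the `spark` session is normally provided already; it is built explicitly here only so the sketch stands alone.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, desc, length}

// Hypothetical record type: one leaked password and how often it was seen.
case class PasswordCount(password: String, occurrences: Long)

object PwnedPasswordsSketch {

  // Count how many distinct passwords exist for each length, most common first.
  def lengthCounts(ds: Dataset[PasswordCount]): DataFrame =
    ds.groupBy(length(col("password")).as("length"))
      .count()
      .orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a cluster deployment would configure this differently.
    val spark = SparkSession.builder()
      .appName("pwned-passwords-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumed layout: a header row with `password` and `occurrences` columns.
    // Without schema inference every column is read as a string, which keeps
    // numeric-looking passwords such as "00123" intact.
    val df = spark.read
      .option("header", "true")
      .csv("/path/to/passwords.csv") // hypothetical path

    // View the schema Spark derived from the file.
    df.printSchema()

    // Register the data as a temporary view so it can be queried with SQL.
    df.createOrReplaceTempView("passwords")
    spark.sql(
      "SELECT password, occurrences FROM passwords " +
        "ORDER BY CAST(occurrences AS BIGINT) DESC LIMIT 10"
    ).show()

    // Convert the untyped DataFrame into a typed Dataset.
    val ds: Dataset[PasswordCount] = df
      .select(col("password"), col("occurrences").cast("long"))
      .as[PasswordCount]

    lengthCounts(ds).show()

    spark.stop()
  }
}
```

Keeping the grouping logic in a small function that takes a `Dataset` makes it possible to try it on a handful of in-memory rows before pointing it at the full file.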