Apache Spark is a popular framework for building distributed data processing pipelines. It lets you run the same pipeline locally or on a cluster without changing the source code, relying on delayed (lazy) computation to make this possible. Spark offers three APIs for working with distributed datasets, each building on the previous one.

In this article, we explored how to use Redis as a backend for Spark DataFrames in Python, focusing on getting started with the DataFrame API and performing common operations such as filtering data by occupation and country. The process involved installing pyspark, building spark-redis, setting up a Redis server, loading data into Redis, and writing a pipeline that finds the most frequent occupation among famous people in each country. We also discussed the importance of scaling Redis appropriately with the Redis Cluster API to avoid bottlenecks when working with large datasets.
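To recap the core of the pipeline, here is a minimal sketch of reading a DataFrame from Redis with spark-redis and computing the most frequent occupation per country. The key prefix `famous_people` and the column names `name`, `occupation`, and `country` are assumptions for illustration; it also assumes the spark-redis jar is on the classpath (for example via `--jars` when submitting the job) and that Redis is running locally on the default port.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Point Spark at a local Redis instance (host/port are assumptions).
spark = (
    SparkSession.builder
    .appName("occupations-by-country")
    .config("spark.redis.host", "localhost")
    .config("spark.redis.port", "6379")
    .getOrCreate()
)

# Read hashes stored under the hypothetical "famous_people" key prefix.
people = (
    spark.read.format("org.apache.spark.sql.redis")
    .option("table", "famous_people")
    .option("key.column", "name")
    .load()
)

# Count people per (country, occupation) pair, then keep the most
# frequent occupation within each country.
counts = people.groupBy("country", "occupation").count()
by_country = Window.partitionBy("country").orderBy(F.desc("count"))
top_occupations = (
    counts.withColumn("rank", F.row_number().over(by_country))
    .filter(F.col("rank") == 1)
    .drop("rank")
)

top_occupations.show()
```

Because Spark is lazy, nothing is read from Redis until `show()` (or another action) is called, which is what lets the same code run unchanged locally or on a cluster.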