Modern Data Quality Testing for Spark Pipelines

Blog post from Soda

Post Details
Company: Soda
Date Published:
Author: Vijay Kiran
Word Count: 1,252
Language: English
Hacker News Points: -
Summary

Soda Spark has been deprecated and replaced by Soda Library, which connects Soda to Apache Spark. The post presents Soda Spark as a modern approach to data testing, monitoring, and reliability for engineering teams that work with PySpark DataFrames. It is aimed at data and analytics engineers who need to maintain data quality in data-intensive environments, and it provides an API that extracts data metrics and column profiles as defined in YAML configuration files. When connected to a Soda Cloud account, Soda Spark lets teams configure alerts for failed tests so that issues can be resolved quickly. The tool makes it simpler to write declarative data quality tests for Spark DataFrames, and it is compatible with a range of data workloads, engines, and environments, such as Kafka, AWS S3, and Google BigQuery. It installs with pip and integrates with platforms such as Databricks. Teams can collaborate through Soda Cloud, which offers anomaly detection, schema evolution monitoring, and visualization of test results. Soda's open-source and SaaS offerings emphasize data reliability and are backed by a community whose major contributors include Disney and HelloFresh.
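
To make the idea of declarative data quality tests concrete, here is a minimal pure-Python sketch. It is not Soda Spark's or Soda Library's API: the `spec` layout, the metric names, and the helpers `compute_metrics` and `run_checks` are hypothetical, and plain dictionaries stand in for DataFrame rows and for the YAML file. It only illustrates the pattern the summary describes, where tests are declared as data, metrics are computed from the rows, and each test is evaluated against those metrics.

```python
import operator
from typing import Any

# Comparison operators a declared test may use.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
}


def compute_metrics(rows: list[dict[str, Any]], columns: list[str]) -> dict[str, int]:
    """Compute a row count plus a missing-value count per listed column."""
    metrics = {"row_count": len(rows)}
    for col in columns:
        metrics[f"missing_count({col})"] = sum(1 for r in rows if r.get(col) is None)
    return metrics


def run_checks(rows: list[dict[str, Any]], spec: dict[str, Any]) -> dict[str, bool]:
    """Evaluate every declared test against the computed metrics."""
    metrics = compute_metrics(rows, spec["columns"])
    results = {}
    for metric, op, threshold in spec["tests"]:
        results[f"{metric} {op} {threshold}"] = OPS[op](metrics[metric], threshold)
    return results


# Hypothetical declarative spec; in a real setup this role is played by a YAML file.
spec = {
    "columns": ["id", "amount"],
    "tests": [
        ("row_count", ">", 0),
        ("missing_count(id)", "==", 0),
        ("missing_count(amount)", "==", 0),
    ],
}

# Stand-in for DataFrame rows; the second row has a missing amount.
rows = [
    {"id": "a1", "amount": 10.0},
    {"id": "a2", "amount": None},
]

results = run_checks(rows, spec)
failed = [name for name, passed in results.items() if not passed]
print(failed)  # ['missing_count(amount) == 0']
```

Keeping the tests as data rather than code is what allows them to live in a configuration file and be reviewed or changed without touching pipeline logic. A failed entry like the one above is the kind of result that could trigger an alert.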