Modern Data Quality Testing for Spark Pipelines

Post Details

Company

Soda

Date Published

Dec. 22, 2022

Author

Vijay Kiran

Word Count

1,252

Company Posts That Month

2

Language

English

Hacker News Points

-

Post removed?

No

Source URL

soda.io/blog/data-quality-testing-spark-pipelines

Summary

Soda Spark, the tool this post describes, has since been deprecated and replaced by the Soda Library, which connects Soda to Apache Spark and offers a modern approach to data testing, monitoring, and reliability for engineering teams using PySpark DataFrames. As presented in the post, Soda Spark helps data and analytics engineers maintain data quality in data-intensive environments: it provides an API that extracts data metrics and column profiles according to YAML configuration files, which lets engineers write declarative data quality tests for Spark DataFrames. It is installed with pip, integrates into platforms like Databricks, and is compatible with various data workloads, engines, and environments, such as Kafka, AWS S3, and Google BigQuery. When connected to a Soda Cloud account, teams can configure alerts for failed tests so that issues are resolved quickly, and they can collaborate through Soda Cloud features like anomaly detection, schema evolution monitoring, and visualization of test results. Soda's open-source and SaaS offerings emphasize data reliability and are backed by a community that includes major contributors like Disney and HelloFresh.
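
To make the idea of declarative data quality tests concrete, here is a minimal plain-Python sketch. It is not Soda Spark's or the Soda Library's API, and it uses neither Spark nor YAML: the scan definition is a plain dict with a made-up schema, the "DataFrame" is a list of dicts, and every name in it (`SCAN_DEFINITION`, `run_scan`, `run_test`, `column_metrics`) is hypothetical. It only illustrates the pattern of declaring tests as data and evaluating them against computed metrics.

```python
import operator

# Hypothetical declarative scan definition, written as a dict so the sketch
# needs no YAML parser. This is NOT Soda's actual configuration schema.
SCAN_DEFINITION = {
    "tests": ["row_count > 0"],
    "columns": {
        "id": {"tests": ["missing_count == 0"]},
        "email": {"tests": ["missing_count <= 1"]},
    },
}

# Comparison operators allowed in a test expression.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def column_metrics(rows, column):
    """Compute the per-column metrics that this sketch supports."""
    values = [row.get(column) for row in rows]
    return {"missing_count": sum(v is None for v in values)}


def run_test(expression, metrics):
    """Evaluate a 'metric operator threshold' expression against metrics."""
    name, op, threshold = expression.split()
    return OPS[op](metrics[name], float(threshold))


def run_scan(rows, definition):
    """Return a mapping of test description to pass/fail."""
    results = {}
    table_metrics = {"row_count": len(rows)}
    for expr in definition.get("tests", []):
        results[expr] = run_test(expr, table_metrics)
    for column, spec in definition.get("columns", {}).items():
        metrics = column_metrics(rows, column)
        for expr in spec.get("tests", []):
            results[f"{column}: {expr}"] = run_test(expr, metrics)
    return results


# Stand-in for a DataFrame: three rows, some with missing values.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": None, "email": None},
]

results = run_scan(rows, SCAN_DEFINITION)
failed = [name for name, passed in results.items() if not passed]
print(failed)
```

In a real pipeline the failed list is what would drive an alert. The point of the declarative form is that adding a new check means editing the definition, not the evaluation code.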

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Data Pipeline	1	655	104	37	+35%
