What is data ingestion?
Blog post from Starburst
Data ingestion is the first step in big data analytics: bringing raw data from many sources into a central repository such as a data lakehouse, where it is kept in its original format. It precedes data integration, the stage that transforms the data for usability by fixing quality issues and applying consistent formats.

Ingestion runs in one of two modes. Batch ingestion, the traditional approach, loads data on a schedule, which introduces latency between an event occurring and the data being available for analysis. Real-time (streaming) ingestion makes data available almost immediately, but it is more resource-intensive and more exposed to data quality problems such as duplicate or out-of-order records.

Starburst Galaxy builds its data ingestion framework on Trino and Apache Iceberg to offer a managed Kafka ingestion solution, giving analysts and data scientists near real-time access to streaming data. The framework runs on AWS and uses Apache Kafka and Apache Flink to provide fault tolerance and prevent duplicate records, all while preserving the raw data's analytic potential.

Combining Trino and Iceberg in an "Icehouse" architecture suits the platform to machine learning and data science workloads, improving accessibility and governance for agile decision-making. The overall aim is to streamline the analytics pipeline, making ingestion more reliable and enabling informed, data-driven decisions.
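To make the batch side of that contrast concrete, here is a minimal Python sketch of a scheduled batch job that lands raw files in a lake unchanged. The directory paths are placeholders, and this illustrates the general pattern only, not Starburst Galaxy's implementation:

```python
import json
from pathlib import Path

def batch_ingest(source_dir: str, landing_dir: str) -> int:
    """Copy raw JSON files from a source system into the lake's landing
    zone, preserving their original format. Transformation is deliberately
    absent: that is integration's job, downstream of ingestion."""
    landing = Path(landing_dir)
    landing.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in Path(source_dir).glob("*.json"):
        # Check that the file parses, but write the bytes through
        # unchanged so the raw data keeps its full analytic potential.
        json.loads(src.read_text())
        (landing / src.name).write_bytes(src.read_bytes())
        count += 1
    return count

if __name__ == "__main__":
    # Hypothetical paths; a scheduler (e.g. cron) would run this nightly.
    print(batch_ingest("exports/", "lake/landing/"))
```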
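The streaming side, and the duplicate-prevention concern that comes with it, might look like the following sketch using the open-source kafka-python client. Committing offsets only after a successful write gives at-least-once delivery, and the idempotence check absorbs the replays that this creates. The broker address, topic, and write_to_lake sink are all assumptions for illustration; Galaxy's managed ingestion handles this via Kafka and Flink rather than hand-rolled consumer code:

```python
from kafka import KafkaConsumer  # pip install kafka-python

def write_to_lake(record: str) -> None:
    # Stand-in sink: append to a local file. A real pipeline would write
    # to Iceberg tables via Flink or a connector instead.
    with open("events.jsonl", "a") as f:
        f.write(record + "\n")

def stream_ingest(topic: str) -> None:
    """Consume events continuously; commit offsets only after a successful
    write, so a crash replays uncommitted messages instead of losing them.
    The seen-key check then drops the duplicates those replays produce."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",   # placeholder broker
        enable_auto_commit=False,             # commit manually, post-write
        value_deserializer=lambda b: b.decode("utf-8"),
    )
    already_written: set[str] = set()
    for message in consumer:
        key = f"{message.partition}:{message.offset}"
        if key not in already_written:        # idempotence guard
            write_to_lake(message.value)
            already_written.add(key)
        consumer.commit()

if __name__ == "__main__":
    stream_ingest("events")  # hypothetical topic name
```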
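Once the data lands in Iceberg tables, it is queryable through Trino with plain SQL. A minimal example with the trino Python client follows; the connection details, catalog, schema, and table name are illustrative assumptions, not Galaxy-prescribed values:

```python
import trino  # pip install trino

# Connect to a Trino coordinator and point at an Iceberg catalog.
conn = trino.dbapi.connect(
    host="localhost",    # placeholder coordinator
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="landing",
)
cur = conn.cursor()
# Count today's freshly ingested events (hypothetical table and column).
cur.execute("SELECT count(*) FROM events WHERE event_date = current_date")
print(cur.fetchone())
```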