Company:
Date Published:
Author: Colton Padden
Word count: 1490
Language: English
Hacker News points: None

Summary

Training Large Language Models (LLMs) requires contextual data that is often dispersed across many sources, and keeping that data fresh calls for a robust pipeline rather than ad hoc scripting. Dagster can orchestrate the services involved in LLM training: it runs ingestion tasks, transforms and structures the data, and makes it available to the LLM. Paired with an ingestion tool like Airbyte and a language-model framework like LangChain, making data accessible to LLMs becomes feasible, maintainable, and scalable. The pipeline involves three steps: ingesting data with Airbyte, configuring the pipeline in Dagster, and loading the data. The final code can be found on GitHub, and the prerequisites are Python 3, Docker, an OpenAI API key, and a handful of dependencies. The example shows how Airbyte and Dagster bring data into a format that LangChain can use for question-answering applications, and the pipeline can be materialized from the command line or deployed to production using Dagster's features.
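The three steps above can be sketched in plain Python. This is a conceptual, standard-library-only sketch of the pipeline's shape, not the tutorial's actual code: in the real example, Airbyte performs the sync, each stage is a Dagster asset, and LangChain builds the vector index for question answering. All function names and sample records here are hypothetical stand-ins.

```python
# Conceptual sketch of the three pipeline stages (ingest, transform, load).
# In the actual tutorial: Airbyte handles ingestion, Dagster defines each
# stage as an asset, and LangChain loads documents for question answering.
# All names and sample data below are hypothetical.

def ingest_raw_records() -> list[dict]:
    """Stage 1: stands in for an Airbyte sync that lands raw records."""
    return [
        {"id": 1, "title": "Fix flaky test", "body": "The CI job fails intermittently."},
        {"id": 2, "title": "Docs typo", "body": "Update the README example."},
    ]

def transform_to_documents(records: list[dict]) -> list[str]:
    """Stage 2: stands in for a Dagster asset that structures the data."""
    return [f"{r['title']}\n\n{r['body']}" for r in records]

def load_into_index(documents: list[str]) -> dict[int, str]:
    """Stage 3: stands in for loading documents into a LangChain
    vector store so an LLM can answer questions over them."""
    return {i: doc for i, doc in enumerate(documents)}

index = load_into_index(transform_to_documents(ingest_raw_records()))
print(len(index))  # 2
```

In the real pipeline each of these functions would be decorated as a Dagster asset, so materializing the final asset (from the command line or the Dagster UI) pulls its upstream dependencies in order.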