
Getting Started with Unstructured and Delta Tables in Databricks

Blog post from Unstructured

Post Details
Company: Unstructured
Author: Maria Khalusova
Word Count: 1,761
Language: English
Summary

Enterprise data is scattered across many platforms and stored in formats such as PDFs and Word documents, which makes it hard for retrieval-augmented generation (RAG) systems and large language model (LLM) agents to use. Unstructured, combined with Databricks, addresses this with a preprocessing layer that connects to diverse data sources and extracts and transforms their contents into structured JSON. This guide walks through setting up a preprocessing workflow that converts documents stored in Amazon S3 into organized data ready for RAG with Databricks Vector Search. The workflow has four steps: creating source and destination connectors, partitioning the documents into structured JSON, chunking and embedding the results for similarity search, and storing them in a Delta Table on Databricks. Along the way, the guide covers setting up the necessary accounts, configuring the workflow, and tracking job progress, so that unstructured documents end up as searchable data for RAG applications.
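To make the partition, chunk, embed, and store stages more concrete, here is a minimal, self-contained Python sketch of the data flow. It is a toy illustration only and does not use the Unstructured or Databricks APIs: the function names (`partition`, `chunk`, `embed`, `to_rows`), the row fields, and the `s3://example-bucket/report.pdf` path are all hypothetical, and the hash-based "embedding" is a placeholder with no semantic meaning.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Element:
    """One structured piece of a document (hypothetical, simplified shape)."""
    type: str
    text: str


def partition(raw: str) -> list[Element]:
    """Toy partitioner: split on blank lines and label short,
    period-free blocks as titles. Assumption, not the real partitioner."""
    elements = []
    for block in raw.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        is_title = len(block) < 40 and not block.endswith(".")
        elements.append(Element("Title" if is_title else "NarrativeText", block))
    return elements


def chunk(elements: list[Element], max_chars: int = 200) -> list[str]:
    """Start a new chunk at every title, or when max_chars would be exceeded."""
    chunks, current = [], ""
    for el in elements:
        starts_section = el.type == "Title"
        too_long = len(current) + len(el.text) + 1 > max_chars
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{el.text}".strip()
    if current:
        chunks.append(current)
    return chunks


def embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedding: deterministic hash bytes scaled to [0, 1].
    A real workflow would call an embedding model here."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def to_rows(chunks: list[str], source: str) -> list[dict]:
    """Shape chunks as table-like rows (hypothetical column names)."""
    return [
        {"id": f"{source}#{i}", "source": source, "text": c, "embedding": embed(c)}
        for i, c in enumerate(chunks)
    ]


doc = (
    "Quarterly Report\n\n"
    "Revenue grew in the third quarter.\n\n"
    "Outlook\n\n"
    "We expect steady demand."
)
rows = to_rows(chunk(partition(doc)), "s3://example-bucket/report.pdf")
for row in rows:
    print(row["id"], repr(row["text"]))
```

In the workflow the guide describes, these stages run as a configured job between the source and destination connectors rather than as hand-written functions; the sketch only shows how each stage's output feeds the next.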