Getting Started with Unstructured and IBM watsonx.data
Blog post from Unstructured
Unstructured converts unstructured data, such as PDFs and emails held in cloud storage, into structured JSON and, optionally, embeddings, the forms that Generative AI (GenAI) workloads consume. It replaces the assortment of tools and scripts that would otherwise be needed to handle each file type. Unstructured connects to cloud data sources, parses and structures the files through its API, and sends the results to a destination such as IBM watsonx.data, with no manual parsing or glue code. Setup has three parts: create a source connector and a destination connector, configure a processing workflow whose partitioning strategy suits the document types involved, and optionally add chunking and embedding steps for downstream applications. The result is an automated pipeline from raw files in Azure Blob Storage to structured, searchable data in IBM watsonx.data, ready to support retrieval-augmented generation (RAG) pipelines and other LLM-powered tools.
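To make the shape of such a workflow concrete, here is a minimal sketch in plain Python. It does not call the Unstructured API or any IBM SDK. Every name in it (`WorkflowPlan`, `choose_strategy`, the connector labels, and the extension-to-strategy mapping) is a hypothetical stand-in chosen for illustration. It only models the idea described above: pick a partitioning strategy per document type, then optionally chunk and embed before writing to the destination.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a partitioning strategy label.
# The labels are illustrative assumptions, not values taken from the post.
STRATEGY_BY_EXTENSION = {
    ".pdf": "hi_res",  # layout-heavy documents get the more thorough strategy
    ".eml": "fast",    # plain emails can use a cheaper strategy
}
DEFAULT_STRATEGY = "auto"


def choose_strategy(filename: str) -> str:
    """Return the strategy label for a file, based on its extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    return STRATEGY_BY_EXTENSION.get(suffix, DEFAULT_STRATEGY)


@dataclass
class WorkflowPlan:
    """Hypothetical description of a source -> workflow -> destination pipeline."""
    source: str
    destination: str
    chunk: bool = False  # optional chunking step
    embed: bool = False  # optional embedding step

    def steps_for(self, filename: str) -> list[str]:
        """List the processing steps one file would pass through, in order."""
        steps = [f"partition:{choose_strategy(filename)}"]
        if self.chunk:
            steps.append("chunk")
        if self.embed:
            steps.append("embed")
        return steps


if __name__ == "__main__":
    plan = WorkflowPlan(
        source="azure-blob-storage",     # illustrative label only
        destination="ibm-watsonx-data",  # illustrative label only
        chunk=True,
        embed=True,
    )
    for name in ["reports/q1.PDF", "inbox/hello.eml", "notes/readme.txt"]:
        print(name, "->", plan.steps_for(name))
```

In a real deployment these choices would be made in Unstructured's own connector and workflow configuration. The sketch only shows how the per-document-type routing and the optional steps relate to one another.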