Airbyte now supports extracting text from documents

What's this blog post about?

Airbyte now supports extracting text from documents stored in S3, Azure Blob Storage, and Google Drive sources. The extracted text is emitted as Markdown, so it can be used in search scenarios and in applications built on language models. This makes unstructured data such as meeting notes, specifications, roadmaps, and descriptions of planned features usable alongside structured data: Airbyte extracts the text from these documents and sends it to a warehouse for further processing. The new experimental "Document File Type Format" lets users extract text content from PDF, Word, PowerPoint, and Google documents in the same way as structured data stored in CSV or Avro formats.
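
To illustrate how Markdown output might feed a search index, here is a minimal Python sketch that splits a document's extracted text into one chunk per heading. It is not Airbyte code: the record shape, the `document_key` and `content` field names, and the `split_markdown_by_heading` helper are all assumptions made for the example.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "## Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into sections, starting a new one at each heading.

    Text that appears before the first heading is kept as a section
    whose title is None.
    """
    sections = []
    title = None
    body = []

    def flush():
        # Emit the current section unless it is completely empty.
        if title is not None or any(line.strip() for line in body):
            sections.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return sections


# Hypothetical record; the field names are assumed for illustration only.
record = {
    "document_key": "notes/roadmap.pdf",
    "content": "# Roadmap\n\nShip Q1.\n\n## Risks\n\nStaffing.",
}

chunks = split_markdown_by_heading(record["content"])
for chunk in chunks:
    print(record["document_key"], chunk["title"], chunk["text"])
```

Splitting on headings keeps each chunk topically coherent, which tends to suit both keyword search and embedding-based retrieval; a fixed-size splitter would be a reasonable alternative for documents without headings.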

Company
Airbyte

Date published
Nov. 7, 2023

Author(s)
Joe Reuter

Word count
634

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.