Prompting Large Language Models to Solve Document Understanding
Blog post from Unstructured
Recent advances in document understanding use neural networks to transform image representations of documents into text. Systems such as Microsoft's UDOP combine the text and image modalities for more comprehensive processing. These models are pre-trained on vast datasets of image-text pairs, allowing them to learn document reading and text generation in an unsupervised fashion, and are later fine-tuned on specific tasks with supervised training.

Large language models like GPT-3 have shown strong capabilities for performing novel tasks from prompts alone, raising interest in applying them to document understanding. Doing so requires different considerations, since documents integrate images with text, and a document is typically understood in the context of an expected task. Datasets like DocVQA capture this by pairing document images with user questions.

Recent research has also explored models like ChatGPT, trained with reinforcement learning to respond to corrections, which have demonstrated abilities on document tasks such as converting unformatted data into structured tables.

The Unstructured team is exploring these methods to develop flexible interfaces for processing unstructured documents, and encourages engagement through platforms like LinkedIn, Huggingface, and GitHub.
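As a rough sketch of the table-conversion idea above: one way to prompt a model for this task is to wrap the unformatted text in an instruction template and send the result to a completion endpoint. The `build_table_prompt` helper, the column names, and the sample text below are hypothetical, for illustration only; they are not an API from the Unstructured library.

```python
def build_table_prompt(raw_text: str, columns: list[str]) -> str:
    """Build a prompt asking an LLM to extract a markdown table.

    This template is an illustrative assumption; real prompts would be
    tuned for the target model and document type.
    """
    header = " | ".join(columns)
    return (
        "Convert the following unformatted text into a markdown table "
        f"with columns: {header}.\n"
        "Return only the table.\n\n"
        f"Text:\n{raw_text}\n"
    )


# Hypothetical unformatted input, e.g. text extracted from an invoice scan.
raw = (
    "Acme Corp invoice 1042 dated 2023-01-15 totals $450. "
    "Globex invoice 1043 dated 2023-01-20 totals $980."
)
prompt = build_table_prompt(raw, ["vendor", "invoice", "date", "total"])
print(prompt)

# The prompt would then be sent to a language model, e.g.:
# response = llm.complete(prompt)  # hypothetical client call
```

The key design choice is that the schema (the column list) lives in the prompt rather than in code, so the same helper can target any table shape without retraining or fine-tuning the model.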