Prompting Large Language Models to Solve Document Understanding
Blog post from Unstructured
Document understanding systems use neural networks to transform document images into text, and recent systems such as Microsoft's UDOP incorporate both text and image data. These models are pre-trained on large datasets without supervision and then fine-tuned for specific tasks with supervised training. Large language models like GPT-3 have shown that they can handle novel tasks through prompting alone, but document understanding poses unique challenges because it combines text and image data. Benchmarks such as DocVQA illustrate the goal: comprehending a document well enough to answer questions about its image in the context of a specific task.

Recent progress includes prompting methods that let systems like ChatGPT incorporate user feedback to refine their outputs, and ChatGPT has demonstrated that it can perform document understanding tasks such as accurately reformatting extracted data. The Unstructured team is exploring these methods to build a more adaptable interface for processing unstructured documents, and invites interested readers to follow its ongoing research.
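To make the prompting approach concrete, here is a minimal sketch of document question answering with feedback-driven refinement. It assumes the document text has already been extracted (for example by OCR or a partitioning tool) and uses the OpenAI chat completions API as the language model; the model name, sample document, question, and follow-up feedback are illustrative placeholders, not part of the original post.

```python
# Sketch: prompt an LLM with extracted document text, ask a question,
# then feed user feedback back in to refine/reformat the answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder text standing in for OCR output from a document image.
document_text = """Invoice #1042
Billed to: Acme Corp
Total due: $1,250.00
Due date: 2023-04-15"""

messages = [
    {"role": "system",
     "content": "You answer questions about the document the user provides."},
    {"role": "user",
     "content": f"Document:\n{document_text}\n\nQuestion: What is the total amount due?"},
]

first_answer = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=messages,
).choices[0].message.content
print(first_answer)

# Simulated user feedback: ask the model to reformat its answer as JSON,
# continuing the same conversation so earlier context is retained.
messages += [
    {"role": "assistant", "content": first_answer},
    {"role": "user",
     "content": 'Return the answer as JSON: {"total_due": <number>, "currency": <string>}'},
]
refined_answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
).choices[0].message.content
print(refined_answer)
```

The follow-up turn is the key idea: rather than retraining the model, corrections and formatting requests are supplied as additional conversation turns, and the model revises its output in place.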