
Closing the parsing gap: reaching SOTA RTL parsing by leveraging LTR capabilities

Blog post from AI21 Labs

Post Details

Author: Yuval Peleg Levy, Algorithm Engineer
Word Count: 2,412
Language: English
Summary

A novel approach called Word Shape Encoding has been developed to improve the parsing of PDF documents written in right-to-left (RTL) languages, such as Hebrew and Arabic, by converting them into a more easily parsed left-to-right (LTR) format. Traditional parsing strategies often struggle with RTL languages, introducing errors and reducing accuracy. This new method involves encoding RTL text by matching it to English words with similar visual geometry, preserving the original document's structure and layout.

The approach was tested using a synthetic dataset and showed significant improvements in parsing accuracy, especially for Hebrew, across various models, including Vision-Language Models, modular pipelines, and commercial SaaS solutions. However, results for Arabic were mixed, highlighting the importance of a model's exposure to specific languages during training.

The method's dependency on PDF metadata limits its applicability to other formats, prompting the development of a new model trained on data generated by this encoding method, which has shown promise in maintaining high-quality RTL parsing. This strategy underscores the potential of transforming language-specific challenges into problems that existing models are better equipped to handle, suggesting broader applications beyond just text parsing.
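The core idea — substituting each RTL word with an English stand-in of similar visual shape, while keeping a reversible mapping so the parsed LTR output can be decoded back — can be sketched as follows. This is a minimal illustration with hypothetical names and a deliberately crude similarity criterion (character length); the actual method described in the post matches on visual geometry and is considerably more sophisticated.

```python
# Hypothetical sketch of a word-shape-style encoding. Character length is a
# crude stand-in for the visual-geometry matching described in the post.

ENGLISH_VOCAB = ["on", "it", "cat", "dog", "tree", "lamp", "house", "plane", "glasses"]

def build_buckets(vocab):
    """Group candidate English words by character length as a rough
    proxy for similar visual width."""
    buckets = {}
    for word in vocab:
        buckets.setdefault(len(word), []).append(word)
    return buckets

def encode(rtl_words, buckets):
    """Replace each RTL word with an English stand-in of similar width,
    keeping a reverse mapping so the output can later be decoded."""
    word_to_standin = {}
    standin_to_word = {}
    encoded = []
    for word in rtl_words:
        if word not in word_to_standin:
            candidates = buckets.get(len(word), buckets[max(buckets)])
            # Pick an unused candidate; suffix a counter if the bucket is exhausted.
            stand_in = next((c for c in candidates if c not in standin_to_word), None)
            if stand_in is None:
                stand_in = candidates[0] + str(len(standin_to_word))
            word_to_standin[word] = stand_in
            standin_to_word[stand_in] = word
        encoded.append(word_to_standin[word])
    return encoded, standin_to_word

def decode(parsed_words, mapping):
    """Map the parsed LTR output back to the original RTL words."""
    return [mapping.get(w, w) for w in parsed_words]
```

The key design property is that the transformation is a bijection over distinct words: repeated RTL words receive the same stand-in, so document structure and word frequencies survive the round trip, and the parser only ever sees familiar LTR text.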