
Closing the parsing gap: reaching SOTA RTL parsing by leveraging LTR capabilities

Blog post from AI21 Labs

Post Details

Author: Yuval Peleg Levy, Algorithm Engineer
Word Count: 2,412
Language: English
Summary

A novel approach called Word Shape Encoding has been developed to improve the parsing of PDF documents written in right-to-left (RTL) languages, such as Hebrew and Arabic, by converting them into a more easily parsed left-to-right (LTR) format. Traditional parsing strategies often struggle with RTL languages, introducing errors and reducing accuracy. This new method involves encoding RTL text by matching it to English words with similar visual geometry, preserving the original document's structure and layout.

The approach was tested using a synthetic dataset and showed significant improvements in parsing accuracy, especially for Hebrew, across various models, including Vision-Language Models, modular pipelines, and commercial SaaS solutions. However, results for Arabic were mixed, highlighting the importance of a model's exposure to specific languages during training.

The method's dependency on PDF metadata limits its applicability to other formats, prompting the development of a new model trained on data generated by this encoding method, which has shown promise in maintaining high-quality RTL parsing. This strategy underscores the potential of transforming language-specific challenges into problems that existing models are better equipped to handle, suggesting broader applications beyond just text parsing.
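The core idea — substituting each RTL word with an English stand-in of similar visual shape, while keeping a reversible mapping so the parsed LTR output can be decoded back — can be sketched as follows. This is a minimal illustration with hypothetical names and a deliberately crude similarity criterion (character length); the actual method described in the post matches on visual geometry and is considerably more sophisticated.

```python
# Hypothetical sketch of a word-shape-style encoding. Character length is a
# crude stand-in for the visual-geometry matching described in the post.

ENGLISH_VOCAB = ["on", "it", "cat", "dog", "tree", "lamp", "house", "plane", "glasses"]

def build_buckets(vocab):
    """Group candidate English words by character length as a rough
    proxy for similar visual width."""
    buckets = {}
    for word in vocab:
        buckets.setdefault(len(word), []).append(word)
    return buckets

def encode(rtl_words, buckets):
    """Replace each RTL word with an English stand-in of similar width,
    keeping a reverse mapping so the output can later be decoded."""
    word_to_standin = {}
    standin_to_word = {}
    encoded = []
    for word in rtl_words:
        if word not in word_to_standin:
            candidates = buckets.get(len(word), buckets[max(buckets)])
            # Pick an unused candidate; suffix a counter if the bucket is exhausted.
            stand_in = next((c for c in candidates if c not in standin_to_word), None)
            if stand_in is None:
                stand_in = candidates[0] + str(len(standin_to_word))
            word_to_standin[word] = stand_in
            standin_to_word[stand_in] = word
        encoded.append(word_to_standin[word])
    return encoded, standin_to_word

def decode(parsed_words, mapping):
    """Map the parsed LTR output back to the original RTL words."""
    return [mapping.get(w, w) for w in parsed_words]
```

The key design property is that the transformation is a bijection over distinct words: repeated RTL words receive the same stand-in, so document structure and word frequencies survive the round trip, and the parser only ever sees familiar LTR text.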