Quentin Gallouédec's blog post explores the nuances and pitfalls developers encounter when working with tokenizers in natural language processing. A primary focus is the presence and handling of the Beginning of Sentence (BOS) and End of Sentence (EOS) tokens, which vary significantly across models such as Qwen/Qwen2.5-0.5B and microsoft/Phi-3-mini-128k-instruct. Not every tokenizer defines a BOS token, and even when one is defined it may not actually be inserted during tokenization. Likewise, the EOS token is not added during standard tokenization, though it can appear when a chat template is applied, and that behavior is inconsistent from model to model.

The post also examines what happens when the BOS or EOS token shares its ID with the padding token: masking out padding can then also mask the genuine EOS, which causes problems during training. Another key point is that applying a chat template is not a homomorphism with respect to concatenation, and special tokens further complicate the order in which chat template application and tokenization should happen. Gallouédec stresses updating the EOS token to match any special end-of-turn token used by the chat template, since a mismatch can lead to infinite generation.
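As a quick illustration of the BOS/EOS behavior described above, a minimal sketch using the Hugging Face `transformers` `AutoTokenizer` API can print whether each tokenizer defines these tokens and whether plain tokenization actually inserts them. The model IDs are the two named in the post; the exact output depends on each checkpoint's current tokenizer configuration.

```python
from transformers import AutoTokenizer

for model_id in ["Qwen/Qwen2.5-0.5B", "microsoft/Phi-3-mini-128k-instruct"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    ids = tokenizer("Beyond the horizon.")["input_ids"]

    print(model_id)
    print("  bos_token:", tokenizer.bos_token)  # may be None for some models
    print("  eos_token:", tokenizer.eos_token)
    # A BOS token can be defined in the config yet never actually prepended.
    print("  BOS prepended:", tokenizer.bos_token_id is not None and ids[0] == tokenizer.bos_token_id)
    # Plain tokenization does not append EOS automatically.
    print("  EOS appended: ", tokenizer.eos_token_id is not None and ids[-1] == tokenizer.eos_token_id)
```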
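The non-homomorphism point can also be checked directly: templating a whole conversation is generally not the same string as concatenating the templated pieces. A minimal sketch, assuming an instruction-tuned checkpoint such as Qwen/Qwen2.5-0.5B-Instruct (not named in the summary above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

prompt = [{"role": "user", "content": "What color is the sky?"}]
completion = [{"role": "assistant", "content": "It is blue."}]

# Template the full conversation at once...
whole = tokenizer.apply_chat_template(prompt + completion, tokenize=False)

# ...versus templating each part separately and concatenating the strings.
pieces = (
    tokenizer.apply_chat_template(prompt, tokenize=False)
    + tokenizer.apply_chat_template(completion, tokenize=False)
)

# Typically False: default system prompts, BOS, or end-of-turn markers end up
# duplicated or placed differently, so template application does not
# distribute over concatenation of messages.
print(whole == pieces)
```

This is why it is usually safer to apply the template to the full conversation and split the result afterwards, rather than templating the prompt and completion independently.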
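Finally, the end-of-turn mismatch can be addressed by pointing `eos_token` at the template's end-of-turn marker. A sketch assuming the Qwen/Qwen2.5-0.5B base checkpoint, whose chat template is assumed here to close turns with `<|im_end|>` while its configured `eos_token` is a different token:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

print(tokenizer.eos_token)  # the configured EOS for this checkpoint

# If the chat template closes turns with a different special token (assumed
# here to be "<|im_end|>"), generation that stops on eos_token_id will run
# past the end of the turn. Re-pointing eos_token at that marker fixes it.
if "<|im_end|>" in tokenizer.get_vocab() and tokenizer.eos_token != "<|im_end|>":
    tokenizer.eos_token = "<|im_end|>"

print(tokenizer.eos_token, tokenizer.eos_token_id)
```

When generating with `model.generate`, the model's `generation_config.eos_token_id` generally needs the same update (or `eos_token_id` must be passed explicitly), since stopping is driven by the model's configuration rather than the tokenizer.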