There is no such thing as a tokenizer-free lunch
Blog post from HuggingFace
Tokenization is an essential step in language modeling: it segments text into discrete units that a model can process. Despite its importance, tokenization is often viewed negatively, frequently blamed for failures in language models, which has discouraged interest and research in the area. The blog post argues that all methods, including so-called "tokenizer-free" approaches such as byte-level and dynamic tokenization, inherently perform some form of tokenization, since they still rely on a fixed vocabulary of bytes or characters. The author emphasizes the importance of continued research and engagement with tokenization methods, highlighting their benefits and correcting common misconceptions. The post also addresses a broader tendency in the field to undervalue preliminary steps like data curation and tokenization, which are crucial to building effective language models.
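The core claim, that byte-level models still tokenize, can be made concrete with a minimal sketch (not from the post; the function names are illustrative): a byte-level model maps text to token IDs drawn from a fixed vocabulary of 256 byte values, which is itself a tokenization scheme.

```python
def byte_tokenize(text: str) -> list[int]:
    """Segment text into discrete units: each UTF-8 byte becomes one token ID."""
    return list(text.encode("utf-8"))

def byte_detokenize(ids: list[int]) -> str:
    """Invert the mapping by reassembling the bytes and decoding."""
    return bytes(ids).decode("utf-8")

# The fixed vocabulary that every byte-level "tokenizer-free" model relies on.
VOCAB_SIZE = 256

ids = byte_tokenize("héllo")
# "é" expands to two UTF-8 bytes, so the token sequence is longer
# than the character count: [104, 195, 169, 108, 108, 111]
assert all(0 <= i < VOCAB_SIZE for i in ids)
assert byte_detokenize(ids) == "héllo"
```

Note that the model never sees raw text here either; it sees a sequence of IDs from a closed vocabulary, just as with a subword tokenizer, only with a smaller vocabulary and longer sequences.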