Tokenization — Glossary Aria Research

Extended definition

Tokenization is the process that converts raw text into the sequence of discrete units, the tokens, that a language model actually processes. There is no universal token: the choice of unit is a design decision. Most current models use subword tokenization, a middle ground between operating on whole words, which yields enormous vocabularies and fails on rare words, and operating on characters, which produces overly long sequences. The most widespread algorithm is byte pair encoding, adapted for machine translation by Sennrich and colleagues (2016), which starts from characters and repeatedly merges the most frequent adjacent pair of symbols until the vocabulary reaches a chosen size. Kudo (2018) introduced subword regularization, which uses a unigram language model to sample alternative segmentations of the same text during training and thereby makes the model more robust to segmentation differences. Kudo and Richardson (2018), with SentencePiece, made the process language-independent by training directly on raw text, without pre-segmentation into words. The result is a fixed vocabulary that can cover unseen words by decomposing them into known pieces, provided the base symbols (characters or bytes) cover the input.
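The merge loop of byte pair encoding can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names (learn_bpe, merge_pair, segment) and the toy word frequencies are assumptions made for the example, and the sketch omits the end-of-word marker and the efficiency tricks used by real tokenizers.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a {word: frequency} mapping (toy sketch)."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # max() returns the first maximal pair encountered, so ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges


def segment(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word -> frequency.
words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(words, num_merges=4)
print(merges)                     # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(segment("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is covered by two learned pieces, which is the sense in which a subword vocabulary decomposes the unknown into the known.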

When it applies

Tokenization applies to every stage involving a language model, from training to inference. It applies when estimating cost and context limits: in commercial models, price and context-window size are counted in tokens, not words, so estimating a text’s token count is what lets one predict cost and stay within the limit. It applies when handling morphologically rich languages or non-Latin scripts, where the number of tokens per word can differ greatly from English. It applies when preparing data for fine-tuning, where the tokenizer must be the same as the base model’s. And it applies when debugging behavior: many of a model’s apparent errors are explained by how the text was fragmented before reaching it.
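The sizing arithmetic can be sketched as follows. Everything here is an assumption made for illustration: the token count uses UTF-8 bytes as a stand-in for a byte-level tokenizer with no learned merges, and the price and context-window figures are hypothetical. A real estimate requires the model's own tokenizer and the vendor's published prices.

```python
def count_tokens_bytes(text):
    """Token count under a hypothetical byte-level tokenizer with no merges."""
    return len(text.encode("utf-8"))


def tokens_per_word(text):
    """Average number of (byte-level) tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return count_tokens_bytes(text) / len(words)


def estimate_cost(n_tokens, price_per_million_tokens):
    """Cost of n_tokens at a given price per one million tokens."""
    return n_tokens * price_per_million_tokens / 1_000_000


def fits_context(prompt_tokens, context_window, reserved_for_output=0):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + reserved_for_output <= context_window


# Same two-word greeting in two scripts: Cyrillic letters take 2 bytes each in UTF-8.
print(tokens_per_word("hello world"))   # 5.5
print(tokens_per_word("привет мир"))    # 9.5

# Hypothetical price (2.00 per million tokens) and window (8,000 tokens).
print(estimate_cost(1_500, 2.0))                            # 0.003
print(fits_context(7_500, 8_000, reserved_for_output=1_000))  # False
```

Reserving part of the window for the model's output is a design choice in this sketch: a prompt that fits on its own can still leave too little room for the reply.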

When it does not apply

Subword tokenization cannot be dismissed as a negligible detail: treating it as a black box hides real effects on cost, performance, and fairness across languages. Tokenizers cannot be swapped freely: using a tokenizer different from the one the model saw in training degrades performance, because the token identifiers no longer correspond to what the model learned. Token counts are not a stable measure across models: the same text yields different token counts depending on the tokenizer, so comparing cost across vendors requires using each one’s tokenizer. And tokenization is not a semantically neutral operation: fragmentation can split units of meaning, numbers, and code in ways that affect what the model can do.
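Why a swapped tokenizer breaks the learned correspondence can be shown with two toy vocabularies. Both vocabularies and the helper names (encode, decode) are hypothetical; the point is only that the same identifiers mean different pieces under a different vocabulary.

```python
# Two hypothetical vocabularies with the same pieces but different identifiers.
vocab_a = {"token": 0, "ization": 1, " matters": 2}
vocab_b = {" matters": 0, "token": 1, "ization": 2}


def encode(pieces, vocab):
    """Map already-segmented pieces to their identifiers."""
    return [vocab[piece] for piece in pieces]


def decode(ids, vocab):
    """Map identifiers back to text using the given vocabulary."""
    id_to_piece = {idx: piece for piece, idx in vocab.items()}
    return "".join(id_to_piece[idx] for idx in ids)


pieces = ["token", "ization", " matters"]
ids = encode(pieces, vocab_a)

print(ids)                   # [0, 1, 2]
print(decode(ids, vocab_a))  # tokenization matters
print(decode(ids, vocab_b))  #  matterstokenization
```

A model trained with the first vocabulary has learned what identifier 0 means under that vocabulary; feeding it identifiers produced by the second one presents it with a different text than the one intended.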

Applications by field

Language models: definition of the subword vocabulary that conditions cost, context window, and vocabulary coverage.
Machine translation: the historical origin of byte pair encoding, which solved the problem of rare and out-of-vocabulary words.
Low-resource languages: language-independent tokenizers that handle diverse scripts without pre-segmentation.
Prompt engineering and cost: token estimation to predict price and fit inputs to the context limit.

Common pitfalls

The first pitfall is counting words when what matters is tokens: budget and context limit are measured in tokens, and the ratio between the two varies by language. The second is swapping the tokenizer between training and inference, breaking the correspondence the model learned. The third is assuming token counts are comparable across models, when each tokenizer segments in its own way. The fourth is ignoring cross-language bias: texts in non-English languages tend to spend more tokens per word, making the same content costlier. The fifth is failing to inspect the segmentation when debugging: errors with numbers, dates, or code often come from how the tokenizer split the input, not from a reasoning failure of the model.
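Inspecting a segmentation can be as simple as printing the pieces. The sketch below uses greedy longest-match against a hypothetical toy vocabulary as a stand-in; it is not how byte pair encoding or unigram tokenizers actually segment, and with a real model one would print the pieces produced by that model's own tokenizer.

```python
def greedy_segment(text, vocab):
    """Split text by repeatedly taking the longest vocabulary piece that matches."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches here: fall back to a single character.
            pieces.append(text[i])
            i += 1
    return pieces


# Hypothetical vocabulary that happens to contain some digit pairs but not others.
toy_vocab = {"20", "24", "15", "0", "1", "2", "5", "-"}

print(greedy_segment("2024-01-15", toy_vocab))
# ['20', '24', '-', '0', '1', '-', '15']
```

In this toy case the year is split into two pieces, the month into two single digits, and the day stays whole, so the three parts of one date reach the model in three different shapes.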