Tokenizer and BPE
People new to LLMs often ask why the model does not read text directly as whole words. The reason is that natural language is full of new words, inflections, compounds, and long-tail expressions, so any fixed word list is always incomplete. A tokenizer's job is to split text into more stable, reusable tokens, and BPE (byte pair encoding) is one of the classic methods for learning subword units.
Watch BPE Grow Subwords Step by Step
What follows is an educational BPE demonstration, not an industrial-grade tokenizer trainer. You can switch corpora, step through the merge process, and watch frequent character pairs gradually grow into longer subwords.
Each step counts how often every pair of adjacent symbols occurs in the corpus as it is currently segmented. BPE then merges the most frequent pair into a single new symbol, adds that symbol to the vocabulary, and repeats.
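The counting-and-merging loop can be sketched in a few lines of Python. This is a minimal teaching sketch, not the code behind the demo: the names learn_bpe and merge_pair are made up for illustration, words are split on their own with no end-of-word marker, and ties are broken in favor of the pair that was counted first. Real trainers differ on all of these points.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Return `symbols` with every non-overlapping occurrence of `pair` fused."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy sketch)."""
    # Start from single characters; identical words are stored once with a count.
    word_freqs = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; on a tie, the pair counted first is kept.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Re-segment the corpus with the new symbol before the next round.
        word_freqs = {merge_pair(s, best): f for s, f in word_freqs.items()}
    return merges


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe(corpus, 4))
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

On this toy corpus the suffix "est" is assembled before "low" because the pairs e+s and es+t each occur 9 times, while l+o occurs only 7 times.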
The Same Word Gets Split Differently at Different Merge Stages
Enter a word or phrase to see how the tokenizer would split it after learning only the first few rounds of merges. As you step through more rounds, the tokens change from single characters into longer, more stable subwords.
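To see the same effect outside the demo, the sketch below replays only the first few merges on a word. The function name segment and the merge list MERGES are hypothetical; the list is simply an example of what a toy trainer might learn, written in the order the merges were learned.

```python
def segment(word, merges, rounds):
    """Split `word` using only the first `rounds` merges, in learned order."""
    symbols = list(word)  # round 0: pure characters
    for left, right in merges[:rounds]:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, earliest merge first.
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

for rounds in range(len(MERGES) + 1):
    print(rounds, segment("lowest", MERGES, rounds))
# 0 ['l', 'o', 'w', 'e', 's', 't']
# 1 ['l', 'o', 'w', 'es', 't']
# 2 ['l', 'o', 'w', 'est']
# 3 ['lo', 'w', 'est']
# 4 ['low', 'est']
```

Applying merges strictly in learned order matters: a later merge such as es+t can only fire once the earlier merge e+s has produced the symbol "es".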
Why LLMs Prefer Subwords Over Fixed Dictionaries
Better Handling of New Words
Even if the model has never seen a complete word, it can still process that word as long as it can be split into known subwords, instead of collapsing it into a single out-of-vocabulary (OOV) placeholder.
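The contrast can be shown with two tiny lookups. Both vocabularies below are invented for illustration. To keep the example short, the subword side uses greedy longest-match segmentation rather than replaying BPE merges, which is a simplification that yields the same kind of result.

```python
import string

# Hypothetical vocabularies, for illustration only.
WORD_VOCAB = {"low", "lower", "newest", "widest"}
SUBWORD_VOCAB = {"low", "est", "er", "new", "wid"} | set(string.ascii_lowercase)


def word_level(word):
    """Whole-word lookup: anything unseen becomes a single <unk>."""
    return [word] if word in WORD_VOCAB else ["<unk>"]


def subword_level(word):
    """Greedy longest-match split over the subword vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in SUBWORD_VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append("<unk>")  # a character outside the vocabulary
            start += 1
    return pieces


print(word_level("lowest"))     # ['<unk>']
print(subword_level("lowest"))  # ['low', 'est']
```

The unseen word "lowest" carries no information at the word level, while the subword split keeps both the stem and the suffix.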
More Efficient Vocabulary
Storing all possible words would cause extreme vocabulary bloat. Subwords achieve a better balance between vocabulary size and expressive power.
More Reusable Structures
Words such as play / player / playing share the fragment "play", which makes it easier for the model to reuse statistical patterns across them.
Prepares for Subsequent Layers
The Tokenizer ultimately produces a sequence of tokens. The subsequent embedding and attention layers are built upon these tokens.
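As a rough sketch of that handoff, each token is mapped to an integer ID, and each ID selects one row of an embedding table. Everything here is an assumption made for illustration: the vocabulary, the 4-dimensional vectors, and the random initial values. In a real model the table is a learned parameter.

```python
import random

# Hypothetical vocabulary: token -> integer ID.
VOCAB = {"<unk>": 0, "low": 1, "est": 2, "er": 3}
EMBED_DIM = 4  # toy size

# One vector per vocabulary entry; random here, learned in a real model.
rng = random.Random(0)
EMBEDDING_TABLE = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def encode(tokens):
    """Map tokens to IDs, falling back to <unk> for unknown tokens."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]


def embed(ids):
    """Look up one embedding row per ID."""
    return [EMBEDDING_TABLE[i] for i in ids]


ids = encode(["low", "est"])
vectors = embed(ids)
print(ids)                            # [1, 2]
print(len(vectors), len(vectors[0]))  # 2 4
```

The attention layers never see the original characters, only this sequence of vectors, which is why the choice of token boundaries affects everything downstream.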