Page 02 · SimLabs LLM Visual

Characters, Tokens & Vectors

Large language models don't "read" text directly. Text is first split into tokens that the model can process, and each token is then mapped to a vector. On this page, you can enter a sentence, switch examples, and click tokens to see, step by step, how text becomes a representation the machine can compute with.

Enter a Sentence and Watch It Get Split

To keep this introduction simple, the demo uses "educational tokenization" rather than a full industrial-grade tokenizer. The goal is to show that text doesn't enter the model directly; it is first split into more stable units.

Character View

First, view text as a stream of fine-grained characters. Characters are the smallest unit a human reader intuitively sees, but not necessarily the best unit for the model to process.
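
The character view can be sketched in a few lines of Python. This is only an illustration, assuming that "character" means one Unicode code point; the function name `to_characters` and the sample sentence are made up for this example.

```python
def to_characters(text: str) -> list[str]:
    """Split text into individual Unicode code points (a simple character view)."""
    return list(text)


sentence = "我爱NLP"  # hypothetical sample input
chars = to_characters(sentence)

print(chars)                          # ['我', '爱', 'N', 'L', 'P']
print(len(chars))                     # 5 characters
print(len(sentence.encode("utf-8")))  # 9 bytes: each CJK character takes 3 bytes in UTF-8
```

Even at this level there is more than one "smallest unit": the same sentence is 5 characters but 9 bytes, which hints at why characters are not automatically the most convenient unit for a model.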

Token View

For more stable language processing, models often split text into tokens first. Click any token below to see its corresponding "conceptual vector".
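
One possible "educational tokenization" is sketched below. The splitting rule is an assumption chosen for illustration, not the rule used by this page or by any production tokenizer: a run of ASCII letters or digits becomes one token, and every other non-space character (a CJK character, a punctuation mark) becomes its own token.

```python
import re

# Hypothetical rule set for illustration only:
#   - a run of ASCII letters/digits is one token
#   - any other non-whitespace character is a token by itself
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|\S")


def educational_tokenize(text: str) -> list[str]:
    """Split text into tokens using the toy rule above; whitespace is dropped."""
    return TOKEN_PATTERN.findall(text)


print(educational_tokenize("我 love LLMs!"))  # ['我', 'love', 'LLMs', '!']
```

Real tokenizers typically learn subword units from data, so their output will generally differ from this sketch.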

Build Intuition First: Before text enters the model, it's usually converted from a "continuous string" into a "discrete token sequence". What actually enters neural network computation is not the token text itself, but the vector representation corresponding to the token.
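
The path from token text to vector can be sketched as two lookups: token to integer ID, then ID to a row in an embedding table. Everything here is a toy assumption: the vocabulary is built from the example tokens themselves, `EMBED_DIM` is arbitrary, and the table is filled with random numbers rather than learned values.

```python
import random

tokens = ["我", "love", "LLMs", "!"]  # hypothetical token sequence

# Step 1: a toy vocabulary that gives each distinct token an integer ID.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# Step 2: a toy embedding table with one row of EMBED_DIM floats per vocabulary entry.
# A real model learns these values during training; here they are random placeholders.
EMBED_DIM = 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab]

# Step 3: the model computes on these vectors, not on the token strings.
vectors = [embedding_table[i] for i in token_ids]

for tok, tid, vec in zip(tokens, token_ids, vectors):
    print(tok, tid, [round(x, 2) for x in vec])
```

The integer ID is what makes a token "indexable": it is simply a row number in the table.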

Click a Token to See Its Vector Representation

Current Token: 我

The "conceptual vector" shown below is meant to build the intuition that tokens are mapped to numerical vectors; it does not reproduce actual embedding values from any specific model.
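
A conceptual vector of this kind could be produced, for example, by hashing the token into a fixed list of numbers. This is a guess at one way to build such a demo, not a description of how this page computes its values; the function name `conceptual_vector` and the default dimension are hypothetical.

```python
import hashlib


def conceptual_vector(token: str, dim: int = 8) -> list[float]:
    """Map a token to a deterministic pseudo-vector with values in [-1, 1].

    The numbers carry no meaning; they only show that a token becomes
    a fixed-length list of floats. Supports dim up to 32 (SHA-256 yields 32 bytes).
    """
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32")
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Scale each byte (0..255) into the range [-1, 1].
    return [round(digest[i] / 127.5 - 1.0, 3) for i in range(dim)]


print(conceptual_vector("我"))
```

Unlike learned embeddings, vectors built this way do not place related tokens near each other; they are stable placeholders only.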

What Problem Does This Layer Solve?

  • Raw characters alone do not directly support high-dimensional semantic computation.
  • Tokens convert text into stable, indexable discrete units.
  • Vectors enable the model to express semantic relationships using distance, dot product, and linear transformations.
  • The later Q, K, V (query, key, value) operations continue to transform and match within this vector space.
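
The points above can be made concrete with a small sketch of dot product, cosine similarity, and a linear transformation. The three vectors and the two projection matrices are hand-picked toy values, not taken from any real model, and `W_q` and `W_k` are hypothetical stand-ins for learned query and key projections.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def linear(matrix, vec):
    """Apply a linear transformation: multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


# Hand-picked toy vectors (assumption): "cat" and "kitten" point in similar directions.
cat = [1.0, 1.0, 0.0]
kitten = [1.0, 0.8, 0.0]
car = [0.0, 0.0, 1.0]

print(round(cosine_similarity(cat, kitten), 3))  # close to 1: similar
print(round(cosine_similarity(cat, car), 3))     # 0.0: unrelated in this toy space

# Toy query/key projections (assumption): each maps a 3-d vector to a 2-d vector.
W_q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_k = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

q = linear(W_q, cat)     # query vector derived from "cat"
k = linear(W_k, kitten)  # key vector derived from "kitten"
score = dot(q, k)        # matching score between the query and the key
print(q, k, round(score, 3))
```

In this sketch, "matching" is nothing more than a dot product between two transformed vectors, which is the sense in which later operations stay inside the vector space.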

Common Misconception: Tokens are not necessarily equal to "words", and vectors are not "dictionary definitions". They are representations used by the model for computation and generalization.

One Sentence Summary: Text must first be split into tokens, then mapped into vectors. Almost all subsequent LLM computations occur on these vector representations.