Word Vector Matrix
Self-attention by itself cannot tell "the first word" from "the third word". If tokens are treated only as an unordered set of vectors, "cat bites dog" and "dog bites cat" contain exactly the same vectors, so nothing in the input says who bit whom. The job of positional encoding is to explicitly inject the notion of order into the model.
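The order-blindness described above can be checked numerically. The following is a minimal sketch, not the implementation behind this demo: it assumes a single attention head with identity query/key/value projections, and the 4-dimensional word vectors are random placeholders.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (a simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
# Hypothetical word vectors; real models learn these.
emb = {w: rng.normal(size=4) for w in ("cat", "bites", "dog")}

s1 = ["cat", "bites", "dog"]
s2 = ["dog", "bites", "cat"]
out1 = self_attention(np.stack([emb[w] for w in s1]))
out2 = self_attention(np.stack([emb[w] for w in s2]))

# Without position information, "cat" gets the same output vector
# whether it is the first word (s1) or the last word (s2).
cat_same = np.allclose(out1[0], out2[2])
print("cat output identical in both sentences:", cat_same)
```

Reordering the input only reorders the output rows, so each token's representation carries no trace of where the token stood.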
First switch the sentence order, then click on a position. The left side shows word vectors, position vectors, and final input vectors. The right side explains what changes occur at that position.
Word vectors for the same token depend only on the word itself, not on its position in the sequence.
Different positions receive different position signals, so the same token gets a different position vector when it appears at a different position.
The model actually reads the sum of both vectors, not just the word vector alone.
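A small sketch of that sum, under stated assumptions: `word_table`, `pos_table`, and `model_input` are hypothetical names, the word vectors are random placeholders, and the position table holds illustrative numbers rather than the values used in this demo.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical word-vector lookup: one 4-dimensional vector per token.
word_table = {w: rng.normal(size=4) for w in ("cat", "bites", "dog")}
# Illustrative position vectors, one row per position (values are made up).
pos_table = np.array([
    [0.00,  1.00, 0.00, 1.00],
    [0.84,  0.54, 0.01, 1.00],
    [0.91, -0.42, 0.02, 1.00],
])

def model_input(tokens):
    """Final input = word vector + position vector, element-wise per position."""
    words = np.stack([word_table[t] for t in tokens])
    return words + pos_table[: len(tokens)]

a = model_input(["cat", "bites", "dog"])
b = model_input(["dog", "bites", "cat"])

# "bites" stays at position 1 in both sentences, so its input is unchanged.
print(np.allclose(a[1], b[1]))
# "cat" moves from position 0 to position 2, so its input changes
# even though its word vector is the same.
print(np.allclose(a[0], b[2]))
```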
This experiment uses a simplified 4-dimensional positional encoding to demonstrate the numerical patterns at different positions. Drag the position slider to observe how some dimensions change rapidly while others change slowly, allowing the model to perceive both local and long-range positional relationships.
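One common way to get this mix of fast and slow dimensions is the sinusoidal scheme. The sketch below assumes that scheme, with the conventional base of 10000; the text above does not state which formula the demo uses, so treat `sinusoidal_encoding` and its parameters as illustrative.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim=4, base=10000.0):
    """Sinusoidal positional encoding (assumed scheme, see lead-in)."""
    positions = np.arange(num_positions)[:, None]     # shape (P, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)     # one frequency per sin/cos pair
    angles = positions * freqs                        # shape (P, dim / 2)
    pe = np.empty((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_encoding(8)
for pos, row in enumerate(pe):
    print(pos, np.round(row, 3))
```

With `dim=4`, dimensions 0 and 1 use frequency 1 and swing noticeably between neighbouring positions, while dimensions 2 and 3 use frequency 0.01 and drift slowly across the whole sequence.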
Positional encoding is not an arbitrary numbering scheme but a vector whose values vary smoothly with position. Adjacent positions receive different vectors, yet the values change continuously from one position to the next.
Without positional information, the model sees "a bag of tokens" rather than an ordered sequence.
The model input is not a choice between word vectors and position vectors: the two are summed, and that sum is what the subsequent attention calculations receive.
Positional encoding is not just a nice-to-have trick; it is one of the fundamental conditions that allow Transformers to handle sequential information.