Page 11 · SimLabs LLM Visual

Self-Attention Intuition

If each token looked only at itself, language understanding would be very fragile. Pronouns would not know what they refer to, adjectives would not know what they modify, and verbs would not know who or what they act on. The first intuition behind Self-Attention is that each token, before updating its own representation, looks across the whole sentence to find which other tokens matter most to it.

First build linguistic intuition, then see the weight distribution, and finally connect to Q × K^T.

Switch scenarios to see which tokens the current token is "looking at"

Choose a linguistic scenario, then switch the current focus token. The page highlights the words in the sentence that receive the most attention and displays the corresponding row of attention weights. You can also use auto-play to let each token become the "current query" in turn.

Current Focus: Pronoun "它" ("it")
Current Attention Distribution

This row is not the "final correct answer." It shows how strongly the current token draws on each other token in the sentence when updating its own representation.
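
As a rough illustration of what one such row looks like numerically, here is a minimal sketch that turns raw relevance scores into a row of attention weights with a softmax. The English sentence and the scores are hypothetical, picked by hand for illustration; they are not taken from the page's scenarios or from any trained model.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sentence and hand-picked scores for the query token "it".
tokens = ["The", "cat", "skipped", "dinner", "because", "it", "was", "full"]
raw_scores = [0.1, 2.5, 0.3, 1.0, 0.2, 0.8, 0.1, 1.2]

attention_row = softmax(raw_scores)

# Print the row as text bars, similar in spirit to the attention bars on the page.
for token, weight in zip(tokens, attention_row):
    bar = "#" * round(weight * 40)
    print(f"{token:>8} {weight:.2f} {bar}")
```

With these made-up scores, "cat" receives the largest share, but every other token still receives some weight.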

Without Context vs. With Context

Put "who is looking at whom" into a small table

The table below gathers the attention rows of several key tokens in the current scenario. Each row is a token issuing the query, and each column is a token being looked at. The highlighted row is the query you currently have selected.

Key Understanding: Self-Attention does not keep only the single strongest connection; it forms a full row of weights. The model usually draws on multiple positions at once, each with a different strength.
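
The difference between "one strongest connection" and "a full row of weights" can be sketched in a few lines. The token list, the weights, and the helper names `strongest_only` and `positions_used` are all hypothetical and exist only for this illustration; the rows are written so that each one sums to 1.

```python
# Hypothetical attention rows (each sums to 1); not taken from the page's widget.
tokens = ["cat", "dinner", "it", "full"]
attention = {
    "it":   [0.55, 0.20, 0.10, 0.15],
    "full": [0.45, 0.25, 0.10, 0.20],
}

def strongest_only(row):
    """Hard selection: keep just the single largest weight."""
    best = max(range(len(row)), key=lambda i: row[i])
    return [1.0 if i == best else 0.0 for i in range(len(row))]

def positions_used(row, threshold=0.05):
    """Tokens that receive at least `threshold` of the attention."""
    return [tokens[i] for i, w in enumerate(row) if w >= threshold]

# Rows are the tokens issuing the query, columns are the tokens being looked at.
print("query \\ key " + " ".join(f"{t:>7}" for t in tokens))
for query, row in attention.items():
    print(f"{query:>11} " + " ".join(f"{w:7.2f}" for w in row))

soft_refs = positions_used(attention["it"])                  # full distribution
hard_refs = positions_used(strongest_only(attention["it"]))  # single strongest link
print("soft:", soft_refs)
print("hard:", hard_refs)
```

The hard version keeps only one token, while the soft row draws on every position to a different degree, which is the behavior described above.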

How this page relates to the QKV formula

Current Focus Token

It corresponds to one Query row in the formula introduced later. Think of it as saying: "I am about to update myself, so I need to find which parts of the context are relevant."

Other Attended Tokens

They correspond to the Key and Value positions in the later formula. The Key is used to judge relevance, and the Value supplies the actual information that is carried back to the current token.

This Row of Attention Bars

It is the intuitive version of one row of softmax(QK^T). A higher value means the current token draws more heavily on that position.
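
To connect the three pieces, the following minimal sketch computes one attention row for a single query vector and then uses it to mix the Value vectors. The 2-dimensional vectors and the function name `attend` are hypothetical. The division by the square root of the key dimension comes from the standard scaled dot-product formulation and is an addition to the softmax(QK^T) shorthand used on this page.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """One row of scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # One row of QK^T, scaled by sqrt(d_k) (standard formulation, assumed here).
    scores = [dot(query, k) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)  # the row of attention bars
    # Weighted sum of Value vectors: the information actually carried back.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Hypothetical 2-dimensional vectors for three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights], [round(x, 3) for x in output])
```

Here the query aligns equally with the first and third keys, so those two positions share most of the weight, and the output leans toward their Value vectors.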

Next Page: Enter the Formula

If you accept that a token looks back at the whole sentence and forms a row of weights, then Q, K, V and the score matrix will feel much more natural.

One-sentence summary: Self-Attention does not have a token look aimlessly at the whole sentence; it lets the token dynamically pick out the contextual clues most helpful for its current update.