Transformer Self-Attention Mechanism Demo
Deconstruct the Q, K, V, score matrix, Softmax, and final context vector in self-attention through matrix visualization, step-by-step explanations, and dynamic highlighting — turning abstract formulas into observable computations.
This page is part of the SimLabs LLM Visualization series.
We have 3 words (e.g., "The", "cat", "sat"), each represented by a 4-dimensional vector. Together they form the input matrix $X \in \mathbb{R}^{3 \times 4}$: each row is one word, and each column is one dimension of the word vector.
The input matrix $X$ is multiplied by three learnable weight matrices $W^Q, W^K, W^V$ to produce the Query, Key, and Value matrices. The key insight is not "computing the same thing three times," but rather letting the same word take on three distinct roles.
What am I looking for right now?
Q determines the "question" issued by the current word. It tells the model: what kind of information should I focus on finding.
- More like a "search condition"
- Decides who I should pay attention to
What labels do I carry?
K determines how this word can be found by others. Other words will use their own Q to match against my K.
- More like an "index label"
- Decides whether I am worth looking at
If you pay attention to me, what content do I provide?
V is the actual information body that will be taken and aggregated in a weighted sum. Words with high attention will contribute more of their own V.
- More like the "body content"
- Decides what remains in the final output
Why does the same word need to become three different vectors?
Select a word to see how it projects from the original input $X$ into Q, K, and V respectively. This makes it easier to see that the same word simultaneously plays three roles in attention: "questioner," "retrieved entity," and "information provider."
Original Input X
This is the word's basic semantic representation, not yet differentiated into the three uses of "searching," "matching," and "providing content."
Becomes Q
Rewrites it into a vector representing "what features am I looking for right now," used to actively ask questions.
Becomes K
Rewrites it into a vector representing "what labels do I carry," waiting for others to match against.
Becomes V
Rewrites it into a vector representing "if someone pays attention to me, what information should I contribute."
Connecting the subsequent steps: the selected word first takes the dot product of its own Q with the K of every word in the sentence to produce scores, then uses the resulting attention weights to combine the V of each word, finally forming its output vector Z.
Multiply the Query matrix $Q$ by the transpose of the Key matrix, $K^T$, to obtain the raw attention score matrix. Each element $\text{Scores}_{ij}$ represents the degree of attention (relevance) that the $i$-th word pays to the $j$-th word.
Cell-by-cell look at how $QK^T$ is multiplied
The highlighted row of $Q$ is the Query vector of one word, and the highlighted column of $K^T$ is the Key vector of another word after transposition. Their dot product lands in the corresponding target cell of Scores.
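The cell-by-cell view can be sketched in a few lines of plain Python. The Q and K values below are hypothetical toy numbers chosen for illustration (3 words, $d_k = 3$), not necessarily the values used by the demo:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical Query and Key matrices: one row per word ("The", "cat", "sat").
Q = [[1.0, 0.0, 2.0], [2.0, 2.0, 2.0], [2.0, 1.0, 3.0]]
K = [[0.0, 1.0, 1.0], [4.0, 4.0, 0.0], [2.0, 3.0, 1.0]]

# Scores[i][j] is the dot product of word i's Query with word j's Key,
# which is exactly entry (i, j) of Q multiplied by the transpose of K.
scores = [[dot(q, k) for k in K] for q in Q]

# One highlighted cell: the Query of "cat" (row 1) against the Key of "sat" (row 2).
cell = dot(Q[1], K[2])
```

Each row of `scores` belongs to one querying word, so reading across a row shows how strongly that word matches every word in the sentence.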
First, scale the score matrix by dividing by $\sqrt{d_k}$ (here $d_k = 3$, so divide by $\sqrt{3} \approx 1.732$). This keeps the scores from growing so large that Softmax saturates and its gradients vanish. Then apply the Softmax function to each row, converting the scores into a probability distribution (each row sums to 1), yielding the final attention weight matrix.
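A minimal sketch of the scale-then-Softmax step, assuming a hypothetical $3 \times 3$ score matrix and $d_k = 3$. Subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result:

```python
import math

def softmax(row):
    """Convert a list of scores into weights that sum to 1."""
    m = max(row)  # shift by the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 3
# Hypothetical raw scores, one row per querying word.
scores = [[2.0, 4.0, 4.0], [4.0, 16.0, 12.0], [4.0, 12.0, 10.0]]

# Scale every score by sqrt(d_k), then normalize each row independently.
scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
weights = [softmax(row) for row in scaled]
```

Equal scores in a row receive equal weights, and the largest score in a row always receives the largest weight.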
Multiply the attention weight matrix by the value matrix $V$, performing a weighted sum. The resulting output matrix $Z$ is the new word vector representation that integrates the contextual information of the entire sentence. Each word's new representation contains information about its attention to other words.
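The weighted sum can be illustrated as follows. The attention weights and the Value matrix here are hypothetical round numbers picked so the arithmetic is easy to follow by hand:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical attention weights: each row sums to 1.
weights = [
    [0.0, 0.5, 0.5],  # "The" looks equally at "cat" and "sat"
    [0.0, 1.0, 0.0],  # "cat" looks only at itself
    [0.0, 0.9, 0.1],  # "sat" looks mostly at "cat"
]
# Hypothetical Value matrix: one row per word.
V = [[1.0, 2.0, 3.0], [2.0, 8.0, 0.0], [2.0, 6.0, 3.0]]

# Each output row is a weighted average of the rows of V.
Z = matmul(weights, V)
```

Here the first output row is the midpoint of the Value vectors of "cat" and "sat", and the second output row is simply the Value vector of "cat".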
Transformer Core: Self-Attention Mechanism Principle Analysis
Introduction: From Word to Vector
Computers cannot directly understand text; they can only process numbers. Therefore, we need to convert each word into a numerical vector (Embedding). For example, in the sentence "The cat sat," each of the three words is represented as a 4-dimensional vector, forming the input matrix $X \in \mathbb{R}^{3 \times 4}$ (3 words, each 4-dimensional).
1. Core Magic: Self-Attention Mechanism
Traditional sequence models (such as RNN/LSTM) process text one word at a time, in order. The Transformer's key innovation is to drop this step-by-step recurrence so that all words in a sentence can attend to each other simultaneously. This is the self-attention mechanism.
1.1 Core Roles: Q, K, V
The core of the self-attention mechanism is to multiply the input matrix $X$ by three learnable weight matrices respectively, obtaining three new matrices:
- Q (Query): What features does the current word want to find? (For example, "bank" acts as the subject querying other words to determine whether it means "financial institution" or "riverbank.")
- K (Key): What feature labels does each word in the sentence carry? Used to match with the Query.
- V (Value): If a match is found, what specific information content can this word actually provide.
Formulas for generating Q, K, V:
$$ Q = XW^Q, \quad K = XW^K, \quad V = XW^V $$
where $W^Q, W^K, W^V \in \mathbb{R}^{d_{model} \times d_k}$ are learnable weight matrices.
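The three projections can be sketched as three matrix multiplications. All numbers below are hypothetical: the input $X$ and the weight matrices are hand-picked toy values with $d_{model} = 4$ and $d_k = 3$, whereas in a real model the weights are learned during training:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical input: 3 words, each a 4-dimensional vector.
X = [
    [1.0, 0.0, 1.0, 0.0],  # "The"
    [0.0, 2.0, 0.0, 2.0],  # "cat"
    [1.0, 1.0, 1.0, 1.0],  # "sat"
]

# Hypothetical weight matrices, each of shape d_model x d_k = 4 x 3.
W_Q = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
W_K = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
W_V = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

# The same X is projected three times, giving each word three roles.
Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)
```

Each result has shape $3 \times 3$: one row per word, one column per dimension of $d_k$.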
A quick way to remember Q / K / V:
Q is like a search term, K is like the label attached to each word, and V is like the actual content behind the label.
The current word first uses its own Q to scan the K of all words in the sentence, deciding "who to look at"; then it aggregates the V of those words by weight to form a new contextual representation.
1.2 Mathematical Derivation Formula
The entire self-attention computation can be condensed into an elegant matrix formula:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
1.3 Step-by-Step Formula Breakdown
- Step 1: Compute relevance scores $QK^T$
Multiply the Query matrix $Q$ by the transpose of the Key matrix $K$, which takes the dot product of every word's Query with every word's Key. The larger the dot product, the better the features of the two words match, indicating higher relevance. The result is a $3 \times 3$ scoring matrix, where $\text{Scores}_{ij}$ represents the degree of attention the $i$-th word pays to the $j$-th word.
- Step 2: Scale
Divide by $\sqrt{d_k}$ (in this demo $d_k = 3$, so divide by $\sqrt{3} \approx 1.732$). If the components of a Query and a Key are roughly independent with zero mean and unit variance, their dot product has variance of about $d_k$, so it grows with the dimension; dividing by $\sqrt{d_k}$ brings the variance back to about 1. Without this step, large scores would push Softmax into its saturated region, where its gradient vanishes.
- Step 3: Softmax Normalization
Apply the Softmax operation to each row of the scaled scoring matrix, converting it into a probability distribution (each row sums to 1). This represents the attention weights the current word assigns to all words in the sentence.
- Step 4: Weighted Sum
Multiply the attention weight matrix just computed by the value matrix $V$, performing a weighted sum. The final output $Z$ is a brand-new word vector representation that integrates the contextual information of the entire sentence.
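The four steps above can be combined into one small function. This is a dependency-free sketch using plain Python lists rather than an optimized implementation, and the Q, K, V values are hypothetical toy numbers:

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]                # transpose of K
    scores = matmul(Q, K_T)                             # Step 1: relevance scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # Step 2: scale
    weights = [softmax(row) for row in scaled]          # Step 3: row-wise softmax
    return matmul(weights, V)                           # Step 4: weighted sum

# Hypothetical toy matrices (3 words, d_k = 3).
Q = [[1.0, 0.0, 2.0], [2.0, 2.0, 2.0], [2.0, 1.0, 3.0]]
K = [[0.0, 1.0, 1.0], [4.0, 4.0, 0.0], [2.0, 3.0, 1.0]]
V = [[1.0, 2.0, 3.0], [2.0, 8.0, 0.0], [2.0, 6.0, 3.0]]

Z = attention(Q, K, V)
```

Because every row of attention weights is a probability distribution, each row of `Z` is a weighted average of the rows of `V`, so every output value stays within the range spanned by the corresponding column of `V`.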
2. Multi-Head Attention
In an actual Transformer, multiple attention heads (Multi-Head) are used. Each head uses different $W^Q, W^K, W^V$ weight matrices to understand the sentence from different subspaces (perspectives):
- Some heads may focus on grammatical structure (subject-verb-object relationships)
- Some heads may focus on semantic similarity (synonyms, near-synonyms)
- Some heads may focus on tense and modality
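A rough sketch of how several heads fit together, assuming the standard arrangement in which each head's output is concatenated and then mixed by an output projection (called `W_O` here; it is not described in the text above and is included as an assumption). The weights are random placeholders, not trained values, and the sizes (`d_model = 4`, two heads) are chosen only to keep the example small:

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Placeholder for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    # Every head projects X with its own W_Q, W_K, W_V and attends independently.
    outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the heads' outputs word by word, then mix them with W_O.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
d_model, n_heads = 4, 2
d_k = d_model // n_heads
X = random_matrix(3, d_model, rng)  # 3 words
heads = [
    tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
    for _ in range(n_heads)
]
W_O = random_matrix(n_heads * d_k, d_model, rng)

Z = multi_head_attention(X, heads, W_O)
```

The output has the same shape as the input (3 words, 4 dimensions each), which is what lets attention layers be stacked.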
3. Positional Encoding
The self-attention mechanism itself has no notion of order (similar to a bag-of-words model): shuffling the input words simply shuffles the outputs in the same way, so on its own it cannot distinguish "cat bites dog" from "dog bites cat." Therefore, the Transformer must inject positional information through positional encoding. The common method uses sine and cosine functions:
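A commonly used form, assumed here to be the one intended, is the sinusoidal encoding, where $pos$ is the word's position in the sentence and $i$ indexes pairs of dimensions:

$$ PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$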
The positional encoding is added to the word embedding vector to form the final input representation.
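A small sketch of this addition, assuming the sinusoidal scheme in which even dimensions use sine and odd dimensions use cosine, with wavelengths controlled by a base of 10000. The embedding values are hypothetical:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Hypothetical word embeddings for "The", "cat", "sat".
embeddings = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Final input representation: embedding plus the encoding of its position.
inputs = [
    [e + p for e, p in zip(emb, positional_encoding(pos, len(emb)))]
    for pos, emb in enumerate(embeddings)
]
```

After this step, the same word placed at two different positions no longer has the same input vector, which gives attention a way to tell the positions apart.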
4. Feed-Forward Network & Residual Connection
After the self-attention layer, the Transformer also passes through a Feed-Forward Network:
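In its usual form, assumed here, this is two linear layers with a ReLU activation in between, applied to each position independently. $W_1, b_1, W_2, b_2$ denote the learnable weights and biases:

$$ \text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2 $$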
Meanwhile, to prevent vanishing gradients and accelerate training, the Transformer uses residual connections and layer normalization:
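In the arrangement commonly described for the original Transformer, assumed here, each sub-layer (self-attention or the feed-forward network) is wrapped as follows, where $\text{Sublayer}(x)$ stands for the sub-layer's own output:

$$ \text{Output} = \text{LayerNorm}\left(x + \text{Sublayer}(x)\right) $$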
5. Summary
The core innovation of the Transformer lies in the self-attention mechanism, which enables the model to process the entire sequence in parallel and dynamically attend to different parts of the sequence. Through the ingenious design of the Q, K, V matrices, the Transformer achieves efficient context modeling, becoming the cornerstone of modern NLP (such as GPT, BERT) and multimodal models (such as CLIP, Stable Diffusion).
Self-Attention Mechanism: Why It Is Effective
The self-attention mechanism is powerful because it addresses two core shortcomings of recurrent networks (such as RNN, LSTM) when processing sequence data: the difficulty of capturing long-distance dependencies and the bottleneck of sequential computation. The following breaks down the roots of its effectiveness along five dimensions.
1. Solving the "Forgetting" Problem: Long-Distance Dependencies
In RNNs or LSTMs, information is passed sequentially through the hidden state; the longer the sentence, the more information from the beginning attenuates, and during training the gradients flowing back to the early words tend to vanish, causing the model to "forget the beginning."
- Self-Attention Advantage: Each word can directly interact and compute with all words in the sequence (including the very first words).
- Physical Meaning: No matter how far apart two words are, self-attention can establish a "shortcut connection" to directly measure their relevance.
For example, in the sentence: "The person who was walking the dog in the park wore a red coat." Self-attention allows "person" and "red coat" to be directly associated, unaffected by the long intervening phrase.
2. Parallel Computation and the "God's-Eye View"
RNNs must compute sequentially by time step (processing the first word before the second), limiting GPU parallel capability.
- Self-Attention Advantage: It can receive the entire sentence at once, simultaneously computing correlations between all words through matrix operations.
- Physical Meaning: It is equivalent to having a "God's-eye view" at the very first step, seeing the full picture of the sentence rather than crawling step by step. This fully leverages the parallel capabilities of modern hardware, significantly boosting training speed.
3. Dynamic Weight Allocation: Focusing on What Matters
The convolutional kernel parameters of Convolutional Neural Networks (CNNs) are static; regardless of how the input content changes, the way features are extracted is fixed.
- Self-Attention Advantage: Attention weights are dynamically generated, entirely dependent on the current input content.
- Physical Meaning: The model can adjust its focus based on context. For example, when processing the pronoun "it," self-attention can assign higher weights to the preceding nouns it may refer to (such as "apple" or "car"), aiding disambiguation. This dynamic nature gives the model considerable flexibility.
4. Multiple Representations: Building a Rich Semantic Map
The Transformer uses Multi-Head Attention, meaning multiple sets of self-attention computations are executed in parallel.
- Self-Attention Advantage: Different "heads" can learn different levels of dependency relationships.
- Physical Meaning:
- Some heads focus on grammatical relationships (such as subject and predicate);
- Some heads focus on coreference relationships (such as "Xiao Ming" and "he");
- Some heads focus on adjacent word modification (such as adjectives and nouns).
5. Mathematical Essence: Weighted Sum of Feature Vectors
Core self-attention formula:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
- $QK^T$: Compute a similarity matrix, measuring how well each element matches every other element.
- Softmax: Convert the matching scores in each row into a probability distribution (weights), so that the weights each element assigns sum to 1.
- Multiply by V: Perform a weighted sum of all Value vectors using the computed weights — the larger the weight, the greater the contribution of that position to the final output.
Summary
The self-attention mechanism breaks through the bottlenecks of traditional models in long-sequence modeling and semantic capture through global perception, dynamic weighting, and parallel computation. It allows the model not only to "see" every word at once but also to model the interactions between words, which is the fundamental reason why it is effective.