Attention Distribution of Current Head
See which tokens it wants to reference most.
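A single attention head can of course compute weights on its own, but it often yields only one "blended view". Multi-head attention does not simply copy that head four times; it lets different heads use different WQ / WK / WV to look at different relationships, then concatenates these sub-results into a more complete token representation.
First pick a scenario, then pick the query token currently being updated. The four cards below represent four different attention heads. They all see the same sentence, but each concentrates its attention on different positions, so each learns a different type of relationship.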
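The per-head view described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the page's actual implementation: the sizes, the random embeddings `X`, the random matrices `W_Q` and `W_K`, and the helper `head_attention_rows` are all hypothetical stand-ins, so the resulting weights carry no linguistic meaning. It only shows that one query token gets one attention row per head, each computed with that head's own projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention_rows(X, W_Q, W_K, query_index):
    """Return one attention row per head for a single query token.

    X:   (n_tokens, d_model) token embeddings
    W_Q: (n_heads, d_model, d_head) per-head query projections
    W_K: (n_heads, d_model, d_head) per-head key projections
    """
    d_head = W_Q.shape[-1]
    q = np.einsum("d,hdk->hk", X[query_index], W_Q)   # (n_heads, d_head)
    K = np.einsum("nd,hdk->hnk", X, W_K)              # (n_heads, n_tokens, d_head)
    scores = np.einsum("hk,hnk->hn", q, K) / np.sqrt(d_head)
    return softmax(scores, axis=-1)                   # each row sums to 1

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))

rows = head_attention_rows(X, W_Q, W_K, query_index=2)
top_tokens = rows.argmax(axis=1)  # the token each head references most
print(rows.round(2))
print(top_tokens)
```

Each row of `rows` corresponds to one card: the same sentence and the same query token, but a different distribution because the head's projections differ.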
The left side averages the weights of the four heads, approximating a "single mixed perspective"; the right side preserves each head's independent row. You'll see that the value of multi-head attention isn't just width: it models different relationships separately first, then lets the subsequent projection fuse them.
If all relationships are blended into one row, usually only a few "compromise peaks" remain most prominent.
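A small numeric sketch of that flattening effect, using hand-picked weights rather than anything learned: the four rows in `head_rows` are assumed values chosen so that each head has one sharp peak, and averaging stands in for the "single mixed perspective".

```python
import numpy as np

# Hypothetical attention rows for one query token over four key tokens.
# Each head is sharply peaked, but not all on the same token.
head_rows = np.array([
    [0.70, 0.10, 0.10, 0.10],  # head 0 prefers token 0
    [0.10, 0.70, 0.10, 0.10],  # head 1 prefers token 1
    [0.10, 0.10, 0.70, 0.10],  # head 2 prefers token 2
    [0.70, 0.10, 0.10, 0.10],  # head 3 also prefers token 0
])

# Averaging the heads approximates a single blended row.
mixed_row = head_rows.mean(axis=0)

print(mixed_row)                # approximately [0.4, 0.25, 0.25, 0.1]
print(head_rows.argmax(axis=1)) # per-head favourites: [0 1 0 ...] differ by head
```

In the averaged row only token 0 still stands out, and its peak drops from 0.70 to 0.40; the distinct preferences of heads 1 and 2 shrink to 0.25 each and are no longer distinguishable from one another.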
Highlight explanation: Different heads often focus on different targets, which is why there's something to concatenate later.
Each head first produces a sub-vector. What is sent to subsequent layers is not any single head's result: the sub-vectors are first concatenated as Concat(head_1, ..., head_h), then projected back to model width via WO. You can manually toggle which heads participate in the fusion to see which clues the final representation loses.
Concat preserves "what each head saw separately" without blending them together yet.
Final = Concat × WO. This step re-fuses multi-subspace information into a new token representation.
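The fusion step, including the head toggle, can be sketched as follows. The sizes, the random `head_outputs` and `W_O`, and the `fuse` helper are hypothetical; in particular, modelling a disabled head by zeroing its sub-vector before concatenation is an assumption about how such a toggle might work, not a description of the page's code.

```python
import numpy as np

def fuse(head_outputs, W_O, active=None):
    """Concatenate per-head sub-vectors and project them with W_O.

    head_outputs: (n_heads, d_head) sub-vectors for one token
    W_O:          (n_heads * d_head, d_model) output projection
    active:       indices of heads that participate; None means all heads.
                  Disabled heads are zeroed out (an illustrative assumption).
    """
    n_heads = head_outputs.shape[0]
    if active is None:
        mask = np.ones(n_heads)
    else:
        mask = np.isin(np.arange(n_heads), active).astype(float)
    concat = (head_outputs * mask[:, None]).reshape(-1)  # Concat(head_1, ..., head_h)
    return concat @ W_O                                   # Final = Concat x WO

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
n_heads, d_head = 4, 4
d_model = n_heads * d_head
head_outputs = rng.normal(size=(n_heads, d_head))
W_O = rng.normal(size=(d_model, d_model))

full = fuse(head_outputs, W_O)
without_head_2 = fuse(head_outputs, W_O, active=[0, 1, 3])

# What the final representation loses when head 2 is switched off.
lost = full - without_head_2
print(np.linalg.norm(lost))
```

Because concatenation keeps each head in its own slice, the lost contribution is exactly head 2's sub-vector multiplied by the matching block of rows in `W_O`.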
Different heads have different parameters, so they do not learn copies of the same weight row; they learn different relationship perspectives in different subspaces.
The key structure of multi-head attention is "let each head model separately first, then aggregate with Concat + WO", rather than mixing all relationships together from the start.
The real improvement is that the same token can carry more relationship clues simultaneously, such as syntax, coreference, time, and local collocations, without forcing one head to handle everything.