Page 16 · SimLabs LLM Visual

Multi-Head Attention: Why Split into Multiple Heads

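A single attention head can certainly compute weights, but it often yields only one "blended view". Multi-head attention does not simply make four copies. It lets different heads use different WQ / WK / WV to look at different relationships, then concatenates these sub-results into a more complete token representation.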

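See how different heads each interpret the same sentence · Compare single-head averaging with multi-head division of labor · Understand Concat + WO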

Same Sentence, Different Heads Follow Different Clues

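First pick a scenario, then pick the query token currently being updated. The four cards below represent four different attention heads. They all see the same sentence, but each concentrates its attention on different positions, so each learns a different type of relationship.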
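As a rough illustration of how four heads can produce four different attention rows for the same query token, here is a minimal sketch in plain Python. The sentence, the dimensions (d_model = 8, four heads of width 2), and the random embeddings and WQ / WK matrices are all assumptions made for the example. They stand in for learned parameters and are not the values used by the cards on this page. No causal mask is applied.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, mat):
    """Row vector (length n) times matrix (n x m) gives a row vector (length m)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def rand_matrix(rng, rows, cols):
    """Random matrix standing in for learned weights (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]      # hypothetical sentence
d_model, num_heads = 8, 4
d_head = d_model // num_heads

# Hypothetical token embeddings, one row per token.
embeddings = rand_matrix(rng, len(tokens), d_model)

# Each head owns its own WQ and WK, so each head scores the sentence differently.
heads = [
    {"WQ": rand_matrix(rng, d_model, d_head), "WK": rand_matrix(rng, d_model, d_head)}
    for _ in range(num_heads)
]

def attention_row(head, query_index):
    """Attention weights of one head for one query token over all tokens."""
    q = matvec(embeddings[query_index], head["WQ"])
    keys = [matvec(x, head["WK"]) for x in embeddings]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in keys]
    return softmax(scores)

query_index = 2                              # the query token is "sat"
rows = [attention_row(head, query_index) for head in heads]

for i, row in enumerate(rows, start=1):
    print(f"head {i}:", "  ".join(f"{t}={w:.2f}" for t, w in zip(tokens, row)))
```

Every row sums to 1, but because each head projects the same embeddings through different matrices, the rows generally peak at different tokens.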

Selected Head

Attention Distribution of Current Head

See which tokens it wants to reference most.

What Does This Head Actually Do?

Why Single Head Is Not Enough: Key Clues Get Blended Together After Averaging

The left side averages the weights of the four heads to approximate a single mixed view; the right side keeps each head's own attention row. The value of multi-head attention is not just extra width: it models different relationships separately first, then lets the subsequent projection layer fuse them.

Single-Head Mixed View

Distribution After Averaging Four Heads

When every relationship is blended into one row, usually only a few "compromise peaks" still stand out.

Multi-Head Division

Each Head Maintains Its Own Attention Row

Highlight explanation: Different heads often focus on different targets, which is why there's something to concatenate later.

Tip: Repeated highlights in the same column indicate multiple heads reaching consensus; highlights in different columns indicate division of labor.
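The comparison between the averaged row and the per-head rows can be reproduced with a few lines of Python. The sentence and the four attention rows below are made-up numbers chosen for illustration, not values from the page's widget.

```python
from collections import Counter

tokens = ["Alice", "said", "she", "left", "yesterday"]   # hypothetical sentence

# Hypothetical attention rows for one query token, one row per head.
head_rows = [
    [0.70, 0.05, 0.15, 0.05, 0.05],   # head 1 peaks on "Alice"
    [0.05, 0.10, 0.05, 0.10, 0.70],   # head 2 peaks on "yesterday"
    [0.05, 0.70, 0.10, 0.10, 0.05],   # head 3 peaks on "said"
    [0.60, 0.05, 0.05, 0.25, 0.05],   # head 4 also peaks on "Alice"
]

# "Single-head mixed view": average the four rows column by column.
averaged = [sum(col) / len(head_rows) for col in zip(*head_rows)]

# Column index of the strongest token for each head.
peaks = [row.index(max(row)) for row in head_rows]

# Columns highlighted by more than one head indicate consensus;
# the rest of the peaks show division of labor.
counts = Counter(peaks)
consensus = [tokens[c] for c, n in counts.items() if n > 1]
divided = [tokens[c] for c, n in counts.items() if n == 1]

print("averaged row:", [round(w, 3) for w in averaged])
print("per-head peaks:", [tokens[c] for c in peaks])
print("consensus columns:", consensus)
print("division-of-labor columns:", divided)
```

In this toy data the averaged row's largest weight is well below the peak of any individual head, which is the flattening effect the left panel is meant to show.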

Concat + WO: Reassembling Multiple Small Vectors into a Complete Representation

Each head first produces a sub-vector. What gets sent to subsequent layers is not any single head's result: the sub-vectors are first joined as Concat(head1, ..., headh), then projected back to model width via WO. You can manually toggle which heads participate in fusion to see which clues the final representation loses.
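One way to picture the head toggle is to zero out a disabled head's slot in the concatenated vector before applying WO. That zeroing rule is an assumption about how such a toggle could work, not a description of this page's implementation, and the sub-vectors and WO below are made-up values.

```python
import random

def matvec(vec, mat):
    """Row vector (length n) times matrix (n x m) gives a row vector (length m)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

rng = random.Random(0)
num_heads, d_head = 4, 2
d_model = num_heads * d_head

# Hypothetical per-head sub-vectors for one token (made-up numbers).
head_outputs = [[1.0, 0.5], [-0.5, 2.0], [0.0, 1.0], [1.5, -1.0]]

# Hypothetical output projection WO, shape (num_heads * d_head) x d_model.
WO = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(num_heads * d_head)]

def fuse(head_outputs, active, WO):
    """Concatenate the head outputs (zeroing inactive heads), then apply WO."""
    concat = []
    for out, on in zip(head_outputs, active):
        concat.extend(out if on else [0.0] * len(out))
    return matvec(concat, WO)

full = fuse(head_outputs, [True, True, True, True], WO)
without_head2 = fuse(head_outputs, [True, False, True, True], WO)

# What the final representation loses when head 2 is switched off.
lost = [f - w for f, w in zip(full, without_head2)]

print("all heads:     ", [round(x, 3) for x in full])
print("without head 2:", [round(x, 3) for x in without_head2])
print("lost component:", [round(x, 3) for x in lost])
```

The lost component equals head 2's sub-vector multiplied by the two rows of WO that belong to its slot, so switching a head off removes exactly that head's contribution to the final vector.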

Sub-vectors from Each Head

Head Outputs

Concat

Concatenated Long Vector

Concat preserves "what each head saw separately" without blending them together yet.

Output Projection

Re-project Back to Model Space

Final = Concat × WO. This step re-fuses multi-subspace information into a new token representation.
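Written out with explicit shapes, the fusion step looks as follows. The dimensions are the usual convention and are assumed here, since the page does not state them: n tokens, h heads, each head of width d_k (with d_v = d_k), and d_model = h · d_k.

```latex
\begin{aligned}
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{(X W_i^{Q})(X W_i^{K})^{\top}}{\sqrt{d_k}}\right) X W_i^{V}
  \;\in\; \mathbb{R}^{n \times d_k}, \\[4pt]
\mathrm{Final} &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
  \;\in\; \mathbb{R}^{n \times d_{\mathrm{model}}},
  \qquad W^{O} \in \mathbb{R}^{h d_k \times d_{\mathrm{model}}}, \\[4pt]
\mathrm{Final} &= \sum_{i=1}^{h} \mathrm{head}_i\, W^{O}_{i},
  \qquad W^{O}_{i} \in \mathbb{R}^{d_k \times d_{\mathrm{model}}}
  \text{ is the } i\text{-th block of } d_k \text{ rows of } W^{O}.
\end{aligned}
```

The last line is just the block form of the matrix product: each head passes through its own slice of WO, and the slices are summed. That sum is where the separately modeled subspaces are fused into one token representation.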

Three Key Takeaways from This Page

Multi-head Isn't Just Repeating Four Times

Different heads have different parameters, so they learn different views of relationships in different subspaces rather than copies of the same weight row.

Divide First, Then Fuse

The key structure of multi-head attention is "let each head model separately first, then aggregate with Concat + WO", rather than mixing all relationships together from the start.

It Makes Representations Richer, Not Just Larger

The real improvement is that the same token can carry more relationship clues at once, such as syntax, coreference, time, and local collocations, without forcing one head to handle everything.