Transformer Block Architecture
Understanding Q, K, and V alone is not enough, because real models don't stop after one attention operation. A Transformer Block chains together LayerNorm, Self-Attention, Residual Add, and a feed-forward network (FFN), so that each token absorbs context while preserving its original information.
Treat a Block as a Traceable Pipeline
Each stage of the block can be inspected on its own: at every step, look at the matrix for the full sentence and at how a single tracked token changes before and after that step. The walkthrough uses a simplified, teaching version of the Pre-Norm structure, in which LayerNorm is applied before each sublayer, to help build stable intuition.
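The pipeline can be sketched in NumPy so that every intermediate stage is kept and can be inspected. This is a minimal sketch under simplifying assumptions, not a reference implementation: a single attention head, no mask, no biases, no learned LayerNorm scale or shift, a ReLU FFN, and random matrices standing in for learned weights. All names (`pre_norm_block`, `stages`, `params`) are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token (row) across its features.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2  # expand, ReLU, project back

def pre_norm_block(x, p):
    # Record every stage so a single token can be traced through the block.
    stages = {"input": x}
    stages["norm1"] = layer_norm(x)
    stages["attn"] = self_attention(stages["norm1"], p["Wq"], p["Wk"], p["Wv"])
    stages["add1"] = x + stages["attn"]                 # first residual add
    stages["norm2"] = layer_norm(stages["add1"])
    stages["ffn"] = ffn(stages["norm2"], p["W1"], p["W2"])
    stages["add2"] = stages["add1"] + stages["ffn"]     # second residual add
    return stages

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 4, 8, 32
params = {
    "Wq": rng.normal(scale=0.1, size=(d_model, d_model)),
    "Wk": rng.normal(scale=0.1, size=(d_model, d_model)),
    "Wv": rng.normal(scale=0.1, size=(d_model, d_model)),
    "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
}
x = rng.normal(size=(n_tokens, d_model))
stages = pre_norm_block(x, params)

token = 2  # index of the token to track
for name, value in stages.items():
    print(name, value.shape, np.round(value[token, :3], 3))
```

Changing `token` follows a different row through the same stages; the sentence-level matrix at each stage is the full `value` array.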
What Role Does Each Component Play?
LayerNorm
LayerNorm doesn't add new knowledge; it rescales each token's features to a consistent numerical range. This lets the subsequent attention and FFN operations work on a stable scale.
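A small sketch of the rescaling, assuming the common formulation with a learned scale `gamma` and shift `beta` (set to ones and zeros here so that only the normalization is visible). The two example tokens differ in magnitude by a factor of 100 but come out on nearly the same scale.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per token, across the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Two tokens with the same pattern but very different magnitudes.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 200.0, 300.0, 400.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(out, 3))
```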
Self-Attention
Self-Attention lets the current token look at the other tokens in the sentence and pull in the context most relevant to it, forming a contextualized representation.
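As a rough illustration, single-head scaled dot-product attention can be written in a few lines. The projection matrices below are random stand-ins for learned weights, and there is no mask or multi-head split; those are simplifications of this sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # relevance of every token to every other
    weights = softmax(scores)                # each row sums to 1
    return weights @ v, weights              # context = weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                  # 3 tokens, 4 features each
Wq, Wk, Wv = (rng.normal(scale=0.5, size=(4, 4)) for _ in range(3))
context, weights = self_attention(x, Wq, Wk, Wv)
print(np.round(weights, 3))
```

Row `i` of `weights` shows how much token `i` draws from each token in the sentence, including itself.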
Residual Add
The residual add is not redundant repetition: it keeps both the "original signal" and the "newly learned changes" on the same pathway, which prevents deep networks from becoming harder to train as layers are added.
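In code, the residual add is a single addition. The sketch below uses two hypothetical sublayers, `silent` and `nudge`, purely to show that the input always survives and the sublayer only contributes a change on top of it.

```python
import numpy as np

def residual_add(x, sublayer):
    # Keep the original signal and add the sublayer's proposed change to it.
    return x + sublayer(x)

def silent(h):
    # Hypothetical sublayer that has learned nothing yet: proposes no change.
    return np.zeros_like(h)

def nudge(h):
    # Hypothetical sublayer that proposes a small change.
    return 0.1 * h

x = np.array([[1.0, -2.0, 0.5]])
unchanged = residual_add(x, silent)  # the input passes through untouched
adjusted = residual_add(x, nudge)    # the input plus a small adjustment
print(unchanged, adjusted)
```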
FFN
Although the FFN doesn't communicate across tokens, it performs deeper non-linear transformations within each token, compensating for Attention's limitation of only "looking at others".
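The position-wise nature of the FFN can be checked directly: applying it to the whole sentence gives the same result as applying it to each token on its own. This sketch assumes a two-layer ReLU FFN with random stand-in weights and zero biases.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(x @ W1 + b1, 0.0)  # expand to d_ff and apply ReLU
    return hidden @ W2 + b2                # project back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))
batched = ffn(x, W1, b1, W2, b2)                                  # all tokens at once
per_token = np.stack([ffn(tok, W1, b1, W2, b2) for tok in x])     # one token at a time
print(np.allclose(batched, per_token))
```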
What Happens If We Remove One Component?
No Residual
Each layer can rely only on its newly computed results, so the original information is easily overwritten and deeper networks become harder to train stably.
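A toy comparison makes the effect visible. The setup is an assumption chosen for illustration: 12 stacked layers, each a `tanh` of a small random linear map, run once with and once without the residual add. With weights this small, the no-residual signal fades almost completely, while the residual path still resembles the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, depth = 8, 12
x = rng.normal(size=(4, d_model))
# Hypothetical small random weights, one matrix per layer.
layer_weights = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(depth)]

def sublayer(h, W):
    return np.tanh(h @ W)

def cosine(a, b):
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

h_plain, h_res = x, x
for W in layer_weights:
    h_plain = sublayer(h_plain, W)       # no residual: the output replaces the input
    h_res = h_res + sublayer(h_res, W)   # residual: the output is added to the input

plain_ratio = np.linalg.norm(h_plain) / np.linalg.norm(x)
res_similarity = cosine(h_res, x)
print(plain_ratio, res_similarity)
```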
No LayerNorm
Numerical scales drift more easily from layer to layer, which makes training more sensitive, especially in deep stacks.
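The drift can be illustrated with a deliberately simple toy: a hypothetical sublayer `amplify` that always proposes a change proportional to its input. Without LayerNorm, the scale that each sublayer sees grows layer after layer; with Pre-Norm, every sublayer sees an input of roughly unit scale. Real sublayers are not this regular, so this is a caricature rather than a measurement.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def amplify(h):
    # Hypothetical toy sublayer: proposes a change proportional to its input.
    return 0.5 * h

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

h_plain, h_norm = x.copy(), x.copy()
plain_scales, norm_scales = [], []
for _ in range(10):
    plain_scales.append(float(h_plain.std()))  # sublayer input scale, no LayerNorm
    h_plain = h_plain + amplify(h_plain)

    normed = layer_norm(h_norm)
    norm_scales.append(float(normed.std()))    # sublayer input scale, Pre-Norm
    h_norm = h_norm + amplify(normed)

print(np.round(plain_scales, 2))
print(np.round(norm_scales, 2))
```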
Attention Only, No FFN
The model excels at "moving information between tokens" but struggles to extract more complex features within each token.
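One way to see this limitation: given the value vectors, the attention output is a weighted average of them, so every output feature stays within the range already present across tokens. The sketch below uses random stand-in attention weights (any rows that are non-negative and sum to 1 behave the same way).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
v = rng.normal(size=(5, 3))                  # value vectors, one row per token
weights = softmax(rng.normal(size=(5, 5)))   # stand-in attention weights
mixed = weights @ v                          # attention output

# Each output feature lies between the smallest and largest value of that feature.
lower, upper = v.min(axis=0), v.max(axis=0)
inside = bool(np.all((mixed >= lower - 1e-9) & (mixed <= upper + 1e-9)))
print(inside)
```

The mixing step moves information around but does not create new per-token features by itself; in a full block that job falls to the FFN's non-linearity.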
FFN Only, No Attention
Each token is processed in isolation: it cannot see other positions in the sentence, so long-distance relationships are hard to capture.
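The isolation is easy to test: edit one token and check whether the outputs for the other tokens move. In this sketch (random stand-in weights, single-head attention, ReLU FFN, all assumptions of the example), the FFN outputs for the untouched tokens stay the same, while the attention outputs for them change.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 16)), rng.normal(size=(16, d))

x = rng.normal(size=(3, d))
x_edit = x.copy()
x_edit[0] += 1.0  # change only the first token

# Compare the outputs for the tokens that were NOT edited (rows 1 and 2).
ffn_unchanged = bool(np.allclose(ffn(x, W1, W2)[1:], ffn(x_edit, W1, W2)[1:]))
attn_unchanged = bool(np.allclose(attention(x, Wq, Wk, Wv)[1:],
                                  attention(x_edit, Wq, Wk, Wv)[1:]))
print(ffn_unchanged, attn_unchanged)
```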