Residual, LayerNorm & FFN: The Other Half Beyond Attention
Many people, when first learning the Transformer, treat the attention mechanism as the whole story. However, for a Block to be stable, trainable, and able to keep refining representations, it relies not only on Attention but also on residual connections to preserve the backbone, LayerNorm to stabilize the numerical scale, and the FFN to run each token through an additional small network.
First, fix on one token and see what each of these three components does
Choose a component, then select the token you want to track. The change in that token's vector before and after the component is shown directly below. We recommend viewing Residual first, then LayerNorm, and finally FFN, since this order is closer to the processing order in a real Block.
New representation of the current token
Residual: It is not repetition, but stacking increments onto the backbone
Drag the slider below to change the strength of the "new information increment" stacked back onto the backbone. You will see that the key to Residual is not replacing the old representation, but adding the newly learned delta back while preserving the original representation.
Adjust Increment Strength
How the output vector changes
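The slider can be read as a scalar multiplier on the increment. The following is a minimal NumPy sketch of that idea, not the page's own implementation: the function name `residual_add`, the `strength` parameter, and the example numbers are all hypothetical. In a standard Transformer the increment is simply added with strength 1.

```python
import numpy as np

def residual_add(x, delta, strength=1.0):
    """Return the old representation plus a scaled increment.

    `strength` mimics the slider (an assumption for illustration);
    a standard residual connection corresponds to strength = 1.0.
    """
    return x + strength * delta

# Hypothetical 4-dimensional token vector and sub-layer output.
x = np.array([1.0, -0.5, 2.0, 0.0])      # backbone (old representation)
delta = np.array([0.2, 0.4, -1.0, 0.6])  # newly computed increment

for strength in (0.0, 0.5, 1.0):
    y = residual_add(x, delta, strength)
    print(f"strength={strength}: {y}")
```

At strength 0 the output is exactly the old vector, so the backbone is never lost. Larger values move the output further along the direction of the increment.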
LayerNorm: It does not add knowledge, but tidies up numerical scale
LayerNorm normalizes each token's own row independently, rescaling that row to roughly zero mean and unit variance across its dimensions (before any learned scale and shift), without mixing information across tokens. Below, the left side is the matrix before normalization, and the right side is the matrix after normalization. The highlighted row is the token you are currently tracking.
Dimension changes of the current token
Dimension changes after LayerNorm
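The row-by-row behavior can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions: the matrix values are made up, `eps` is a typical small constant, and the learned scale and shift (`gamma`, `beta`) are optional and left off in the example.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize each row (one token) over its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)   # per-token mean
    var = x.var(axis=-1, keepdims=True)     # per-token variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    if gamma is not None:                   # optional learned scale
        x_hat = x_hat * gamma
    if beta is not None:                    # optional learned shift
        x_hat = x_hat + beta
    return x_hat

# Hypothetical matrix: 3 tokens (rows) x 4 dimensions (columns).
tokens = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],   # same pattern as row 0, 10x the scale
    [-5.0, 0.0, 5.0, 100.0],
])
normed = layer_norm(tokens)

# Overwriting another token's row leaves row 0's result untouched.
changed = tokens.copy()
changed[2] = 0.0
normed_changed = layer_norm(changed)

print(normed)
```

Rows 0 and 1 differ only in scale, so they come out almost identical after normalization. That is what "tidying up numerical scale" means here: the pattern within a token survives while its overall magnitude is standardized.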
FFN: Each token passes through its own small network
FFN is not responsible for cross-token communication; that is Attention's job. What FFN does is take the representation each token has already obtained, feed it into the same small network, and refine it further into stronger local features. You can think of it as every token being handled by the same small processor: the parameters are shared, but each token is processed on its own.
Hidden layer of the current token
What this layer is refining
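A position-wise FFN can be sketched as two linear maps with a nonlinearity in between. The sizes, the ReLU activation, and the random weights below are assumptions chosen for illustration; the page does not specify them. The point of the example is that running one token alone gives the same result as running it inside the full matrix.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """Two-layer feed-forward network applied to each row independently."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # expand to hidden size, then ReLU
    return hidden @ w2 + b2                # project back to model size

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 16                  # hypothetical sizes (4x expansion)
w1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=0.1, size=(d_hidden, d_model))
b2 = np.zeros(d_model)

tokens = rng.normal(size=(3, d_model))     # 3 tokens, made-up values

out_all = ffn(tokens, w1, b1, w2, b2)        # all tokens at once
out_single = ffn(tokens[1], w1, b1, w2, b2)  # token 1 on its own

print(np.allclose(out_all[1], out_single))
```

Because the same weights are applied to every row and no row reads from another, the FFN cannot move information between tokens.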
After studying this page, it's best to remember three things
Residual is responsible for preserving the backbone
It allows new information to be stacked back onto the old representation in the form of "increments," rather than discarding the entire old representation. This is one of the key reasons deep networks become easier to train.
LayerNorm is responsible for tidying up numerical values
It does not add new semantics, but it makes the internal numerical distribution of each token more stable, allowing subsequent sub-layers to continue working on a more controllable scale.
FFN is responsible for internal reprocessing within each token
Attention is responsible for "looking at others," while FFN is responsible for "processing itself." With the two working together, a Block can both integrate context and continue refining local features.
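To see how the three pieces fit around Attention, here is a rough sketch of one Block. It assumes the "add, then normalize" (post-LN) ordering, a single attention head, no biases, and no learned LayerNorm scale or shift. All names, sizes, and weights are hypothetical, and real implementations differ in these details.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head attention: the only step where tokens read each other."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, w1, w2):
    """Per-token processing: no mixing across rows."""
    return np.maximum(0.0, x @ w1) @ w2

def block(x, p):
    # Residual keeps the backbone, LayerNorm tidies the scale.
    x = layer_norm(x + self_attention(x, p["wq"], p["wk"], p["wv"]))
    x = layer_norm(x + ffn(x, p["w1"], p["w2"]))
    return x

rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 4, 16, 3
p = {
    "wq": rng.normal(scale=0.1, size=(d_model, d_model)),
    "wk": rng.normal(scale=0.1, size=(d_model, d_model)),
    "wv": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w1": rng.normal(scale=0.1, size=(d_model, d_hidden)),
    "w2": rng.normal(scale=0.1, size=(d_hidden, d_model)),
}
x = rng.normal(size=(n_tokens, d_model))
y = block(x, p)
print(y.shape)
```

The output has the same shape as the input, which is what allows Blocks to be stacked one after another.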