Page 17 · SimLabs LLM Visual

Residual, LayerNorm & FFN: The Other Half Beyond Attention

When first learning the Transformer, many people treat the attention mechanism as the whole story. But for a Block to be stable, trainable, and able to keep refining its representations, it needs more than Attention: residual connections preserve the backbone, LayerNorm stabilizes the numerical scale, and the FFN applies an additional small network to each token individually.

Track the same token through three components · Understand "retain, normalize, reprocess" · Complete the intuition beyond Attention

First, fix on one token and see what each of these three components does

Choose a component, then select the token you want to track. The token's vector before and after that component is shown directly below. We recommend viewing Residual first, then LayerNorm, and finally FFN, since this more closely matches the processing order in a real Block.

Before entering the component

Previous stage representation of the current token

After passing through the current component

New representation of the current token

Residual: It is not repetition, but stacking increments onto the backbone

Drag the slider below to change the strength of the "new information increment" stacked back onto the backbone. You will see that the key to Residual is not replacing the old representation, but adding the newly learned delta back while preserving the original representation.

Adjust Increment Strength

Increment coefficient λ 1.00

How the output vector changes
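
The slider's behavior can be sketched in a few lines of plain Python. This is a minimal illustration rather than the page's actual implementation: the function name `residual_add`, the example vectors, and the treatment of λ as a simple multiplier on the increment are all assumptions made for the sketch.

```python
def residual_add(x, delta, lam=1.0):
    """Return x + lam * delta, element by element.

    x     : the backbone, i.e. the token's representation before the sub-layer
    delta : the increment produced by the sub-layer (hypothetical values here)
    lam   : increment coefficient, mirroring the slider (assumed semantics)
    """
    if len(x) != len(delta):
        raise ValueError("x and delta must have the same length")
    return [xi + lam * di for xi, di in zip(x, delta)]


# Made-up numbers, chosen only so the arithmetic is easy to follow.
x = [1.0, -2.0, 0.5]
delta = [0.5, 0.5, -1.0]

for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam}: {residual_add(x, delta, lam)}")
```

With λ = 0 the output is the backbone unchanged; with λ = 1 the full increment is added, which corresponds to the plain `x + sublayer(x)` form used in standard Transformer blocks. In both cases the old representation is never overwritten.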

LayerNorm: It does not add knowledge, but tidies up numerical scale

LayerNorm normalizes each token's own row independently, shifting and rescaling that row's values toward zero mean and unit variance, without mixing information across tokens. Below, the left side is the matrix before normalization, and the right side is the matrix after normalization. The highlighted row is the token you are currently tracking.

Before Normalization
After Normalization

Dimension changes of the current token

Dimension changes after LayerNorm
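
The per-row behavior described in this section can be sketched as follows. This is a minimal version under stated assumptions: the learned scale `gamma` and shift `beta` default to 1 and 0, `eps` is a small constant for numerical safety, and the matrix `X` holds made-up values with one row per token.

```python
import math


def layer_norm(row, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's vector to roughly zero mean and unit variance."""
    d = len(row)
    mean = sum(row) / d
    var = sum((v - mean) ** 2 for v in row) / d  # population variance over features
    inv_std = 1.0 / math.sqrt(var + eps)
    gamma = [1.0] * d if gamma is None else gamma  # learned scale (identity here)
    beta = [0.0] * d if beta is None else beta     # learned shift (zero here)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(row, gamma, beta)]


def layer_norm_rows(matrix):
    """Apply layer_norm to every row separately; rows never see each other."""
    return [layer_norm(row) for row in matrix]


# Hypothetical token matrix: rows are tokens, columns are feature dimensions.
X = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],   # same pattern as row 0 at ten times the scale
    [-5.0, 0.0, 5.0, 100.0],
]
Y = layer_norm_rows(X)

for before, after in zip(X, Y):
    print(before, "->", [round(v, 3) for v in after])
```

Rows 0 and 1 differ only in scale, so they come out nearly identical after normalization. That is the "tidying up" effect: the relative pattern within a token is kept while its overall magnitude is brought to a common scale.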

FFN: Each token passes through the same small network, independently

FFN is not responsible for cross-token communication; that is Attention's job. Instead, the FFN takes the representation each token has already obtained, feeds it through the same small network, and refines it into stronger local features. You can think of it as every token being handled separately by one small processor whose parameters are shared across all tokens.

FFN Input
FFN Output

Hidden layer of the current token

What this layer is refining
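
A position-wise FFN can be sketched as two linear layers with a nonlinearity in between, applied to each token separately. The sketch below assumes a ReLU activation and uses tiny hand-picked weights (`W1`, `b1`, `W2`, `b2` are hypothetical, not taken from any trained model) so that the result is easy to check by hand: with these weights the network computes the absolute value of each input dimension.

```python
def linear(x, W, b):
    """Compute W @ x + b, where W is a list of rows with shape (out, in)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def ffn(x, W1, b1, W2, b2):
    """Expand to the hidden width, apply ReLU, project back to the model width."""
    hidden = relu(linear(x, W1, b1))
    return linear(hidden, W2, b2), hidden


def ffn_per_token(tokens, W1, b1, W2, b2):
    """Run the same FFN (shared parameters) on every token independently."""
    return [ffn(tok, W1, b1, W2, b2)[0] for tok in tokens]


# Toy sizes: model width 2, hidden width 4.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, -2.0], [-3.0, 0.5]]  # made-up token vectors
out, hidden = ffn(tokens[0], W1, b1, W2, b2)
print("hidden:", hidden)
print("output:", out)
print("all tokens:", ffn_per_token(tokens, W1, b1, W2, b2))
```

Because `ffn_per_token` processes each token in isolation, reordering the tokens simply reorders the outputs. Nothing flows between positions here, which is the contrast with Attention drawn above.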

After studying this page, it's best to remember three things

Residual is responsible for preserving the backbone

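It lets new information be stacked onto the old representation as an "increment" instead of discarding the old representation entirely. Because the old representation always has an unmodified path forward, signals and gradients can pass through many layers directly, which is one of the key reasons deep networks become easier to train.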

LayerNorm is responsible for tidying up numerical values

It does not add new semantics, but it stabilizes the distribution of values within each token, so that subsequent sub-layers can work at a more controlled scale.

FFN is responsible for internal reprocessing within each token

Attention is responsible for "looking at others," while FFN is responsible for "processing itself." With the two working together, a Block can both integrate context and continue refining local features.
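
For reference, here is how the three components are commonly composed around Attention in one Block. The page does not state which arrangement its visualization follows, so both widely used variants are shown; the symbols $x$, $h$, and $y$ (Block input, intermediate state, Block output) are notation introduced here for illustration.

```latex
% Post-LN (the original Transformer): normalize after each residual addition
h = \mathrm{LayerNorm}\bigl(x + \mathrm{Attention}(x)\bigr), \qquad
y = \mathrm{LayerNorm}\bigl(h + \mathrm{FFN}(h)\bigr)

% Pre-LN (common in later models): normalize the input to each sub-layer
h = x + \mathrm{Attention}\bigl(\mathrm{LayerNorm}(x)\bigr), \qquad
y = h + \mathrm{FFN}\bigl(\mathrm{LayerNorm}(h)\bigr)
```

In both variants the residual addition wraps each sub-layer, so the backbone $x$ is carried forward with the Attention and FFN outputs added on top.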