Attention scores before applying the mask
This might assign very high scores to future tokens, which is exactly the "peeking into the future" loophole.
During autoregressive generation, a decoder can only see tokens up to and including its current position; it cannot peek ahead at future tokens. Otherwise, the model would "cheat" by looking at the answers during training, but wouldn't have access to that future information during inference, leading to inconsistent behavior. The causal mask's role is to strictly prevent this leakage of future information.
First, choose a scenario, then click on the "currently computing token". The raw score matrix above shows "who it would most want to attend to if there were no restrictions"; the causal mask result on the right shows "which future positions are forcibly masked".
This might assign very high scores to future tokens, which is exactly the "peeking into the future" loophole.
Future columns are replaced with a very small value, forcing their weights to zero after the Softmax operation.
On the left below is the attention distribution "if peeking at the future were allowed"; on the right is the distribution "actually used for autoregressive generation". The key point is not the scores themselves, but whether the weights for future positions have been forced to zero.
0%
Without the mask, what percentage of the current query's attention would flow to future positions.
100%
After applying the mask, the model can still fully attend to the history and the current position; it's not "completely blocked".
Train = Infer
Both training and inference adhere to the same visibility scope, ensuring the model doesn't learn to rely on information during training that won't be available during inference.
Requires a causal mask. The goal is sequential generation, so the current position cannot see subsequent tokens in advance.
Typically does not require a causal mask. For tasks like classification, encoding, or representation retrieval, the entire input sequence is fully visible from the start.
Usually does not apply this kind of "future masking". When the decoder reads the encoder's output, it accesses the complete input information, which does not constitute a leak of future answers.