Current Prefix
The model can only see these tokens right now, and its task is to predict "what is most likely to come next."
The core task of training a large language model is not to "memorize entire answers" but to maximize the probability of the correct next token at each position. Think of it this way: given a prefix, the model repeatedly practices predicting what should come next, and the loss measures how far its prediction is from the correct answer.
First select a prefix scenario, then adjust the "Training Progress" slider. As training progress increases, the model assigns higher probability to the correct token, and the loss falls correspondingly. This is the core optimization direction of language model training.
If the model assigns most probability to the correct token, loss decreases; if it wastes probability on wrong candidates, loss remains high.
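To make the link between probability and loss concrete, here is a minimal sketch. It assumes the loss is the usual cross-entropy, meaning the negative log of the probability given to the correct token. The candidate tokens and their probabilities are made up for illustration, and `next_token_loss` is a hypothetical helper name.

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy at one position: -log(probability of the correct token)."""
    return -math.log(probs[target])

# Hypothetical distributions over three candidate tokens; "mat" is the correct one.
confident = {"mat": 0.90, "dog": 0.05, "sky": 0.05}  # most probability on the correct token
spread = {"mat": 0.20, "dog": 0.40, "sky": 0.40}     # probability wasted on wrong candidates

loss_confident = next_token_loss(confident, "mat")
loss_spread = next_token_loss(spread, "mat")

print(f"confident: {loss_confident:.3f}")
print(f"spread:    {loss_spread:.3f}")
```

The loss depends only on the probability of the correct token, so any probability spent on wrong candidates raises it.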
During training, scoring doesn't happen only at the end of a sentence. Instead, a "prefix → next token" supervision signal is generated at every position. Click any position below to see what input and target the model actually faced.
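The per-position supervision can be sketched as follows. This is an illustrative example only: the sentence and its word-level split are assumptions (real models use subword tokenizers), and `prefix_target_pairs` is a hypothetical helper name.

```python
def prefix_target_pairs(tokens):
    """Return one (prefix, target) pair for every position after the first token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Hypothetical word-level tokens, chosen only for illustration.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = prefix_target_pairs(tokens)

for prefix, target in pairs:
    print(" ".join(prefix), "->", target)
```

A sequence of six tokens yields five training pairs, one for each position that has a preceding prefix.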
After seeing the prefix, the model needs to assign more probability to the actual next token.
A training sample typically contributes loss at multiple positions, which are then aggregated into the overall objective. Think of it this way: the model tries to make fewer mistakes at every step of the sentence.
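As a rough sketch of the aggregation step, the example below averages the per-position losses of one sample. Averaging is an assumption here, since it is one common choice among several. The probabilities are invented numbers standing in for what a model might assign to the correct token at each position, and `sequence_loss` is a hypothetical helper name.

```python
import math

def sequence_loss(correct_token_probs):
    """Mean of -log(p) over positions, where p is the probability of the correct token."""
    losses = [-math.log(p) for p in correct_token_probs]
    return sum(losses) / len(losses)

# Hypothetical probabilities of the correct token at four positions.
early = [0.10, 0.05, 0.20, 0.10]  # early in training: correct tokens get little probability
late = [0.80, 0.60, 0.90, 0.70]   # later in training: correct tokens get most of it

early_loss = sequence_loss(early)
late_loss = sequence_loss(late)

print(f"early: {early_loss:.3f}")
print(f"late:  {late_loss:.3f}")
```

Because every position contributes a term, improving the prediction at any single step lowers the overall objective.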
Essentially, training is about continuously increasing the probability of the correct next token, not about grasping the "entire answer" at once.
High loss usually means the correct token's probability is still too low; low loss means the model is closer to the correct distribution at that position.
Each position in a sequence generates supervision signals, so the model learns at every step of the sequence, not just at the end.