A training objective where models learn to predict the next token in a sequence.
Next token prediction is the core training objective behind most modern large language models (LLMs). Given a sequence of tokens — words, subwords, or characters — the model is trained to predict what token comes next. By repeating this process across billions of examples, the model learns grammar, facts, reasoning patterns, and stylistic conventions purely from the statistical structure of text. Despite its simplicity, this single objective is remarkably powerful: a model that can consistently predict the next token must, in some sense, understand the content it is processing.
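The idea of learning next-token statistics from raw text can be illustrated without a neural network at all. The sketch below (a toy bigram model over an invented corpus, not how LLMs are implemented) estimates P(next token | previous token) purely by counting, which is the same conditional distribution a neural model learns at far greater scale:

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration).
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams: how often each token follows each other token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(token):
    """Empirical P(next | token) from the bigram counts."""
    c = counts[token]
    total = sum(c.values())
    return {t: n / total for t, n in c.items()}

# "the" is followed by "cat" twice and "mat" once in the corpus.
print(next_token_distribution("the"))
```

A neural language model replaces the count table with a parameterized function conditioned on the whole preceding context, but the target it learns is the same kind of distribution.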
Mechanically, the model computes a probability distribution over the entire vocabulary at each position in the sequence. During training, the model's predicted distribution is compared against the actual next token using cross-entropy loss, and gradients are backpropagated to update model weights. Modern transformer-based architectures process all token positions in parallel using causal (masked) self-attention, which prevents any position from attending to future tokens — preserving the integrity of the prediction task while enabling efficient training at scale.
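The per-position loss can be sketched in a few lines. This is a minimal, framework-free illustration with made-up logits for a four-word toy vocabulary: the model emits one logit per vocabulary item, softmax turns logits into a distribution, and cross-entropy is the negative log-probability assigned to the true next token:

```python
import math

vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary (illustrative)

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability of the actual next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Suppose the context is "the" and the true next token is "cat" (index 1).
logits = [1.0, 3.0, 0.5, 0.2]  # hypothetical model outputs
loss = cross_entropy(logits, 1)
print(round(loss, 4))
```

The loss is small when the model concentrates probability on the correct token and grows without bound as that probability approaches zero, which is exactly the pressure that drives learning during training.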
Next token prediction serves a dual role: it is both a pre-training objective and, implicitly, a generative mechanism. At inference time, models sample or select from the predicted distribution repeatedly, appending each generated token to the context and feeding it back in — a process called autoregressive generation. This is how GPT-style models produce coherent paragraphs, answer questions, write code, and engage in dialogue. The same objective that trains the model also defines how it generates output.
The significance of next token prediction became fully apparent with the GPT series beginning in 2018, which demonstrated that scaling this simple objective on large corpora produced models with broad, transferable capabilities. It has since become the dominant pre-training paradigm for foundation models. Researchers have noted that next token prediction implicitly incentivizes world modeling — to predict text well, a model benefits from building internal representations of the underlying reality the text describes — making it far more than a narrow language task.