How LLMs Work — Pretraining & Autoregressive Generation
LLMs are trained to predict the next token given all previous tokens (causal language modeling). Applied at trillion-token scale, this single objective gives rise to emergent abilities such as reasoning, coding, translation, and creativity.
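A minimal sketch of that objective with GPT-2, assuming the HuggingFace transformers library: the targets are simply the input ids shifted by one position, and the loss is the cross-entropy between each position's predicted distribution and the token that actually came next.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok.encode("The cat sat on the mat", return_tensors="pt")  # [1, seq_len]
with torch.no_grad():
    logits = lm(ids).logits                                      # [1, seq_len, vocab_size]

# Position t predicts token t+1: drop the last logit and the first label
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, logits.size(-1)),  # predictions at positions 0..n-2
    ids[:, 1:].reshape(-1),                          # targets: the tokens at 1..n-1
)
print(f"mean next-token cross-entropy: {loss.item():.3f}")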
LLM Pretraining & Token Generation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LLM PRETRAINING OBJECTIVE: Next Token Prediction
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Input: "The cat sat on the"
# Target: "cat sat on the mat" (the input shifted left by one token)
# Loss: CrossEntropyLoss averaged over all positions
# This forces the model to learn EVERYTHING about language:
# - Grammar ("The [NOUN]" not "The [VERB]")
# - Facts ("Paris is the capital of [France]")
# - Reasoning ("If A=B and B=C then A=[C]")
# - Code ("def sum(lst): return [sum(lst)]")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# AUTOREGRESSIVE GENERATION — token by token
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
prompt = "Artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# Step-by-step autoregressive generation (what model.generate() does internally)
generated = input_ids.clone()
with torch.no_grad():
    for step in range(20):
        outputs = model(generated)
        logits = outputs.logits[:, -1, :]      # logits for the LAST position only: [1, vocab_size]
        probs = torch.softmax(logits, dim=-1)  # probability distribution over all 50,257 tokens
        # Greedy decoding: always pick the single most likely next token
        next_token = probs.argmax(dim=-1, keepdim=True)        # [1, 1]
        # Append to the sequence; the whole sequence is re-fed on the next step
        generated = torch.cat([generated, next_token], dim=1)  # [1, len+1]
        decoded_next = tokenizer.decode(next_token[0])
        print(f"  Step {step+1:2d}: next token = '{decoded_next}' (p={probs.max():.2%})")
final_text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(f"\nGenerated: {final_text}")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# KEY VOCABULARY SIZES
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
vocab_sizes = {
    "GPT-2": 50_257,
    "GPT-3.5 / GPT-4": 100_277,  # cl100k_base (tiktoken)
    "LLaMA 3": 128_256,
    "BERT": 30_522,
}
for model_name, vocab_size in vocab_sizes.items():
    print(f"  {model_name:15s}: {vocab_size:,} tokens")
Tip
Practice pretraining and autoregressive generation in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Modern NLP = Transformer-based. Pre-train, then fine-tune.
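As a minimal sketch of the second half of that recipe: fine-tuning just continues the same next-token objective on your own data (the text and learning rate below are illustrative assumptions, not recommendations).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(lm.parameters(), lr=5e-5)

batch = tok("Domain-specific text to adapt the model to.", return_tensors="pt")
out = lm(**batch, labels=batch["input_ids"])  # labels trigger the internal shift + loss
out.loss.backward()                           # one gradient step of continued pretraining
opt.step()
opt.zero_grad()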
Practice Task
(1) Write a working autoregressive generation loop from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with autoregressive generation is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
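A minimal guard for the empty-input case, as a sketch (safe_encode is a hypothetical helper, not part of transformers):

def safe_encode(tokenizer, prompt: str):
    # Reject empty or whitespace-only prompts before encoding, since an empty
    # prompt gives the model a zero-length input_ids tensor
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    return tokenizer.encode(prompt, return_tensors="pt")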