How LLMs Work — Pretraining & Autoregressive Generation
LLMs are trained to predict the next token given all previous tokens (causal language modeling). Applied at trillion-token scale, this single objective gives rise to emergent abilities such as reasoning, coding, translation, and creativity.
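A minimal sketch of that objective with GPT-2, assuming the HuggingFace transformers library: the targets are simply the input ids shifted by one position, and the loss is the cross-entropy between each position's predicted distribution and the token that actually came next.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok.encode("The cat sat on the mat", return_tensors="pt")  # [1, seq_len]
with torch.no_grad():
    logits = lm(ids).logits                                      # [1, seq_len, vocab_size]

# Position t predicts token t+1: drop the last logit and the first label
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, logits.size(-1)),  # predictions at positions 0..n-2
    ids[:, 1:].reshape(-1),                          # targets: the tokens at 1..n-1
)
print(f"mean next-token cross-entropy: {loss.item():.3f}")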
LLM Pretraining & Token Generation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LLM PRETRAINING OBJECTIVE: Next Token Prediction
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Input: "The cat sat on the"
# Target: "cat sat on the mat" (the input shifted left by one token)
# Loss: CrossEntropyLoss averaged over all positions
# This forces the model to learn EVERYTHING about language:
# - Grammar ("The [NOUN]" not "The [VERB]")
# - Facts ("Paris is the capital of [France]")
# - Reasoning ("If A=B and B=C then A=[C]")
# - Code ("def sum(lst): return [sum(lst)]")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# AUTOREGRESSIVE GENERATION — token by token
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
prompt = "Artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# Step-by-step autoregressive generation (what model.generate() does internally)
generated = input_ids.clone()
with torch.no_grad():
    for step in range(20):
        outputs = model(generated)
        logits = outputs.logits[:, -1, :]      # logits for the LAST position only: [1, vocab_size]
        probs = torch.softmax(logits, dim=-1)  # probability distribution over all 50,257 tokens
        # Greedy decoding: always pick the single most likely next token
        next_token = probs.argmax(dim=-1, keepdim=True)        # [1, 1]
        # Append to the sequence; the whole sequence is re-fed on the next step
        generated = torch.cat([generated, next_token], dim=1)  # [1, len+1]
        decoded_next = tokenizer.decode(next_token[0])
        print(f"  Step {step+1:2d}: next token = '{decoded_next}' (p={probs.max():.2%})")
final_text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(f"\nGenerated: {final_text}")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# KEY VOCABULARY SIZES
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
vocab_sizes = {
    "GPT-2": 50_257,
    "GPT-3.5 / GPT-4": 100_277,  # cl100k_base (tiktoken)
    "LLaMA 3": 128_256,
    "BERT": 30_522,
}
for model_name, vocab_size in vocab_sizes.items():
    print(f"  {model_name:15s}: {vocab_size:,} tokens")
Tip
Practice pretraining and autoregressive generation in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Modern NLP = Transformer-based. Pre-train, then fine-tune.
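As a minimal sketch of the second half of that recipe: fine-tuning just continues the same next-token objective on your own data (the text and learning rate below are illustrative assumptions, not recommendations).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(lm.parameters(), lr=5e-5)

batch = tok("Domain-specific text to adapt the model to.", return_tensors="pt")
out = lm(**batch, labels=batch["input_ids"])  # labels trigger the internal shift + loss
out.loss.backward()                           # one gradient step of continued pretraining
opt.step()
opt.zero_grad()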
Practice Task
(1) Write a working autoregressive generation loop from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with autoregressive generation is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
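A minimal guard for the empty-input case, as a sketch (safe_encode is a hypothetical helper, not part of transformers):

def safe_encode(tokenizer, prompt: str):
    # Reject empty or whitespace-only prompts before encoding, since an empty
    # prompt gives the model a zero-length input_ids tensor
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    return tokenizer.encode(prompt, return_tensors="pt")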