Tokenization — Converting Text to Numbers
Neural networks process numbers, not words. Tokenization converts raw text into integer IDs the model can process. The choice of tokenizer — word-level, character-level, or subword (BPE) — fundamentally affects model vocabulary size, OOV handling, and downstream performance. Modern LLMs all use subword tokenization.
Tokenization Strategies
from transformers import AutoTokenizer
import torch
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# THREE TOKENIZATION STRATEGIES
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
text = "Artificial intelligence is transforming the world!"
# 1. WORD-LEVEL TOKENIZATION
# Problem: large vocabulary, OOV (out-of-vocabulary) words for rare words, typos
word_tokens = text.lower().split()
print(f"Word tokens: {word_tokens}")
# ['artificial', 'intelligence', 'is', 'transforming', 'the', 'world!']
# Problem: 'world!' ≠ 'world' — punctuation causes OOV issues
# 2. CHARACTER-LEVEL TOKENIZATION
# Good: no OOV. Bad: sequences are very long, harder to learn word meaning
char_tokens = list(text.lower())
print(f"Char tokens: {char_tokens[:10]}...") # ['a', 'r', 't', 'i', ...]
print(f"Sequence length: {len(char_tokens)}") # 51 chars vs 7 words — too long!
# 3. SUBWORD TOKENIZATION — BPE (Byte Pair Encoding)
# Best of both: handles OOV by splitting unknown words into known subwords
# "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]
# Used by all modern LLMs: GPT-4 and LLaMA use BPE variants; BERT uses WordPiece, a related subword scheme
# GPT-2 BPE tokenizer (GPT-4 uses its own larger BPE vocabulary via tiktoken)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer(text)
print(f"\nBPE token IDs: {tokens['input_ids']}")
print(f"BPE tokens: {[tokenizer.decode([i]) for i in tokens['input_ids']]}")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# BATCH TOKENIZATION with padding and truncation
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
texts = [
"I love this movie!",
"The product quality is absolutely terrible and I want a refund.",
"OK.",
]
# Padding: add [PAD] tokens to match longest sequence in batch
# Truncation: cut sequences longer than max_length
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = bert_tokenizer(
    texts,
    padding=True,          # pad to longest in batch
    truncation=True,       # truncate to max_length
    max_length=128,
    return_tensors="pt",   # return PyTorch tensors
)
print(f"\nBERT Tokenization:")
print(f"input_ids shape: {encoded['input_ids'].shape}") # [3, max_len]
print(f"attention_mask shape: {encoded['attention_mask'].shape}") # [3, max_len]
# attention_mask: 1 = real token, 0 = padding (model ignores padded positions)
print(f"\nSpecial tokens: [CLS]={bert_tokenizer.cls_token_id}, [SEP]={bert_tokenizer.sep_token_id}, [PAD]={bert_tokenizer.pad_token_id}")
# BERT wraps every input: [CLS] + tokens + [SEP] + [PAD]...
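# Sketch: inspect the first batch element to see the special-token layout and
# where padding starts (pad length depends on the longest text in the batch)
first_tokens = bert_tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())
print(f"First sequence: {first_tokens}")
print(f"Its mask: {encoded['attention_mask'][0].tolist()}")
# Expected to look like: ['[CLS]', 'i', 'love', 'this', 'movie', '!', '[SEP]', '[PAD]', ...]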
# Vocabulary sizes:
tokenizers_vocab = {
"BERT-base": 30_522, # WordPiece vocabulary
"GPT-2": 50_257, # BPE vocabulary
"LLaMA 3": 128_256, # larger vocabulary = better multilingual
"T5": 32_100, # SentencePiece
}
for name, size in tokenizers_vocab.items():
print(f" {name:15s}: {size:,} tokens")Tip
Tip
Practice tokenization in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Remember the modern NLP recipe: Transformer-based models, pre-trained on large corpora and then fine-tuned for the target task.
Practice Task
(1) Write a working tokenization example from scratch, converting text to token IDs, without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with tokenization is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
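As a minimal sketch of that kind of boundary check, reusing the bert_tokenizer loaded above, the snippet below shows what the tokenizer returns for an empty string and for a batch that contains one:
# Edge-case sketch: empty strings still tokenize, but only special tokens remain,
# and the attention mask exposes the padding added for the blank row
empty = bert_tokenizer("", return_tensors="pt")
print(empty["input_ids"])             # just the [CLS] and [SEP] token IDs
blank_batch = bert_tokenizer(["I love this movie!", ""], padding=True, return_tensors="pt")
print(blank_batch["attention_mask"])  # the blank row is mostly zeros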