Tokenization — Converting Text to Numbers
Neural networks process numbers, not words. Tokenization converts raw text into integer IDs the model can process. The choice of tokenizer — word-level, character-level, or subword (BPE) — fundamentally affects model vocabulary size, OOV handling, and downstream performance. Modern LLMs all use subword tokenization.
Tokenization Strategies
from transformers import AutoTokenizer
import torch
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# THREE TOKENIZATION STRATEGIES
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
text = "Artificial intelligence is transforming the world!"
# 1. WORD-LEVEL TOKENIZATION
# Problem: large vocabulary, OOV (out-of-vocabulary) words for rare words, typos
word_tokens = text.lower().split()
print(f"Word tokens: {word_tokens}")
# ['artificial', 'intelligence', 'is', 'transforming', 'the', 'world!']
# Problem: 'world!' ≠ 'world' — punctuation causes OOV issues
# 2. CHARACTER-LEVEL TOKENIZATION
# Good: no OOV. Bad: sequences are very long, harder to learn word meaning
char_tokens = list(text.lower())
print(f"Char tokens: {char_tokens[:10]}...") # ['a', 'r', 't', 'i', ...]
print(f"Sequence length: {len(char_tokens)}") # 51 chars vs 7 words — too long!
# 3. SUBWORD TOKENIZATION — BPE (Byte Pair Encoding)
# Best of both: handles OOV by splitting unknown words into known subwords
# "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]
# Used by all modern LLMs: GPT-4 and LLaMA use BPE variants; BERT uses WordPiece, a related subword scheme
# GPT-2 BPE tokenizer (GPT-4 uses its own larger BPE vocabulary via tiktoken)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer(text)
print(f"\nBPE token IDs: {tokens['input_ids']}")
print(f"BPE tokens: {[tokenizer.decode([i]) for i in tokens['input_ids']]}")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# BATCH TOKENIZATION with padding and truncation
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
texts = [
"I love this movie!",
"The product quality is absolutely terrible and I want a refund.",
"OK.",
]
# Padding: add [PAD] tokens to match longest sequence in batch
# Truncation: cut sequences longer than max_length
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = bert_tokenizer(
    texts,
    padding=True,          # pad to longest in batch
    truncation=True,       # truncate to max_length
    max_length=128,
    return_tensors="pt",   # return PyTorch tensors
)
print(f"\nBERT Tokenization:")
print(f"input_ids shape: {encoded['input_ids'].shape}") # [3, max_len]
print(f"attention_mask shape: {encoded['attention_mask'].shape}") # [3, max_len]
# attention_mask: 1 = real token, 0 = padding (model ignores padded positions)
print(f"\nSpecial tokens: [CLS]={bert_tokenizer.cls_token_id}, [SEP]={bert_tokenizer.sep_token_id}, [PAD]={bert_tokenizer.pad_token_id}")
# BERT wraps every input: [CLS] + tokens + [SEP] + [PAD]...
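# Sketch: inspect the first batch element to see the special-token layout and
# where padding starts (pad length depends on the longest text in the batch)
first_tokens = bert_tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist())
print(f"First sequence: {first_tokens}")
print(f"Its mask: {encoded['attention_mask'][0].tolist()}")
# Expected to look like: ['[CLS]', 'i', 'love', 'this', 'movie', '!', '[SEP]', '[PAD]', ...]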
# Vocabulary sizes:
tokenizers_vocab = {
"BERT-base": 30_522, # WordPiece vocabulary
"GPT-2": 50_257, # BPE vocabulary
"LLaMA 3": 128_256, # larger vocabulary = better multilingual
"T5": 32_100, # SentencePiece
}
for name, size in tokenizers_vocab.items():
print(f" {name:15s}: {size:,} tokens")Tip
Tip
Practice tokenization in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Remember the modern NLP recipe: Transformer-based models, pre-trained on large corpora and then fine-tuned for the target task.
Practice Task
(1) Write a working tokenization example from scratch, converting text to token IDs, without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with tokenization is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
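As a minimal sketch of that kind of boundary check, reusing the bert_tokenizer loaded above, the snippet below shows what the tokenizer returns for an empty string and for a batch that contains one:
# Edge-case sketch: empty strings still tokenize, but only special tokens remain,
# and the attention mask exposes the padding added for the blank row
empty = bert_tokenizer("", return_tensors="pt")
print(empty["input_ids"])             # just the [CLS] and [SEP] token IDs
blank_batch = bert_tokenizer(["I love this movie!", ""], padding=True, return_tensors="pt")
print(blank_batch["attention_mask"])  # the blank row is mostly zeros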