Transformers & Attention Mechanism
Deep dive into the Transformer architecture — the foundation of modern NLP — and understand self-attention, multi-head attention, and positional encoding.
Why Transformers Changed Everything
Before transformers, NLP relied on RNNs (Recurrent Neural Networks) and LSTMs that processed text sequentially — word by word. This was slow (no parallelization), struggled with long-range dependencies (forgetting information from 100 words earlier), and had vanishing gradient problems. The 2017 paper 'Attention Is All You Need' by Vaswani et al. introduced the Transformer, which processes all words simultaneously using self-attention. This enables massive parallelization (training on GPUs is dramatically faster), captures long-range dependencies easily (every word attends to every other word), and scales to billions of parameters. Transformers are now the foundation of virtually every state-of-the-art NLP model: BERT, GPT, T5, PaLM, LLaMA, and every large language model.
Transformer Architecture Components
- Self-Attention: Each word creates Query (Q), Key (K), and Value (V) vectors. The output is Attention(Q, K, V) = softmax(QK^T / √d_k) × V, where d_k is the key dimension — the softmax produces attention weights that score how relevant every other word is, and the weighted sum of Value vectors becomes the word's new representation
- Multi-Head Attention: Runs self-attention multiple times in parallel (typically 8-16 heads) — each head learns different relationship patterns (syntactic, semantic, positional)
- Positional Encoding: Since transformers process all words simultaneously, they need position information. Uses sine/cosine functions or learned embeddings to encode word position
- Feed-Forward Network: After attention, each position passes through a 2-layer MLP — adds non-linearity and processes individual representations
- Residual Connections: Add the input back to the output of each sublayer (attention, FFN) — enables deeper networks without vanishing gradients
- Layer Normalization: Normalizes across features for each position — stabilizes training and improves convergence
- Encoder-Decoder: Original transformer has both — encoder processes input, decoder generates output. BERT uses encoder-only, GPT uses decoder-only
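The self-attention formula above can be made concrete in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention; the projection matrices W_q, W_k, W_v and the toy dimensions are illustrative stand-ins for what a real transformer learns during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1: how much each word attends to every word
    return weights @ V, weights         # weighted sum of Value vectors, plus the weights

# Toy example: a "sentence" of 3 tokens with d_k = 4 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))            # token embeddings (hypothetical values)
# In a real transformer, W_q, W_k, W_v are learned projection matrices
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Multi-head attention simply runs this computation several times in parallel with independent W_q, W_k, W_v matrices per head, then concatenates the results.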
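The sine/cosine positional encoding mentioned above follows the formulas PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)) from 'Attention Is All You Need'. A short NumPy sketch (assuming an even d_model for simplicity):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position gets a unique
    pattern of sines and cosines at geometrically spaced frequencies."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

# Encodings for a 50-token sequence with 16-dimensional embeddings
pe = positional_encoding(seq_len=50, d_model=16)
```

These encodings are added to the word embeddings before the first attention layer, giving the model position information without any recurrence.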
Key Transformer Models
- BERT (2018): Bidirectional Encoder — pre-trained by predicting masked words and next sentence. Fine-tune for classification, NER, QA. 110M-340M parameters
- GPT-2/3/4 (2019-2023): Decoder-only, autoregressive — pre-trained to predict next word. GPT-3 has 175B parameters. Few-shot learning without fine-tuning
- T5 (2019): Text-to-Text Transfer Transformer — frames all NLP as text generation. 'Translate English to French: The house is blue' → 'La maison est bleue'
- RoBERTa (2019): Optimized BERT training — more data, longer training, dynamic masking. Consistently outperforms BERT on benchmarks
- LLaMA/Mistral (2023-2024): Open-source alternatives to GPT — LLaMA 2 (7B-70B) and Mistral (7B) achieve competitive performance with full transparency
- Gemini (2024): Google's multimodal model — processes text, images, audio, and video natively. Powers Google's AI products