Multi-Head Attention — Parallel Attention Subspaces
Multi-head attention runs h parallel attention heads, each in a lower-dimensional subspace. This allows the model to simultaneously attend to information from different representation subspaces at different positions — one head might track syntactic dependencies, another coreference, another semantic similarity.
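Formally, following the original Transformer formulation (Vaswani et al., 2017), each head applies attention to its own learned projections of the input, and the head outputs are concatenated and projected one final time:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

In the implementation below, the h per-head projections are fused into single embed_dim × embed_dim linear layers and separated again with a reshape.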
Multi-Head Attention Implementation
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """
    Split the embedding into H heads, run attention in parallel, concatenate.

    BERT-base: 12 heads, dim=768   → each head sees 64 dims
    GPT-3:     96 heads, dim=12288 → each head sees 128 dims
    """

    def __init__(self, embed_dim: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert embed_dim % n_heads == 0, "embed_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = embed_dim // n_heads  # dimension per head

        # ONE linear layer for all heads (more efficient than n_heads separate layers)
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> tuple[torch.Tensor, torch.Tensor]:
        B, T_q = q.shape[:2]
        _, T_k = k.shape[:2]

        # Project and reshape to separate heads
        # [B, T, D] → [B, T, H, d_k] → [B, H, T, d_k]
        Q = self.W_q(q).view(B, T_q, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, T_k, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, T_k, self.n_heads, self.d_k).transpose(1, 2)

        # Attention for ALL heads in parallel (batched matmul)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # scores: [B, H, T_q, T_k] — H independent attention matrices

        if mask is not None:
            # mask must broadcast to [B, H, T_q, T_k]; 0 marks positions to hide
            scores = scores.masked_fill(mask == 0, float("-inf"))

        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)

        # Weighted sum, then merge the heads back together
        out = torch.matmul(weights, V)            # [B, H, T_q, d_k]
        out = out.transpose(1, 2)                 # [B, T_q, H, d_k]
        out = out.contiguous().view(B, T_q, -1)   # [B, T_q, embed_dim] — concatenate heads
        return self.W_o(out), weights
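To make the view/transpose head-splitting concrete, here is a tiny worked example with hypothetical sizes (1 batch, 2 tokens, 2 heads, 3 dims per head):

x_toy = torch.arange(12.0).view(1, 2, 6)         # [B=1, T=2, D=6]
heads = x_toy.view(1, 2, 2, 3).transpose(1, 2)   # [B=1, H=2, T=2, d_k=3]
print(heads[0, 0])  # head 0 sees dims 0-2 of each token: [[0., 1., 2.], [6., 7., 8.]]
print(heads[0, 1])  # head 1 sees dims 3-5 of each token: [[3., 4., 5.], [9., 10., 11.]]

Each head therefore operates on its own contiguous slice of the embedding, and no information mixes across heads until W_o at the end.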
# Different types of attention in Transformers:
attention_types = {
    "Self-attention (encoder)": "Q=K=V=same input. Each position attends to all positions. Used in BERT's encoder.",
    "Causal self-attention (decoder)": "Q=K=V=same input. Each position attends only to PAST positions. Used in GPT.",
    "Cross-attention (encoder-decoder)": "Q from decoder, K and V from encoder. The decoder 'reads' the encoded input. Used in T5 and translation models.",
}
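These variants differ only in what is passed as q/k/v and in the mask. For the causal case, the standard trick is a lower-triangular mask, where 1 means "may attend" and 0 means "hidden"; a minimal sketch:

causal_mask = torch.tril(torch.ones(5, 5))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])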
mha = MultiHeadAttention(embed_dim=512, n_heads=8)
x = torch.randn(4, 20, 512)
# Self-attention
out, w = mha(x, x, x)
print(f"Multi-head self-attention output: {out.shape}") # [4, 20, 512]
print(f"Attention weights: {w.shape}") # [4, 8, 20, 20] — 8 heads, 20×20 attention matrixTip