Multi-Head Attention — Parallel Attention Subspaces
Multi-head attention runs h parallel attention heads, each in a lower-dimensional subspace. This allows the model to simultaneously attend to information from different representation subspaces at different positions — one head might track syntactic dependencies, another coreference, another semantic similarity.
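Formally, following the original Transformer formulation (Vaswani et al., 2017), each head applies attention to its own learned projections of the input, and the head outputs are concatenated and projected one final time:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$

In the implementation below, the h per-head projections are fused into single embed_dim × embed_dim linear layers and separated again with a reshape.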
Multi-Head Attention Implementation
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """
    Split the embedding into H heads, run attention in parallel, concatenate.

    BERT-base: 12 heads, dim=768   → each head sees 64 dims
    GPT-3:     96 heads, dim=12288 → each head sees 128 dims
    """

    def __init__(self, embed_dim: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert embed_dim % n_heads == 0, "embed_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = embed_dim // n_heads  # dimension per head

        # ONE linear layer for all heads (more efficient than n_heads separate layers)
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> tuple[torch.Tensor, torch.Tensor]:
        B, T_q = q.shape[:2]
        _, T_k = k.shape[:2]

        # Project and reshape to separate heads
        # [B, T, D] → [B, T, H, d_k] → [B, H, T, d_k]
        Q = self.W_q(q).view(B, T_q, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(B, T_k, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(B, T_k, self.n_heads, self.d_k).transpose(1, 2)

        # Attention for ALL heads in parallel (batched matmul)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # scores: [B, H, T_q, T_k] — H independent attention matrices

        if mask is not None:
            # mask must broadcast to [B, H, T_q, T_k]; 0 marks positions to hide
            scores = scores.masked_fill(mask == 0, float("-inf"))

        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)

        # Weighted sum, then merge the heads back together
        out = torch.matmul(weights, V)            # [B, H, T_q, d_k]
        out = out.transpose(1, 2)                 # [B, T_q, H, d_k]
        out = out.contiguous().view(B, T_q, -1)   # [B, T_q, embed_dim] — concatenate heads
        return self.W_o(out), weights
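To make the view/transpose head-splitting concrete, here is a tiny worked example with hypothetical sizes (1 batch, 2 tokens, 2 heads, 3 dims per head):

x_toy = torch.arange(12.0).view(1, 2, 6)         # [B=1, T=2, D=6]
heads = x_toy.view(1, 2, 2, 3).transpose(1, 2)   # [B=1, H=2, T=2, d_k=3]
print(heads[0, 0])  # head 0 sees dims 0-2 of each token: [[0., 1., 2.], [6., 7., 8.]]
print(heads[0, 1])  # head 1 sees dims 3-5 of each token: [[3., 4., 5.], [9., 10., 11.]]

Each head therefore operates on its own contiguous slice of the embedding, and no information mixes across heads until W_o at the end.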
# Different types of attention in Transformers:
attention_types = {
    "Self-attention (encoder)": "Q=K=V=same input. Each position attends to all positions. Used in BERT's encoder.",
    "Causal self-attention (decoder)": "Q=K=V=same input. Each position attends only to PAST positions. Used in GPT.",
    "Cross-attention (encoder-decoder)": "Q from decoder, K and V from encoder. The decoder 'reads' the encoded input. Used in T5 and translation models.",
}
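These variants differ only in what is passed as q/k/v and in the mask. For the causal case, the standard trick is a lower-triangular mask, where 1 means "may attend" and 0 means "hidden"; a minimal sketch:

causal_mask = torch.tril(torch.ones(5, 5))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])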
mha = MultiHeadAttention(embed_dim=512, n_heads=8)
x = torch.randn(4, 20, 512)
# Self-attention
out, w = mha(x, x, x)
print(f"Multi-head self-attention output: {out.shape}") # [4, 20, 512]
print(f"Attention weights: {w.shape}") # [4, 8, 20, 20] — 8 heads, 20×20 attention matrixTip