RNN, LSTM & GRU — Sequence Models
Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state that carries information from one time step to the next. LSTMs mitigate the vanishing-gradient problem of vanilla RNNs with three gating mechanisms (forget, input, and output gates). GRUs are a streamlined variant with two gates (reset and update): faster to train and often just as accurate.
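Before the gated variants, it helps to see the baseline. A minimal vanilla RNN in PyTorch, as a sketch (the sizes here are illustrative):
import torch
import torch.nn as nn

# Vanilla RNN update: h_t = tanh(W_ih · x_t + b_ih + W_hh · h_{t-1} + b_hh)
rnn = nn.RNN(input_size=300, hidden_size=256, batch_first=True)
x = torch.randn(8, 50, 300)  # [batch, seq_len, features]
output, h_n = rnn(x)         # output: [8, 50, 256], h_n: [1, 8, 256]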
LSTM for Sequence Processing
import torch
import torch.nn as nn
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LSTM — Long Short-Term Memory
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Three gates control information flow:
# Forget gate: how much of previous cell state to forget
# Input gate: how much new information to add to cell state
# Output gate: what part of cell state to output as hidden state
# Cell state (c_t): carries long-term memory — the "highway"
# Hidden state (h_t): short-term output — passed to next step and as output
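# Per-step updates (σ = sigmoid, ⊙ = elementwise product):
#   f_t = σ(W_f·[h_{t-1}, x_t] + b_f)   forget gate
#   i_t = σ(W_i·[h_{t-1}, x_t] + b_i)   input gate
#   o_t = σ(W_o·[h_{t-1}, x_t] + b_o)   output gate
#   c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·[h_{t-1}, x_t] + b_c)
#   h_t = o_t ⊙ tanh(c_t)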
lstm = nn.LSTM(
    input_size=300,      # embedding dimension
    hidden_size=256,     # hidden state dimension
    num_layers=2,        # stacked LSTM layers
    batch_first=True,    # input: [batch, seq_len, features]
    dropout=0.3,         # applied between LSTM layers (not last layer)
    bidirectional=True,  # process sequence both forward AND backward
)
# A bidirectional LSTM doubles the output feature dimension: the forward and backward hidden states are concatenated
# Output dim: 256 * 2 = 512
# Input sequence
batch_size, seq_len, input_size = 32, 100, 300
x = torch.randn(batch_size, seq_len, input_size) # [32, 100, 300]
output, (h_n, c_n) = lstm(x)
# output: [32, 100, 512] — all hidden states for each timestep
# h_n: [4, 32, 256] — final hidden states (num_layers * 2 directions, batch, hidden)
# c_n: [4, 32, 256] — final cell states
print(f"LSTM output: {output.shape}") # [32, 100, 512]
print(f"Final hidden: {h_n.shape}") # [4, 32, 256]
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LSTM Sentiment Classifier — full architecture
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
class LSTMSentimentClassifier(nn.Module):
    """
    Text → Embedding → BiLSTM → Max pooling → Classifier
    Many-to-one: read whole sequence, classify at the end.
    """
    def __init__(self, vocab_size: int, embed_dim: int = 300,
                 hidden_dim: int = 128, n_layers: int = 2,
                 n_classes: int = 2, dropout: float = 0.3,
                 pad_idx: int = 0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, n_layers,
            batch_first=True, dropout=dropout, bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim * 2, n_classes)  # *2 for bidirectional
    def forward(self, token_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_ids: [B, seq_len], attention_mask: [B, seq_len]
        embeddings = self.dropout(self.embedding(token_ids))  # [B, seq_len, embed_dim]
        # Pack padded sequence — tells LSTM to ignore padding positions
        lengths = attention_mask.sum(dim=1).long().cpu()
        packed = nn.utils.rnn.pack_padded_sequence(
            embeddings, lengths, batch_first=True, enforce_sorted=False
        )
        packed_output, (h_n, _) = self.lstm(packed)
        # total_length pads the output back to seq_len; without it,
        # pad_packed_sequence only pads to the longest sequence in the batch
        output, _ = nn.utils.rnn.pad_packed_sequence(
            packed_output, batch_first=True, total_length=token_ids.size(1)
        )
        # output: [B, seq_len, hidden*2]
        # Fill padding positions with -inf before pooling: a 0 from padding
        # would otherwise beat genuine negative activations in the max
        output = output.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        # Max pooling over time dimension — captures most important features
        pooled = output.max(dim=1).values  # [B, hidden*2]
        return self.classifier(self.dropout(pooled))
model = LSTMSentimentClassifier(vocab_size=30_000, n_classes=2)
dummy_ids = torch.randint(1, 30_000, (8, 64))
dummy_mask = torch.ones(8, 64)
out = model(dummy_ids, dummy_mask)
print(f"\nSentiment logits: {out.shape}") # [8, 2]
# WHY TRANSFORMERS BEAT LSTMs:
# LSTM: sequential computation (step 1 → step 2 → ... → step N) → cannot be parallelized across time steps
# Transformer: ALL positions computed in PARALLEL → far better GPU utilization
# LSTM: struggles with very long dependencies (>200 tokens)
# Transformer: attention directly connects any two positions
Tip: Practice RNN, LSTM, and GRU sequence models in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task: (1) Write a working example of an RNN, LSTM, or GRU sequence model from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning: A common mistake with RNN, LSTM, and GRU sequence models is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
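For example, pack_padded_sequence raises an error when any sequence length is 0, so a batch item that is all padding would crash the classifier above. A minimal guard, as a sketch (clamping to length 1 is one common workaround; the clamped step then only sees the zero-valued padding embedding):
all_pad_mask = torch.zeros(1, 64)                      # one item that is entirely padding
lengths = all_pad_mask.sum(dim=1).long().clamp(min=1)  # 0 → 1 keeps packing valid
print(lengths)  # tensor([1])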