RNN, LSTM & GRU — Sequence Models
Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state that carries information from one time step to the next. LSTMs mitigate the vanishing-gradient problem of vanilla RNNs with three gating mechanisms (forget, input, and output gates). GRUs are a streamlined variant with two gates (reset and update): faster to train and often just as accurate.
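Before the gated variants, it helps to see the baseline. A minimal vanilla RNN in PyTorch, as a sketch (the sizes here are illustrative):
import torch
import torch.nn as nn

# Vanilla RNN update: h_t = tanh(W_ih · x_t + b_ih + W_hh · h_{t-1} + b_hh)
rnn = nn.RNN(input_size=300, hidden_size=256, batch_first=True)
x = torch.randn(8, 50, 300)  # [batch, seq_len, features]
output, h_n = rnn(x)         # output: [8, 50, 256], h_n: [1, 8, 256]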
LSTM for Sequence Processing
import torch
import torch.nn as nn
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LSTM — Long Short-Term Memory
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Three gates control information flow:
# Forget gate: how much of previous cell state to forget
# Input gate: how much new information to add to cell state
# Output gate: what part of cell state to output as hidden state
# Cell state (c_t): carries long-term memory — the "highway"
# Hidden state (h_t): short-term output — passed to next step and as output
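# Per-step updates (σ = sigmoid, ⊙ = elementwise product):
#   f_t = σ(W_f·[h_{t-1}, x_t] + b_f)   forget gate
#   i_t = σ(W_i·[h_{t-1}, x_t] + b_i)   input gate
#   o_t = σ(W_o·[h_{t-1}, x_t] + b_o)   output gate
#   c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·[h_{t-1}, x_t] + b_c)
#   h_t = o_t ⊙ tanh(c_t)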
lstm = nn.LSTM(
    input_size=300,      # embedding dimension
    hidden_size=256,     # hidden state dimension
    num_layers=2,        # stacked LSTM layers
    batch_first=True,    # input: [batch, seq_len, features]
    dropout=0.3,         # applied between LSTM layers (not last layer)
    bidirectional=True,  # process sequence both forward AND backward
)
# A bidirectional LSTM doubles the output feature dimension: the forward and backward hidden states are concatenated
# Output dim: 256 * 2 = 512
# Input sequence
batch_size, seq_len, input_size = 32, 100, 300
x = torch.randn(batch_size, seq_len, input_size) # [32, 100, 300]
output, (h_n, c_n) = lstm(x)
# output: [32, 100, 512] — all hidden states for each timestep
# h_n: [4, 32, 256] — final hidden states (num_layers * 2 directions, batch, hidden)
# c_n: [4, 32, 256] — final cell states
print(f"LSTM output: {output.shape}") # [32, 100, 512]
print(f"Final hidden: {h_n.shape}") # [4, 32, 256]
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# LSTM Sentiment Classifier — full architecture
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
class LSTMSentimentClassifier(nn.Module):
    """
    Text → Embedding → BiLSTM → Max pooling → Classifier
    Many-to-one: read whole sequence, classify at the end.
    """
    def __init__(self, vocab_size: int, embed_dim: int = 300,
                 hidden_dim: int = 128, n_layers: int = 2,
                 n_classes: int = 2, dropout: float = 0.3,
                 pad_idx: int = 0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, n_layers,
            batch_first=True, dropout=dropout, bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim * 2, n_classes)  # *2 for bidirectional
    def forward(self, token_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_ids: [B, seq_len], attention_mask: [B, seq_len]
        embeddings = self.dropout(self.embedding(token_ids))  # [B, seq_len, embed_dim]
        # Pack padded sequence — tells LSTM to ignore padding positions
        lengths = attention_mask.sum(dim=1).long().cpu()
        packed = nn.utils.rnn.pack_padded_sequence(
            embeddings, lengths, batch_first=True, enforce_sorted=False
        )
        packed_output, (h_n, _) = self.lstm(packed)
        # total_length pads the output back to seq_len; without it,
        # pad_packed_sequence only pads to the longest sequence in the batch
        output, _ = nn.utils.rnn.pad_packed_sequence(
            packed_output, batch_first=True, total_length=token_ids.size(1)
        )
        # output: [B, seq_len, hidden*2]
        # Fill padding positions with -inf before pooling: a 0 from padding
        # would otherwise beat genuine negative activations in the max
        output = output.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        # Max pooling over time dimension — captures most important features
        pooled = output.max(dim=1).values  # [B, hidden*2]
        return self.classifier(self.dropout(pooled))
model = LSTMSentimentClassifier(vocab_size=30_000, n_classes=2)
dummy_ids = torch.randint(1, 30_000, (8, 64))
dummy_mask = torch.ones(8, 64)
out = model(dummy_ids, dummy_mask)
print(f"\nSentiment logits: {out.shape}") # [8, 2]
# WHY TRANSFORMERS BEAT LSTMs:
# LSTM: sequential computation (step 1 → step 2 → ... → step N) → cannot be parallelized across time steps
# Transformer: ALL positions computed in PARALLEL → far better GPU utilization
# LSTM: struggles with very long dependencies (>200 tokens)
# Transformer: attention directly connects any two positions
Tip: Practice RNN, LSTM, and GRU sequence models in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task: (1) Write a working example of an RNN, LSTM, or GRU sequence model from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning: A common mistake with RNN, LSTM, and GRU sequence models is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
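For example, pack_padded_sequence raises an error when any sequence length is 0, so a batch item that is all padding would crash the classifier above. A minimal guard, as a sketch (clamping to length 1 is one common workaround; the clamped step then only sees the zero-valued padding embedding):
all_pad_mask = torch.zeros(1, 64)                      # one item that is entirely padding
lengths = all_pad_mask.sum(dim=1).long().clamp(min=1)  # 0 → 1 keeps packing valid
print(lengths)  # tensor([1])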