Activation Functions — Why Networks Are Non-Linear
Without activation functions, a 100-layer network collapses into a single linear transformation. Activations introduce non-linearity, letting networks learn curves, XOR, image features, and language patterns. Choosing the right activation matters enormously.
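You can verify the collapse directly: stacking linear layers with no activation between them is numerically identical to a single matrix multiply. A minimal sketch, using bias-free nn.Linear layers with arbitrary sizes:

import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8, bias=False)
lin2 = nn.Linear(8, 3, bias=False)
x = torch.randn(5, 4)

stacked = lin2(lin1(x))                        # two "layers", no activation
collapsed = x @ lin1.weight.T @ lin2.weight.T  # one equivalent linear map
print(torch.allclose(stacked, collapsed, atol=1e-6))  # True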
All Key Activation Functions
import torch
import torch.nn.functional as F
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# ACTIVATION FUNCTIONS — properties and when to use each
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
x = torch.linspace(-3, 3, 7)
# 1. SIGMOID — squashes to (0,1)
# Use: binary classification OUTPUT layer only
# Problem: vanishing gradient (saturates at extremes → gradient ≈ 0)
sigmoid = torch.sigmoid(x)
print(f"Sigmoid: {sigmoid.tolist()}")
# [0.05, 0.12, 0.27, 0.50, 0.73, 0.88, 0.95]
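# Quick definition check (illustrative): sigmoid(x) = 1 / (1 + e^(-x))
manual_sigmoid = 1 / (1 + torch.exp(-x))
print(f"Matches definition: {torch.allclose(sigmoid, manual_sigmoid)}")  # True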
# 2. TANH — squashes to (-1,1), zero-centered
# Use: RNNs, LSTMs (still suffers vanishing gradient)
tanh = torch.tanh(x)
print(f"Tanh: {tanh.tolist()}")
# 3. ReLU — Rectified Linear Unit: max(0, x)
# Use: default for HIDDEN layers in CNNs and MLPs (since 2012)
# Pro: extremely fast, no vanishing gradient for positive inputs
# Problem: DYING ReLU — a neuron whose pre-activation is negative for every input outputs 0, gets zero gradient, and can die permanently
relu = F.relu(x)
print(f"ReLU: {relu.tolist()}")
# [0, 0, 0, 0, 1, 2, 3] ← everything < 0 becomes exactly 0
# 4. LEAKY ReLU — allows small negative gradient (prevents dying)
leaky = F.leaky_relu(x, negative_slope=0.01)
print(f"LeakyReLU: {leaky.tolist()}")
# 5. GELU — Gaussian Error Linear Unit
# Use: Transformers (BERT, GPT-2, and GPT-3 all use GELU)
# Smooth approximation of ReLU with probabilistic interpretation
gelu = F.gelu(x)
print(f"GELU: {gelu.tolist()}")
# 6. SWISH (SiLU) — x * sigmoid(x)
# Use: EfficientNet, more modern architectures
swish = F.silu(x) # silu = swish
print(f"Swish: {swish.tolist()}")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# THE VANISHING GRADIENT PROBLEM (Why sigmoid killed deep networks)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
x_extreme = torch.tensor([5.0]) # large positive value
sig_gradient = torch.sigmoid(x_extreme) * (1 - torch.sigmoid(x_extreme))
print(f"\nSigmoid gradient at x=5: {sig_gradient.item():.6f}")
# ≈ 0.006648 — nearly ZERO
# If every layer saturates like this, each multiplies the gradient by ~0.006
# Gradient reaching layer 1 = 0.006^10 ≈ 6e-23 → TOO SMALL TO LEARN
# This is a big reason deep sigmoid networks (more than a few layers) were nearly untrainable before ReLU
relu_gradient_pos = 1.0 # derivative of ReLU is exactly 1 for x > 0
print(f"ReLU gradient at x=5: {relu_gradient_pos}")
# = 1.0 → gradient flows perfectly through many layers
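# Illustrative sketch: let autograd chain 10 sigmoids and measure the
# gradient that survives back at the input
x0 = torch.tensor([5.0], requires_grad=True)
h = x0
for _ in range(10):
    h = torch.sigmoid(h)
h.backward()
print(f"Gradient through 10 sigmoid layers: {x0.grad.item():.2e}")
# on the order of 1e-8: almost no learning signal reaches the input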
# DECISION GUIDE:
# Hidden layers in MLP/CNN: ReLU (default) or GELU (modern)
# Transformer hidden layers: GELU or SwiGLU
# Binary output: Sigmoid
# Multi-class output: Softmax (not actually an activation — a normalizer)
# RNN/LSTM gates: Sigmoid + Tanh (built-in)
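To make the decision guide concrete, here is a minimal sketch of a binary classifier wired the way the guide suggests (layer sizes are arbitrary): ReLU in the hidden layer, sigmoid only at the output.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),        # hidden layer: ReLU, the default choice
    nn.Linear(32, 1),
    nn.Sigmoid(),     # output layer: sigmoid gives a probability in (0, 1)
)
probs = model(torch.randn(4, 16))
print(probs.squeeze(1).tolist())  # four values, each strictly between 0 and 1

In practice, most training code drops the final Sigmoid and uses nn.BCEWithLogitsLoss, which folds the sigmoid into the loss for better numerical stability.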
Tip
Practice these activation functions in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working activation-function comparison from scratch without looking at notes. (2) Modify it to handle an edge case (an empty tensor, NaN values, or an unexpected dtype). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with activation functions is skipping edge-case testing — empty inputs, NaN values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.