Image Classification with CNNs
Understand how Convolutional Neural Networks (CNNs) process visual data and learn to build image classification models from scratch using PyTorch and TensorFlow.
How CNNs See Images
A Convolutional Neural Network (CNN) is specifically designed to process visual data by learning spatial hierarchies of features. Unlike fully connected networks, which flatten images into 1D vectors and lose spatial information, CNNs preserve the 2D structure through three key operations:
- Convolution: sliding filters (kernels) across the image to detect features such as edges, textures, and patterns
- Pooling: reducing spatial dimensions while keeping important features (max pooling takes the strongest activation in each region)
- Fully connected layers: combining the extracted features for final classification
Early layers detect low-level features (edges, corners, colors), middle layers detect mid-level features (textures, shapes, object parts), and deep layers detect high-level features (faces, objects, scenes). This hierarchical learning is what makes CNNs so powerful for vision tasks: they learn automatically which features matter, whereas traditional computer vision relied on hand-crafted features.
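The two spatial operations above can be sketched in plain Python (no framework). This is a toy illustration, not production code: the image, kernel, and sizes are made-up values, and real CNNs learn their kernels rather than using a fixed edge detector.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution: slide the kernel over the image (stride 1, no padding)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

def max_pool2x2(fmap):
    """2x2 max pooling, stride 2: keep the strongest activation in each region."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]

# A 6x6 image with a vertical edge: dark left half, bright right half
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# A hand-crafted vertical-edge kernel (Prewitt-style) standing in for a learned filter
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

fmap = conv2d(image, kernel)   # 4x4 feature map; activations peak along the edge
pooled = max_pool2x2(fmap)     # 2x2 map: half the width and height, edge response kept
```

Note how the feature map responds only where the kernel straddles the edge, and how pooling shrinks the map while preserving that response.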
CNN Architecture Components
- Convolutional Layer: Applies learnable filters (e.g., 3×3 or 5×5) across the image — each filter detects a specific feature. A layer with 64 filters learns 64 different features
- Activation (ReLU): Introduces non-linearity — ReLU(x) = max(0, x). Without activation, stacking layers would be equivalent to a single linear transformation
- Pooling Layer: Reduces spatial dimensions (typically 2×2 max pooling) — halves width and height while retaining dominant features. Provides translation invariance
- Batch Normalization: Normalizes layer outputs to stabilize training — reduces internal covariate shift, allows higher learning rates, acts as light regularization
- Dropout: Randomly zeroes neurons during training (typically 25-50%) — prevents overfitting by forcing the network to learn redundant representations rather than relying on any single neuron
- Fully Connected Layer: Flattens feature maps and classifies — the final FC layer has as many neurons as classes (e.g., 1000 for ImageNet)
- Softmax: Converts final layer outputs to probabilities — each class gets a probability between 0 and 1, all summing to 1
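Two of the components above are simple enough to compute by hand. Here is a minimal plain-Python sketch of ReLU and softmax with made-up activation values:

```python
import math

def relu(xs):
    """ReLU(x) = max(0, x), applied elementwise."""
    return [max(0.0, x) for x in xs]

def softmax(logits):
    """Convert final-layer outputs (logits) to probabilities that sum to 1.
    Subtracting max(logits) first is the standard numerical-stability trick."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu([-1.5, 0.3, 2.0])   # negatives are zeroed: [0.0, 0.3, 2.0]
probs = softmax([2.0, 1.0, 0.1])  # toy 3-class logits -> a probability per class
```

The largest logit gets the largest probability, and the probabilities always sum to 1 — which is exactly what a classification loss like cross-entropy expects.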
Famous CNN Architectures
- LeNet-5 (1998): The pioneer — a 7-layer network designed for handwritten digit recognition. Simple but established the CNN paradigm
- AlexNet (2012): Won ImageNet with 15.3% top-5 error — proved deep CNNs work at scale. Used ReLU, dropout, and GPU training
- VGGNet (2014): Showed deeper is better — 16-19 layers using only 3×3 convolutions. Simple, uniform architecture
- GoogLeNet/Inception (2014): Introduced Inception modules — parallel convolutions at multiple scales. 22 layers but fewer parameters than AlexNet
- ResNet (2015): Solved the vanishing gradient problem with skip connections — enabled networks of 152+ layers and reached 3.57% top-5 error on ImageNet
- EfficientNet (2019): Jointly scales depth, width, and input resolution via compound scaling — achieved state-of-the-art accuracy with far fewer parameters than prior networks
- Vision Transformer (ViT, 2020): Applied transformer architecture to images — splits images into patches and processes them like text tokens
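The skip connection that made ResNet possible is easy to show in miniature. In the sketch below, a toy elementwise affine transform stands in for a real conv + batch-norm layer (the weights and inputs are made up); the key line is the addition of the input back onto the block's output, F(x) + x:

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def layer(xs, weight, bias):
    # Stand-in for conv + batch norm: a toy elementwise affine transform
    return [weight * x + bias for x in xs]

def residual_block(x, weight, bias):
    fx = relu(layer(x, weight, bias))          # F(x): the block's learned transform
    return [f + xi for f, xi in zip(fx, x)]    # F(x) + x: the skip connection

# Even if the layer learns to output ~0 (weight=0, bias=0), the block
# passes x through unchanged — an identity mapping. This is what keeps
# gradients flowing through very deep stacks of such blocks.
out = residual_block([1.0, -2.0, 0.5], weight=0.0, bias=0.0)
```

Because the block only has to learn the residual F(x) = H(x) − x rather than the full mapping H(x), adding more layers can no longer make the network strictly worse.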