Convolution — The Core Operation of Computer Vision
Convolution detects local patterns in an image (edges, textures, shapes) by sliding a small filter (kernel) across the image and computing dot products at each position. Unlike a fully connected layer that looks at everything at once, convolution exploits local connectivity and weight sharing — making it massively more efficient for images.
Convolution Operation — From Math to PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# What does a convolution kernel detect?
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# SOBEL EDGE DETECTOR — a hand-crafted kernel that detects horizontal edges
sobel_h = torch.tensor([
    [-1., -2., -1.],
    [ 0.,  0.,  0.],
    [ 1.,  2.,  1.]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1, 1, 3, 3]
# Apply: high activation where intensity changes vertically (horizontal edges)
# SOBEL EDGE DETECTOR — detects vertical edges
sobel_v = torch.tensor([
    [-1., 0., 1.],
    [-2., 0., 2.],
    [-1., 0., 1.]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0)
# In CNNs: kernels are LEARNED from data, not hand-crafted
# Early layers learn edge/gradient detectors automatically
# Deeper layers learn textures, object parts, full objects
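To see the Sobel kernel actually fire on an edge, here is a minimal sketch; the 8×8 synthetic image is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

sobel_h = torch.tensor([
    [-1., -2., -1.],
    [ 0.,  0.,  0.],
    [ 1.,  2.,  1.]
], dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1, 1, 3, 3]

# Synthetic image: top half dark (0), bottom half bright (1)
# → a single horizontal edge across the middle
img = torch.zeros(1, 1, 8, 8)
img[:, :, 4:, :] = 1.0

edges = F.conv2d(img, sobel_h)  # no padding → [1, 1, 6, 6]
print(edges[0, 0])
# Output rows 2 and 3 (kernels centered on the dark/bright boundary)
# are all 4.0; every other row is 0.0 — the kernel responds only
# where intensity changes vertically.
```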
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# nn.Conv2d — the fundamental CNN building block
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different kernels → 64 feature maps
    kernel_size=3,    # 3x3 kernel (most common)
    stride=1,         # move 1 pixel at a time
    padding=1,        # same padding: output H,W same as input H,W
    bias=False,       # use bias=False when followed by BatchNorm
)
print(f"Conv weights shape: {conv.weight.shape}") # [64, 3, 3, 3]
# 64 filters, each 3 channels (RGB) × 3×3 pixels = 64 × 3 × 9 = 1,728 params
# Input: batch of 4 RGB images, 224×224
x = torch.randn(4, 3, 224, 224)
out = conv(x)
print(f"Input: {x.shape}") # [4, 3, 224, 224]
print(f"Output: {out.shape}") # [4, 64, 224, 224] (same H,W due to padding=1)
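The spatial output size follows the standard formula out = ⌊(in + 2·padding − kernel) / stride⌋ + 1. A quick sketch checking it against PyTorch; the stride-2 conv below is an assumption for illustration:

```python
import torch
import torch.nn as nn

def conv_out_size(n: int, k: int, s: int, p: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# With kernel_size=3, padding=1, a stride of 2 halves the spatial size
conv_s2 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
x = torch.randn(4, 3, 224, 224)

print(conv_s2(x).shape)             # [4, 64, 112, 112]
print(conv_out_size(224, 3, 2, 1))  # 112 — matches
```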
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# POOLING — reduce spatial dimensions
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
maxpool = nn.MaxPool2d(kernel_size=2, stride=2) # halves H and W
avgpool = nn.AdaptiveAvgPool2d((1, 1)) # global average pool → [B, C, 1, 1]
after_pool = maxpool(out)
print(f"After MaxPool: {after_pool.shape}") # [4, 64, 112, 112]
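In a full network, the adaptive average pool is what bridges variable-size feature maps to a fixed-size classifier head; a minimal sketch, where the 1000-class linear head is an assumption:

```python
import torch
import torch.nn as nn

feat = torch.randn(4, 64, 112, 112)   # feature maps from a conv stack
gap = nn.AdaptiveAvgPool2d((1, 1))

pooled = gap(feat)                    # [4, 64, 1, 1] — regardless of input H, W
vec = torch.flatten(pooled, 1)        # [4, 64]
logits = nn.Linear(64, 1000)(vec)     # [4, 1000] class scores
print(pooled.shape, vec.shape, logits.shape)
```

Because the pool output is always 1×1 per channel, the same classifier head works for any input resolution.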
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# RECEPTIVE FIELD — how much of the input each neuron sees
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# After n conv layers with 3×3 kernels (no stride):
# Layer 1: receptive field = 3×3 (sees 3×3 pixels of input)
# Layer 2: receptive field = 5×5 (grows by 2 per layer: +1 on each side)
# Layer 5: receptive field = 11×11
# Layer 10: receptive field = 21×21
# Strided convolutions grow the receptive field faster
# WHY SIZE MATTERS:
# - Small receptive field → can only detect local textures
# - Large receptive field → can detect global structure, object parts
# By the end of ResNet-50's 50 layers, the receptive field covers the full input image
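The growth above follows a simple recurrence: each layer adds (k − 1) × jump to the receptive field, where jump is the product of all strides so far. A small sketch:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first.
    Returns the receptive field of one neuron in the last layer."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer adds (k-1) * current input spacing
        jump *= s             # stride multiplies the spacing between taps
    return rf

print(receptive_field([(3, 1)] * 2))   # 5  — matches Layer 2 above
print(receptive_field([(3, 1)] * 10))  # 21 — matches Layer 10 above
print(receptive_field([(3, 2)] * 5))   # 63 — strided convs grow RF much faster
```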
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# DEPTHWISE SEPARABLE CONVOLUTION — MobileNet trick
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Regular Conv: in_c × out_c × k × k parameters
# Depthwise Sep: in_c × k × k + in_c × out_c params (8-9x fewer)
class DepthwiseSeparableConv(nn.Module):
    """Efficient convolution used in MobileNet, EfficientNet."""

    def __init__(self, in_c: int, out_c: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_c, in_c, 3, stride=stride, padding=1, groups=in_c)
        self.pointwise = nn.Conv2d(in_c, out_c, 1)  # 1×1 conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
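A quick way to verify the parameter-count claim is to count both versions directly; in_c=64 and out_c=128 are illustrative assumptions:

```python
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

in_c, out_c = 64, 128
regular = nn.Conv2d(in_c, out_c, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(in_c, in_c, 3, padding=1, groups=in_c, bias=False),  # depthwise
    nn.Conv2d(in_c, out_c, 1, bias=False),                         # pointwise
)

print(n_params(regular))                        # 64 * 128 * 9 = 73728
print(n_params(separable))                      # 64 * 9 + 64 * 128 = 8768
print(n_params(regular) / n_params(separable))  # ≈ 8.4× fewer, as claimed
```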