CLIP — Connecting Language and Vision
CLIP (Contrastive Language-Image Pre-training, 2021) learns joint embeddings in which images and text with the same meaning lie close together in a shared embedding space. Trained on 400M image-text pairs scraped from the internet, CLIP enables zero-shot image classification and semantic image search, and it provides the text conditioning used by Stable Diffusion.
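As a rough illustration of the contrastive objective behind this pre-training (a minimal sketch, not the original implementation): each batch of N matching image-text pairs is encoded, both sets of embeddings are L2-normalized, and a symmetric cross-entropy loss pulls matching pairs together while pushing mismatched pairs apart. The function name and the fixed temperature below are assumptions for illustration; the real model learns the temperature during training.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    '''Symmetric contrastive loss over a batch of matching image-text pairs (sketch).'''
    # Normalize both modalities to the unit sphere
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise cosine similarities scaled by temperature -> [N, N] logits
    logits = image_features @ text_features.T / temperature
    # The i-th image matches the i-th text, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Sanity check with random features for a batch of 8 pairs
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))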
CLIP Zero-Shot Classification and Image Search
import torch
import clip
from PIL import Image
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# CLIP -- zero-shot image classification
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def zero_shot_classify(image_path: str, class_labels: list[str]) -> dict:
    '''Classify an image using text descriptions -- NO task-specific training!'''
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

    # Create text prompts: "a photo of a {class}"
    texts = [f"a photo of a {label}" for label in class_labels]
    text_tokens = clip.tokenize(texts).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)      # [1, 512]
        text_features = model.encode_text(text_tokens)  # [N, 512]

    # Normalize to unit sphere
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between image and each text
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return {label: score.item() for label, score in zip(class_labels, similarity[0])}

# Works on ANY categories without retraining!
result = zero_shot_classify("cat_photo.jpg", ["cat", "dog", "bird", "car", "horse"])
for label, score in sorted(result.items(), key=lambda x: -x[1])[:3]:
    print(f"  {label}: {score:.2%}")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# SEMANTIC IMAGE SEARCH
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
class CLIPImageSearch:
    '''Build a searchable database of images using CLIP embeddings.'''

    def __init__(self):
        self.model, self.preprocess = clip.load("ViT-B/32", device=device)
        self.model.eval()
        self.image_embeddings = []
        self.image_paths = []

    def index_images(self, image_paths: list[str]) -> None:
        '''Compute and store CLIP embeddings for all images.'''
        for path in image_paths:
            img = self.preprocess(Image.open(path)).unsqueeze(0).to(device)
            with torch.no_grad():
                emb = self.model.encode_image(img)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            self.image_embeddings.append(emb)
            self.image_paths.append(path)
        self.all_embeddings = torch.cat(self.image_embeddings, dim=0)

    def search(self, text_query: str, top_k: int = 5) -> list[tuple]:
        '''Find images matching a text description.'''
        tokens = clip.tokenize([text_query]).to(device)
        with torch.no_grad():
            text_emb = self.model.encode_text(tokens)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        similarities = (text_emb @ self.all_embeddings.T).squeeze(0)  # [N]
        top_indices = similarities.topk(top_k).indices
        return [(self.image_paths[i.item()], similarities[i].item()) for i in top_indices]

# Usage:
searcher = CLIPImageSearch()
# searcher.index_images(["img1.jpg", "img2.jpg", ...])  # thousands of images
# results = searcher.search("a dog playing in snow")
# for path, score in results:
#     print(f"{path}: {score:.3f}")
Tip
Practice CLIP in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working CLIP example (zero-shot classification or text-based image search) from scratch without looking at notes. (2) Modify it to handle an edge case (an empty label list, a missing image file, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with CLIP pipelines is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
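For instance, the search method above fails if index_images was never called or if top_k exceeds the number of indexed images. A minimal sketch of such a guard around the zero-shot classifier (the wrapper name is hypothetical, not part of CLIP):

import os

def zero_shot_classify_safe(image_path: str, class_labels: list[str]) -> dict:
    '''Validate boundary conditions before calling the model.'''
    if not class_labels:
        raise ValueError("class_labels must contain at least one label")
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"image not found: {image_path}")
    return zero_shot_classify(image_path, class_labels)

# zero_shot_classify_safe("cat_photo.jpg", [])  # raises ValueError instead of a cryptic model error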