CLIP — Connecting Language and Vision
CLIP (Contrastive Language-Image Pre-training, 2021) learns joint embeddings in which images and text with the same meaning lie close together in a shared embedding space. Trained on 400M image-text pairs scraped from the internet, CLIP enables zero-shot image classification and semantic image search, and it provides the text conditioning used by Stable Diffusion.
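As a rough illustration of the contrastive objective behind this pre-training (a minimal sketch, not the original implementation): each batch of N matching image-text pairs is encoded, both sets of embeddings are L2-normalized, and a symmetric cross-entropy loss pulls matching pairs together while pushing mismatched pairs apart. The function name and the fixed temperature below are assumptions for illustration; the real model learns the temperature during training.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    '''Symmetric contrastive loss over a batch of matching image-text pairs (sketch).'''
    # Normalize both modalities to the unit sphere
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise cosine similarities scaled by temperature -> [N, N] logits
    logits = image_features @ text_features.T / temperature
    # The i-th image matches the i-th text, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Sanity check with random features for a batch of 8 pairs
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))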
CLIP Zero-Shot Classification and Image Search
import torch
import clip
from PIL import Image
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# CLIP -- zero-shot image classification
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def zero_shot_classify(image_path: str, class_labels: list[str]) -> dict:
    '''Classify an image using text descriptions -- NO task-specific training!'''
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

    # Create text prompts: "a photo of a {class}"
    texts = [f"a photo of a {label}" for label in class_labels]
    text_tokens = clip.tokenize(texts).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)      # [1, 512]
        text_features = model.encode_text(text_tokens)  # [N, 512]

    # Normalize to unit sphere
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between image and each text
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return {label: score.item() for label, score in zip(class_labels, similarity[0])}

# Works on ANY categories without retraining!
result = zero_shot_classify("cat_photo.jpg", ["cat", "dog", "bird", "car", "horse"])
for label, score in sorted(result.items(), key=lambda x: -x[1])[:3]:
    print(f"  {label}: {score:.2%}")
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# SEMANTIC IMAGE SEARCH
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
class CLIPImageSearch:
    '''Build a searchable database of images using CLIP embeddings.'''

    def __init__(self):
        self.model, self.preprocess = clip.load("ViT-B/32", device=device)
        self.model.eval()
        self.image_embeddings = []
        self.image_paths = []

    def index_images(self, image_paths: list[str]) -> None:
        '''Compute and store CLIP embeddings for all images.'''
        for path in image_paths:
            img = self.preprocess(Image.open(path)).unsqueeze(0).to(device)
            with torch.no_grad():
                emb = self.model.encode_image(img)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            self.image_embeddings.append(emb)
            self.image_paths.append(path)
        self.all_embeddings = torch.cat(self.image_embeddings, dim=0)

    def search(self, text_query: str, top_k: int = 5) -> list[tuple]:
        '''Find images matching a text description.'''
        tokens = clip.tokenize([text_query]).to(device)
        with torch.no_grad():
            text_emb = self.model.encode_text(tokens)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        similarities = (text_emb @ self.all_embeddings.T).squeeze(0)  # [N]
        top_indices = similarities.topk(top_k).indices
        return [(self.image_paths[i.item()], similarities[i].item()) for i in top_indices]

# Usage:
searcher = CLIPImageSearch()
# searcher.index_images(["img1.jpg", "img2.jpg", ...])  # thousands of images
# results = searcher.search("a dog playing in snow")
# for path, score in results:
#     print(f"{path}: {score:.3f}")
Tip
Practice CLIP in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working CLIP example (zero-shot classification or text-based image search) from scratch without looking at notes. (2) Modify it to handle an edge case (an empty label list, a missing image file, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with CLIP pipelines is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.
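For instance, the search method above fails if index_images was never called or if top_k exceeds the number of indexed images. A minimal sketch of such a guard around the zero-shot classifier (the wrapper name is hypothetical, not part of CLIP):

import os

def zero_shot_classify_safe(image_path: str, class_labels: list[str]) -> dict:
    '''Validate boundary conditions before calling the model.'''
    if not class_labels:
        raise ValueError("class_labels must contain at least one label")
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"image not found: {image_path}")
    return zero_shot_classify(image_path, class_labels)

# zero_shot_classify_safe("cat_photo.jpg", [])  # raises ValueError instead of a cryptic model error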