DBSCAN — Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are densely packed together and labels points in low-density regions as outliers (label -1). Unlike K-Means, DBSCAN (1) does not need the number of clusters K specified in advance, (2) finds arbitrarily shaped clusters, and (3) detects outliers automatically. Its two parameters are eps (the neighborhood radius) and min_samples (the density threshold).
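To make the mechanism concrete, here is a minimal from-scratch sketch of the idea (roughly O(n^2), for illustration only; dbscan_sketch is not part of scikit-learn, whose DBSCAN is what you should use in practice):

import numpy as np

def dbscan_sketch(X, eps, min_samples):
    n = len(X)
    labels = np.full(n, -1)                     # everything starts as noise (-1)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_samples:
            continue                            # already assigned, or not a core point
        labels[i] = cluster                     # start a new cluster at this core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:                 # unclaimed point: pull it into the cluster
                labels[j] = cluster
                if len(neighbors[j]) >= min_samples:
                    queue.extend(neighbors[j])  # core points keep expanding the cluster
        cluster += 1
    return labels

Points that are never reached from a core point keep the -1 label, which is exactly the outlier flag that sklearn's DBSCAN reports.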
DBSCAN — Epsilon, min_samples, and Outlier Detection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_moons, make_blobs
np.random.seed(42)
# DATASETS WHERE DBSCAN SHINES
X_moons, _ = make_moons(n_samples=400, noise=0.08, random_state=42)
X_blobs, _ = make_blobs(n_samples=400, centers=[[-2,-2],[2,2],[0,4]], cluster_std=0.5, random_state=42)
# Add outliers to blobs
outliers = np.random.uniform(-5, 7, (30, 2))
X_with_noise = np.vstack([X_blobs, outliers])
scaler = StandardScaler()
X_moons_sc = scaler.fit_transform(X_moons)
X_noise_sc = scaler.fit_transform(X_with_noise)
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
# ROW 1: K-Means vs DBSCAN on Moons
labels_km = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons_sc)
labels_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons_sc)
# K-Means cuts the moons with a straight boundary; DBSCAN follows the density.
# The third panel repeats the DBSCAN result to fill the 3-column grid.
for ax, labels, title in zip(axes[0], [labels_km, labels_db, labels_db],
                             ["K-Means (fails)", "DBSCAN (works!)", "DBSCAN (same)"]):
    ax.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, cmap="tab10", s=25, alpha=0.8)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()
    ax.set_title(f"{title}\n{n_clusters} clusters, {n_noise} outliers")
# ROW 2: Effect of epsilon on clustering
for ax, eps in zip(axes[1], [0.1, 0.3, 0.7]):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_noise_sc)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise = labels == -1
    # Plot clustered points with the colormap and noise points separately in gray,
    # so the outliers really are gray as the title says
    ax.scatter(X_with_noise[~noise, 0], X_with_noise[~noise, 1],
               c=labels[~noise], cmap="tab10", s=25, alpha=0.8)
    ax.scatter(X_with_noise[noise, 0], X_with_noise[noise, 1],
               c="gray", s=25, alpha=0.8)
    ax.set_title(f"DBSCAN eps={eps}\n{n_clusters} clusters, {noise.sum()} outliers (gray)")
plt.tight_layout()
plt.savefig("dbscan.png", dpi=100, bbox_inches="tight")
plt.show()
# CHOOSING EPSILON: k-distance plot
from sklearn.neighbors import NearestNeighbors
k = 5  # same as min_samples
# kneighbors on the training data returns each point as its own nearest
# neighbor (distance 0), so ask for k+1 neighbors and take column k
nn = NearestNeighbors(n_neighbors=k + 1)
nn.fit(X_noise_sc)
distances, _ = nn.kneighbors(X_noise_sc)
distances_sorted = np.sort(distances[:, k])[::-1]
plt.figure(figsize=(8, 4))
plt.plot(distances_sorted, linewidth=2)
plt.ylabel("5th nearest neighbor distance")
plt.xlabel("Points (sorted by distance)")
plt.title("K-Distance Plot -- epsilon = the 'elbow'")
plt.axhline(0.3, color="red", linestyle="--", label="Suggested epsilon=0.3")
plt.legend()
plt.tight_layout()
plt.savefig("eps_selection.png", dpi=100, bbox_inches="tight")
plt.show()
print("DBSCAN parameter guide:")
params = {
    "eps": "Neighborhood radius. Use the elbow of the k-distance plot as a starting point.",
    "min_samples": "Min points within eps for a 'core' point. Rule of thumb: 2*n_features, up to ~20 for noisy data.",
    "metric": "Default 'euclidean'. Use 'cosine' for text, 'precomputed' for a custom distance matrix.",
}
for param, tip in params.items():
    print(f"  {param:14s}: {tip}")
Tip
Practice DBSCAN density-based clustering in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of DBSCAN density-based clustering from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with DBSCAN density-based clustering is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
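A minimal validation sketch along those lines, assuming scikit-learn input conventions; the safe_dbscan helper is hypothetical, not part of any library:

import numpy as np
from sklearn.cluster import DBSCAN

def safe_dbscan(X, eps=0.3, min_samples=5):
    # Hypothetical guard helper: fail loudly on the edge cases listed above.
    X = np.asarray(X, dtype=float)          # non-numeric data types raise here
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError("X must be a non-empty 2D array")
    if np.isnan(X).any():
        raise ValueError("X contains NaN/null values; impute or drop them first")
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)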