K-Means Clustering — Grouping Without Labels
K-Means partitions data into K clusters by iteratively assigning each point to its nearest centroid and recomputing centroids until convergence. The algorithm minimizes within-cluster sum of squares (inertia). K must be chosen in advance — use the Elbow method or Silhouette score. K-Means assumes spherical clusters of similar size; it fails on elongated or non-convex shapes.
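The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's optimized implementation; the `kmeans` function name, the fixed iteration count, and the random initialization from data points are all choices made here for brevity.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    # Minimal K-Means sketch: pick k random data points as initial
    # centroids, then alternate assignment and centroid updates.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):  # guard against an empty cluster
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs: the loop should recover one centroid per blob.
rng_data = np.random.default_rng(1)
X = np.vstack([rng_data.normal(size=(50, 2)),
               rng_data.normal(size=(50, 2)) + 8])
labels, centroids = kmeans(X, k=2)
```

On well-separated, roughly spherical blobs like these, a handful of iterations suffices; the failure modes mentioned above (elongated or non-convex shapes) come from the Euclidean nearest-centroid assignment, which this sketch makes explicit.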
K-Means — Elbow Method, Silhouette, and Limitations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.datasets import make_blobs, make_moons
np.random.seed(42)
# GENERATE BLOBS
X_blobs, y_true = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)
# SCALE FEATURES BEFORE CLUSTERING
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_blobs)
# FIND OPTIMAL K: ELBOW METHOD
inertias = []
silhouette_scores = []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
axes[0].plot(k_range, inertias, "bo-", linewidth=2, markersize=6)
axes[0].set_xlabel("Number of Clusters (K)")
axes[0].set_ylabel("Inertia (within-cluster SS)")
axes[0].set_title("Elbow Method -- look for the 'elbow'")
axes[0].axvline(4, color="red", linestyle="--", alpha=0.7, label="True K=4")
axes[0].legend()
axes[1].plot(k_range, silhouette_scores, "ro-", linewidth=2, markersize=6)
axes[1].set_xlabel("Number of Clusters (K)")
axes[1].set_ylabel("Silhouette Score (higher = better)")
axes[1].set_title("Silhouette Score -- peak = optimal K")
axes[1].axvline(k_range[np.argmax(silhouette_scores)], color="red", linestyle="--",
                label=f"Best K={k_range[np.argmax(silhouette_scores)]}")
axes[1].legend()
plt.tight_layout()
plt.savefig("kmeans_elbow.png", dpi=100, bbox_inches="tight")
plt.show()
# FIT OPTIMAL K-MEANS
km_final = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km_final.fit_predict(X_scaled)
print(f"K-Means (K=4): Silhouette = {silhouette_score(X_scaled, labels):.4f}")
print(f"Cluster sizes: {pd.Series(labels).value_counts().sort_index().to_dict()}")
# CLUSTER STATISTICS
cluster_df = pd.DataFrame(X_blobs, columns=["feature_1", "feature_2"])
cluster_df["cluster"] = labels
print("\nCluster centers (original scale):")
print(cluster_df.groupby("cluster")[["feature_1", "feature_2"]].mean().round(2))
# WHERE K-MEANS FAILS -- non-spherical shapes
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_moons_sc = StandardScaler().fit_transform(X_moons)
labels_moons = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons_sc)
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap="RdBu", s=30, alpha=0.8)
ax.set_title(f"K-Means on Moons -- fails!\nSilhouette={silhouette_score(X_moons_sc, labels_moons):.3f}")
plt.tight_layout()
plt.savefig("kmeans_fails.png", dpi=100, bbox_inches="tight")
plt.show()
Tip
Practice K-Means clustering in small, isolated examples before integrating it into larger projects. Breaking concepts into focused experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working K-Means clustering example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, NaN values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with K-Means clustering is skipping edge-case testing: empty inputs, NaN values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
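One way to sketch that kind of boundary-condition check is a small validation wrapper around the fit. The `safe_kmeans` name and its error messages are illustrative choices, not part of scikit-learn's API.

```python
import numpy as np
from sklearn.cluster import KMeans

def safe_kmeans(X, n_clusters, random_state=42):
    # Illustrative guard rails before clustering: validate shape,
    # emptiness, NaN values, and that K does not exceed the sample count.
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError("X must be a non-empty 2-D array")
    if np.isnan(X).any():
        raise ValueError("X contains NaN values; impute or drop them first")
    if n_clusters > X.shape[0]:
        raise ValueError("n_clusters cannot exceed the number of samples")
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return km.fit_predict(X)

# Valid input clusters normally; invalid inputs fail fast with a clear message.
labels = safe_kmeans(np.vstack([np.zeros((5, 2)), np.ones((5, 2)) + 5]),
                     n_clusters=2)
```

Failing fast with a descriptive `ValueError` is usually preferable to letting a NaN propagate silently into the distance computations, where it surfaces later as a cryptic error or a meaningless clustering.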