Gaussian Mixture Models — Soft Clustering
K-Means assigns each point to exactly one cluster (hard assignment). Gaussian Mixture Models (GMMs) instead assign each point a probability of belonging to every cluster (soft assignment). A GMM models each cluster as a Gaussian distribution and is fit with the Expectation-Maximization (EM) algorithm. Because a fitted GMM defines a full probability density, it also works as a density estimator, which is useful for generative modeling and for anomaly detection via low-likelihood thresholds.
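Under the hood, EM alternates two steps: the E-step computes each point's responsibilities (its probability of belonging to each component) under the current parameters, and the M-step re-estimates the means, variances, and mixing weights from those responsibilities. A minimal one-dimensional, two-component sketch of this loop (illustrative only, not sklearn's implementation):

```python
import numpy as np

# Synthetic 1-D data from two well-separated Gaussians
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 200)])

# Initial guesses for means, variances, and mixing weights
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(component k | x_i)
    dens = pi * gauss_pdf(x[:, None], mu, var)      # shape (n, 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from responsibility-weighted data
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

print(np.sort(mu))  # means should approach the true values -2 and 3
```

Each iteration never decreases the data log-likelihood, which is why EM converges (though possibly to a local optimum, hence sklearn's `n_init` parameter).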
Gaussian Mixture Models vs K-Means
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
np.random.seed(42)
# ELONGATED CLUSTERS -- GMM handles, K-Means struggles
X, y_true = make_blobs(n_samples=400, centers=3, cluster_std=[1.5, 0.5, 1.0], random_state=42)
# Rotate to make elongated
rotation = np.array([[0.8, -0.6], [0.6, 0.8]])
X_rot = X @ rotation
X_sc = StandardScaler().fit_transform(X_rot)
# K-MEANS
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km_labels = km.fit_predict(X_sc)
# GAUSSIAN MIXTURE MODEL
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42, n_init=5)
gmm_labels = gmm.fit_predict(X_sc)
gmm_probs = gmm.predict_proba(X_sc) # soft assignments -- probabilities per cluster
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Plot K-Means
axes[0].scatter(X_sc[:, 0], X_sc[:, 1], c=km_labels, cmap="tab10", s=25, alpha=0.7)
km_centers = km.cluster_centers_
axes[0].scatter(km_centers[:, 0], km_centers[:, 1], c="red", marker="X", s=200, zorder=5)
axes[0].set_title("K-Means (hard assignment)")
# Plot GMM
axes[1].scatter(X_sc[:, 0], X_sc[:, 1], c=gmm_labels, cmap="tab10", s=25, alpha=0.7)
axes[1].set_title("GMM (hard assignment from soft)")
# Plot GMM soft probabilities (color = confidence in cluster 0)
sc = axes[2].scatter(X_sc[:, 0], X_sc[:, 1], c=gmm_probs[:, 0], cmap="coolwarm", s=25, alpha=0.7)
plt.colorbar(sc, ax=axes[2], label="P(cluster 0)")
axes[2].set_title("GMM Soft Probabilities\n(color = probability of cluster 0)")
plt.suptitle("K-Means vs GMM on Elongated Clusters", fontsize=13)
plt.tight_layout()
plt.savefig("gmm_vs_kmeans.png", dpi=100, bbox_inches="tight")
plt.show()
# GMM COVARIANCE TYPES -- choose based on your assumptions about cluster shape
print("GMM covariance types:")
covariance_types = {
    "full": "Each component has its own covariance matrix. Most flexible. Use when clusters have different shapes and orientations.",
    "tied": "All components share one covariance matrix. Good when clusters have the same shape.",
    "diag": "Diagonal covariances (no feature correlation within a component). Fewer parameters, faster.",
    "spherical": "One variance per component (circular clusters). Fewest parameters; behaves like a soft-assignment K-Means.",
}
for cov_type, desc in covariance_types.items():
    print(f"  {cov_type:12s}: {desc}")
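The practical tradeoff between these options is flexibility versus parameter count. A standalone sketch comparing them on synthetic data (it reads sklearn's private `_n_parameters()` helper purely for illustration; BIC already penalizes parameter count internally, so you normally just compare BIC):

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic 2-D data, 3 clusters (standalone; not the X_sc from above)
X, _ = make_blobs(n_samples=400, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

for cov_type in ["full", "tied", "diag", "spherical"]:
    gm = GaussianMixture(n_components=3, covariance_type=cov_type,
                         random_state=42, n_init=3).fit(X)
    n_params = gm._n_parameters()  # private helper -- used here only to show the tradeoff
    print(f"{cov_type:10s} BIC={gm.bic(X):8.1f}  params={n_params}")
```

On well-separated blobs the cheaper types often win on BIC because the extra covariance parameters of `"full"` buy little likelihood.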
# MODEL SELECTION: BIC / AIC
print("\nBIC scores (lower = better model) for different n_components:")
bic_scores = []
for n_comp in range(2, 9):
    gm = GaussianMixture(n_components=n_comp, covariance_type="full", random_state=42, n_init=3)
    gm.fit(X_sc)
    bic_scores.append((n_comp, gm.bic(X_sc), gm.aic(X_sc)))
    print(f"  n_components={n_comp}: BIC={gm.bic(X_sc):.1f} | AIC={gm.aic(X_sc):.1f}")
best_n = min(bic_scores, key=lambda x: x[1])
print(f"\nBest n_components by BIC: {best_n[0]}")
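The anomaly-detection use mentioned at the top follows directly from the density view: a fitted GMM assigns every point a per-sample log-density via `score_samples`, and unusually low values can be flagged. A standalone sketch (the 5% cutoff is an arbitrary assumption to tune per application):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Normal data: 3 Gaussian clusters; anomalies: uniform scatter over a wider box
rng = np.random.default_rng(42)
X, _ = make_blobs(n_samples=400, centers=3, random_state=42)
outliers = rng.uniform(low=X.min() - 5, high=X.max() + 5, size=(20, 2))

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
log_lik = gmm.score_samples(np.vstack([X, outliers]))  # per-sample log-density

# Flag the lowest 5% as anomalies -- the threshold is a tunable assumption
threshold = np.percentile(log_lik, 5)
is_anomaly = log_lik < threshold
print(f"Flagged {is_anomaly.sum()} of {len(log_lik)} points as anomalies")
```

In practice the threshold is chosen from a validation set or a known contamination rate rather than a fixed percentile.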
Tip
Practice Gaussian mixture model soft clustering in small, isolated examples before integrating it into larger projects. Breaking a concept into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of Gaussian mixture model soft clustering from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with Gaussian mixture model soft clustering is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
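One way to enforce such boundary checks is a small validating wrapper around the fit call. A sketch (the `safe_gmm_fit` helper name and the specific checks are illustrative assumptions, not a sklearn API):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def safe_gmm_fit(X, n_components=3, **kwargs):
    # Hypothetical guard helper: fail fast with clear messages on bad input
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError(f"Expected a non-empty 2-D array, got shape {X.shape}")
    if np.isnan(X).any():
        raise ValueError("Input contains NaN values -- impute or drop them first")
    if X.shape[0] < n_components:
        raise ValueError(f"Need at least {n_components} samples, got {X.shape[0]}")
    return GaussianMixture(n_components=n_components, **kwargs).fit(X)

# Valid input fits normally; empty or NaN input raises before sklearn sees it
model = safe_gmm_fit(np.random.RandomState(0).randn(100, 2), n_components=2)
print(type(model).__name__)  # GaussianMixture
```

Note in particular the sample-count check: fitting with fewer samples than components cannot work, and catching it early gives a far clearer error than the one raised deep inside EM.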