Anomaly Detection — Isolation Forest & LOF
Anomaly detection finds unusual observations without needing labeled anomaly examples. Isolation Forest isolates points by recursively choosing a random feature and a random split value; anomalies are separated from the rest of the data in fewer splits, so they have short average path lengths in the random trees. Local Outlier Factor (LOF) compares a point's local density to that of its neighbors; anomalies lie in regions much sparser than their neighborhoods. Common applications include fraud detection, manufacturing defect screening, network intrusion detection, and sensor malfunction detection.
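To make the short-path intuition concrete, here is a minimal sketch (not scikit-learn's actual tree construction) that applies random axis-aligned splits until a chosen point sits alone. An obvious outlier typically isolates in far fewer splits than a point inside the cluster:
import numpy as np

def isolation_path_length(point, data, rng, max_depth=50):
    # Count random axis-aligned splits needed to isolate `point` from `data`
    depth = 0
    while len(data) > 1 and depth < max_depth:
        f = rng.integers(data.shape[1])  # random feature
        lo, hi = data[:, f].min(), data[:, f].max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)      # random split value
        side = data[:, f] <= split if point[f] <= split else data[:, f] > split
        data = data[side]                # keep the half containing `point`
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0, 1, (200, 2))
outlier = np.array([6.0, 6.0])
data = np.vstack([cluster, outlier.reshape(1, -1)])
# Averaged over many random partitions, the outlier isolates much faster
print("outlier avg path:", np.mean([isolation_path_length(outlier, data, rng) for _ in range(100)]))
print("inlier avg path: ", np.mean([isolation_path_length(cluster[0], data, rng) for _ in range(100)]))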
Isolation Forest and Local Outlier Factor
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
np.random.seed(42)
# SIMULATE SENSOR DATA WITH ANOMALIES
n_normal = 500
n_anomaly = 25
# Normal operational data
normal_data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n_normal)
# Anomalies: equipment failures (far from the normal cluster)
# Oversample candidates, then keep points outside the cluster, so exactly n_anomaly remain
candidates = np.random.uniform(-5, 5, (n_anomaly * 10, 2))
anomaly_data = candidates[np.linalg.norm(candidates, axis=1) > 3][:n_anomaly]
X = np.vstack([normal_data, anomaly_data])
y_true = np.array([1] * n_normal + [-1] * n_anomaly) # 1=normal, -1=anomaly
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
# ISOLATION FOREST
iso_forest = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
y_if = iso_forest.fit_predict(X_sc) # 1=normal, -1=anomaly
scores_if = -iso_forest.score_samples(X_sc) # higher = more anomalous
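# Unlike LOF's default mode, a fitted IsolationForest can score unseen points.
# The readings below are hypothetical values, scaled like the training data:
new_points = scaler.transform([[0.1, -0.2], [4.5, 4.5]])
print("New-point predictions (1=normal, -1=anomaly):", iso_forest.predict(new_points))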
# LOCAL OUTLIER FACTOR
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_lof = lof.fit_predict(X_sc) # 1=normal, -1=anomaly
scores_lof = -lof.negative_outlier_factor_ # higher = more anomalous
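# LOF's default mode only labels the data it was fit on. To score new data,
# refit with novelty=True and use predict() -- a minimal sketch:
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_sc)
print("LOF novelty predictions:", lof_novelty.predict(new_points))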
# EVALUATE (ground truth is available here; in real unsupervised settings it isn't)
print("Isolation Forest:")
print(classification_report(y_true, y_if, labels=[-1, 1], target_names=["Anomaly", "Normal"]))
print("Local Outlier Factor:")
print(classification_report(y_true, y_lof, labels=[-1, 1], target_names=["Anomaly", "Normal"]))
auc_if = roc_auc_score(y_true == -1, scores_if)
auc_lof = roc_auc_score(y_true == -1, scores_lof)
print(f"Isolation Forest AUC: {auc_if:.4f}")
print(f"LOF AUC: {auc_lof:.4f}")
# VISUALIZE
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, (preds, title) in zip(axes, [
    (y_true, "Ground Truth"),
    (y_if, "Isolation Forest"),
    (y_lof, "Local Outlier Factor"),
]):
    colors = ["tomato" if p == -1 else "steelblue" for p in preds]
    ax.scatter(X[:, 0], X[:, 1], c=colors, s=30, alpha=0.7)
    ax.set_title(title)
    n_anom = (preds == -1).sum()
    ax.set_xlabel(f"{n_anom} anomalies detected")
plt.tight_layout()
plt.savefig("anomaly_detection.png", dpi=100, bbox_inches="tight")
plt.show()
# CHOOSING ALGORITHM
algo_guide = {
    "Isolation Forest": "Best for large datasets (>10k rows); low contamination (fraud, defects)",
    "LOF": "Better for local density anomalies; smaller datasets (<10k); needs novelty=True to predict on new data",
    "One-Class SVM": "High-dimensional data; can be slow on large datasets",
    "AutoEncoder (DL)": "Complex high-dim data (images, text); when normal patterns are intricate",
}
print("\nAnomaly Detection Algorithm Guide:")
for algo, use_case in algo_guide.items():
print(f" {algo:20s}: {use_case}")Tip
Tip
Practice Isolation Forest and LOF on small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working Isolation Forest and LOF example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with Isolation Forest and LOF is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
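As a starting point, here is a minimal validation sketch; the helper name and the specific checks are illustrative assumptions, not a standard API:
import numpy as np

def validate_sensor_data(X):
    # Reject inputs the detectors cannot handle before fitting
    X = np.asarray(X, dtype=float)  # non-numeric values raise ValueError here
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError("expected a non-empty 2D array of samples")
    if np.isnan(X).any():
        bad_rows = int(np.isnan(X).any(axis=1).sum())
        raise ValueError(f"{bad_rows} rows contain NaN; impute or drop them first")
    return X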