Detecting & Handling Outliers
Outliers can be genuine rare events (fraud, defect), data entry errors, or sensor malfunctions. Blindly removing them is as dangerous as blindly keeping them. Use multiple detection methods, then decide: remove, cap (winsorize), or model separately. Tree-based models (Random Forest, XGBoost) are naturally robust to outliers. Linear models are very sensitive.
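A quick way to see why robust methods matter: a single extreme value drags the mean far off while the median barely moves, which is why the IQR method below is preferred over mean-based rules for detection. A minimal sketch with illustrative values:

```python
import numpy as np

values = np.array([52.0, 48.0, 50.0, 51.0, 49.0])
with_outlier = np.append(values, 5000.0)  # one simulated data entry error

# The mean is pulled far away by the single outlier...
print(f"mean: {values.mean():.1f} -> {with_outlier.mean():.1f}")
# ...while the median (the basis of the IQR method) stays stable
print(f"median: {np.median(values):.1f} -> {np.median(with_outlier):.1f}")
```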
Outlier Detection and Treatment
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(42)
n = 500
income = np.random.exponential(60000, n)
# Inject outliers
income[10:13] = [500000, 800000, 1200000] # data entry errors?
income[20] = -5000 # impossible negative
df = pd.DataFrame({"income": income})
# METHOD 1: DESCRIPTIVE STATS
print("Basic statistics:")
print(df["income"].describe().round(0))
# METHOD 2: IQR (Interquartile Range) -- robust to outliers
Q1 = df["income"].quantile(0.25)
Q3 = df["income"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers_iqr = df[(df["income"] < lower) | (df["income"] > upper)]
print(f"\nIQR method: lower={lower:.0f}, upper={upper:.0f}")
print(f"Outliers detected: {len(outliers_iqr)} ({len(outliers_iqr)/len(df):.1%})")
print(f"Outlier values: {sorted(outliers_iqr['income'].values.round(0).tolist())}")
# METHOD 3: Z-SCORE -- how many std devs from mean
z_scores = np.abs(stats.zscore(df["income"]))
outliers_z = df[z_scores > 3]
print(f"\nZ-score (>3) method: {len(outliers_z)} outliers detected")
# METHOD 4: Percentile capping (Winsorization) -- preserve rows, cap extreme values
lower_cap = df["income"].quantile(0.01)
upper_cap = df["income"].quantile(0.99)
df["income_capped"] = df["income"].clip(lower=lower_cap, upper=upper_cap)
print(f"\nWinsorization: cap at [{lower_cap:.0f}, {upper_cap:.0f}]")
print(f"Max before: {df['income'].max():.0f} | Max after capping: {df['income_capped'].max():.0f}")
# WHICH TREATMENT TO CHOOSE?
decision_guide = {
    "Remove rows": "Use when outlier is clearly a data error AND you have many rows",
    "Cap/Winsorize": "Use when outlier may be real but extreme -- preserves row, limits influence",
    "Log transform": "Use for right-skewed data (income, prices) -- log(x) compresses extremes",
    "Keep as-is": "Use with tree models (rf, xgboost) -- they naturally handle outliers",
    "Separate model": "Use for fraud/anomaly detection -- the outlier IS the signal",
}
print("\nOutlier Treatment Decision Guide:")
for treatment, when_to_use in decision_guide.items():
    print(f"  {treatment:20s}: {when_to_use}")
# LOG TRANSFORM for right-skewed income
df["income_log"] = np.log1p(df["income"].clip(lower=0)) # clip negatives to 0; log1p handles zeros
print(f"\nOriginal skewness: {df['income'].skew():.2f}")
print(f"Log-transformed skewness: {df['income_log'].skew():.2f}")
print("(Closer to 0 = more normal distribution = better for linear models)")
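One weakness of the plain z-score above: the mean and standard deviation it relies on are themselves inflated by the very outliers being hunted, which can mask them. A common robust alternative is the modified z-score based on the median absolute deviation (MAD), where the 0.6745 factor rescales MAD to match the standard deviation under normality. A sketch on the same simulated income data (the 3.5 cutoff is the conventional Iglewicz-Hoaglin choice, not part of the original code):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
income = np.random.exponential(60000, 500)
income[10:13] = [500000, 800000, 1200000]  # injected extremes, as above
income[20] = -5000
df = pd.DataFrame({"income": income})

# Modified z-score: robust center (median) and robust spread (MAD)
median = df["income"].median()
mad = (df["income"] - median).abs().median()
modified_z = 0.6745 * (df["income"] - median) / mad

# Conventional cutoff of 3.5
outliers_mad = df[modified_z.abs() > 3.5]
print(f"Modified z-score (>3.5): {len(outliers_mad)} outliers detected")
```

Note that on strongly right-skewed data this method still flags the injected extremes, but a mildly negative value like -5000 may fall within the cutoff; no single method catches everything, which is why the text recommends combining them.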
Tip
Practice detecting & handling outliers in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Practice Task: (1) Write a working example of detecting & handling outliers from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with detecting & handling outliers is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
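Following that advice, the IQR detection logic from the code above can be wrapped in a small helper that survives the usual edge cases: an empty Series, all-NaN input, or non-numeric data. A minimal sketch (the function name detect_outliers_iqr is illustrative):

```python
import pandas as pd

def detect_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the outlier values in s, handling empty/NaN/non-numeric input."""
    s = pd.to_numeric(s, errors="coerce").dropna()  # non-numeric -> NaN -> dropped
    if s.empty:
        return s  # nothing to flag on empty input
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

print(detect_outliers_iqr(pd.Series([], dtype=float)))     # empty in, empty out
print(detect_outliers_iqr(pd.Series([1, 2, 3, 2, 1000])))  # flags the 1000
```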