Detecting & Handling Outliers
Outliers can be genuine rare events (fraud, defect), data entry errors, or sensor malfunctions. Blindly removing them is as dangerous as blindly keeping them. Use multiple detection methods, then decide: remove, cap (winsorize), or model separately. Tree-based models (Random Forest, XGBoost) are naturally robust to outliers. Linear models are very sensitive.
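A quick way to see why robust methods matter: a single extreme value drags the mean far off while the median barely moves, which is why the IQR method below is preferred over mean-based rules for detection. A minimal sketch with illustrative values:

```python
import numpy as np

values = np.array([52.0, 48.0, 50.0, 51.0, 49.0])
with_outlier = np.append(values, 5000.0)  # one simulated data entry error

# The mean is pulled far away by the single outlier...
print(f"mean: {values.mean():.1f} -> {with_outlier.mean():.1f}")
# ...while the median (the basis of the IQR method) stays stable
print(f"median: {np.median(values):.1f} -> {np.median(with_outlier):.1f}")
```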
Outlier Detection and Treatment
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(42)
n = 500
income = np.random.exponential(60000, n)
# Inject outliers
income[10:13] = [500000, 800000, 1200000] # data entry errors?
income[20] = -5000 # impossible negative
df = pd.DataFrame({"income": income})
# METHOD 1: DESCRIPTIVE STATS
print("Basic statistics:")
print(df["income"].describe().round(0))
# METHOD 2: IQR (Interquartile Range) -- robust to outliers
Q1 = df["income"].quantile(0.25)
Q3 = df["income"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers_iqr = df[(df["income"] < lower) | (df["income"] > upper)]
print(f"\nIQR method: lower={lower:.0f}, upper={upper:.0f}")
print(f"Outliers detected: {len(outliers_iqr)} ({len(outliers_iqr)/len(df):.1%})")
print(f"Outlier values: {sorted(outliers_iqr['income'].values.round(0).tolist())}")
# METHOD 3: Z-SCORE -- how many std devs from mean
z_scores = np.abs(stats.zscore(df["income"]))
outliers_z = df[z_scores > 3]
print(f"\nZ-score (>3) method: {len(outliers_z)} outliers detected")
# METHOD 4: Percentile capping (Winsorization) -- preserve rows, cap extreme values
lower_cap = df["income"].quantile(0.01)
upper_cap = df["income"].quantile(0.99)
df["income_capped"] = df["income"].clip(lower=lower_cap, upper=upper_cap)
print(f"\nWinsorization: cap at [{lower_cap:.0f}, {upper_cap:.0f}]")
print(f"Max before: {df['income'].max():.0f} | Max after capping: {df['income_capped'].max():.0f}")
# WHICH TREATMENT TO CHOOSE?
decision_guide = {
    "Remove rows": "Use when outlier is clearly a data error AND you have many rows",
    "Cap/Winsorize": "Use when outlier may be real but extreme -- preserves row, limits influence",
    "Log transform": "Use for right-skewed data (income, prices) -- log(x) compresses extremes",
    "Keep as-is": "Use with tree models (rf, xgboost) -- they naturally handle outliers",
    "Separate model": "Use for fraud/anomaly detection -- the outlier IS the signal",
}
print("\nOutlier Treatment Decision Guide:")
for treatment, when_to_use in decision_guide.items():
    print(f"  {treatment:20s}: {when_to_use}")
# LOG TRANSFORM for right-skewed income
df["income_log"] = np.log1p(df["income"].clip(lower=0)) # clip negatives to 0; log1p handles zeros
print(f"\nOriginal skewness: {df['income'].skew():.2f}")
print(f"Log-transformed skewness: {df['income_log'].skew():.2f}")
print("(Closer to 0 = more normal distribution = better for linear models)")
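One weakness of the plain z-score above: the mean and standard deviation it relies on are themselves inflated by the very outliers being hunted, which can mask them. A common robust alternative is the modified z-score based on the median absolute deviation (MAD), where the 0.6745 factor rescales MAD to match the standard deviation under normality. A sketch on the same simulated income data (the 3.5 cutoff is the conventional Iglewicz-Hoaglin choice, not part of the original code):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
income = np.random.exponential(60000, 500)
income[10:13] = [500000, 800000, 1200000]  # injected extremes, as above
income[20] = -5000
df = pd.DataFrame({"income": income})

# Modified z-score: robust center (median) and robust spread (MAD)
median = df["income"].median()
mad = (df["income"] - median).abs().median()
modified_z = 0.6745 * (df["income"] - median) / mad

# Conventional cutoff of 3.5
outliers_mad = df[modified_z.abs() > 3.5]
print(f"Modified z-score (>3.5): {len(outliers_mad)} outliers detected")
```

Note that on strongly right-skewed data this method still flags the injected extremes, but a mildly negative value like -5000 may fall within the cutoff; no single method catches everything, which is why the text recommends combining them.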
Tip
Practice detecting & handling outliers in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Practice Task: (1) Write a working example of detecting & handling outliers from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with detecting & handling outliers is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
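Following that advice, the IQR detection logic from the code above can be wrapped in a small helper that survives the usual edge cases: an empty Series, all-NaN input, or non-numeric data. A minimal sketch (the function name detect_outliers_iqr is illustrative):

```python
import pandas as pd

def detect_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the outlier values in s, handling empty/NaN/non-numeric input."""
    s = pd.to_numeric(s, errors="coerce").dropna()  # non-numeric -> NaN -> dropped
    if s.empty:
        return s  # nothing to flag on empty input
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

print(detect_outliers_iqr(pd.Series([], dtype=float)))     # empty in, empty out
print(detect_outliers_iqr(pd.Series([1, 2, 3, 2, 1000])))  # flags the 1000
```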