Class Imbalance Detection & Visualization
Class imbalance occurs when one class has far fewer samples than the others. It is common in fraud detection (~0.1% fraud), medical diagnosis (~5% disease), and churn prediction (~10% churn). Imbalance pushes models toward always predicting the majority class — achieving high accuracy but zero utility. EDA catches this before you waste time training a useless model.
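The accuracy trap is easy to demonstrate on synthetic labels (a minimal sketch; the 1% positive rate is an illustrative assumption, not tied to any real dataset):

```python
import numpy as np

# Hypothetical labels: ~1% positive class (assumption for illustration)
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()                        # high -- looks impressive
recall = y_pred[y_true == 1].sum() / max(y_true.sum(), 1)   # 0.0 -- catches nothing

print(f"Accuracy: {accuracy:.1%}, Recall on positives: {recall:.1%}")
```

High accuracy with zero recall is exactly the failure mode the rest of this section is designed to catch early.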
Detecting and Quantifying Class Imbalance
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
# CREATE IMBALANCED DATASET (realistic fraud scenario)
n = 10000
df = pd.DataFrame({
"amount": np.random.exponential(150, n),
"hour": np.random.randint(0, 24, n),
"merchant": np.random.choice(["retail", "online", "food", "travel"], n),
"is_fraud": np.random.choice([0, 1], n, p=[0.994, 0.006]), # 0.6% fraud -- realistic!
})
# STEP 1: DETECT IMBALANCE
counts = df["is_fraud"].value_counts()
proportions = df["is_fraud"].value_counts(normalize=True)
print("Class Distribution:")
print(f" Legitimate: {counts[0]:,} ({proportions[0]:.1%})")
print(f" Fraud: {counts[1]:,} ({proportions[1]:.1%})")
print(f" Imbalance ratio: {counts[0]/counts[1]:.0f}:1")
# SEVERITY GUIDE
imbalance_ratios = {
"2:1 (mild)": "No special treatment usually needed",
"5:1 (moderate)": "Consider class_weight='balanced' in model",
"10:1 (high)": "Use SMOTE or class weights; use F1/ROC-AUC not accuracy",
"50:1 (severe)": "Anomaly detection approach; careful evaluation; business cost matrix",
"100:1 (extreme)": "Consider precision-recall tradeoff; alert on any positive",
}
print("\nImbalance severity guide:")
for ratio, advice in imbalance_ratios.items():
print(f" {ratio:20s}: {advice}")
# STEP 2: VISUALIZE
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Bar chart
counts.plot(kind="bar", ax=axes[0], color=["steelblue", "tomato"], rot=0)  # plot counts so the y-axis matches the label
axes[0].set_title("Class Distribution")
axes[0].set_xticklabels(["Legitimate", "Fraud"])
axes[0].set_ylabel("Count")
for bar in axes[0].patches:
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
                 f"{bar.get_height():.0f}", ha="center", va="bottom", fontsize=9)
# Compare feature distributions by class
for i, col in enumerate(["amount", "hour"], start=1):
    for label, color in [(0, "steelblue"), (1, "tomato")]:
        name = "legitimate" if label == 0 else "fraud"
        sns.kdeplot(df[df["is_fraud"] == label][col], ax=axes[i], fill=True, alpha=0.4, label=name, color=color)
    axes[i].set_title(f"{col} Distribution by Class")
    axes[i].legend()
plt.tight_layout()
plt.savefig("class_imbalance.png", dpi=100, bbox_inches="tight")
plt.show()
# STEP 3: WHAT THIS MEANS FOR MODELING
print("\nImplication for modeling:")
print(" Naive model (always predict 0):")
print(f" Accuracy: {proportions[0]:.1%} -- looks great! But catches ZERO fraud")
print(" Solution: Use class_weight='balanced' or SMOTE + evaluate with F1/ROC-AUC")
print(" Module 10 covers imbalanced learning in depth")Tip
Tip
Practice Class Imbalance Detection & Visualization in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Note
Practice Task — (1) Write a working example of Class Imbalance Detection & Visualization from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
Warning
A common mistake with Class Imbalance Detection & Visualization is skipping edge-case testing — empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
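One way to guard those boundary conditions is a small validation helper run before computing any imbalance statistics (a hypothetical sketch; `check_class_balance` is not a standard function):

```python
import pandas as pd

def check_class_balance(series: pd.Series) -> dict:
    """Validate a label column before computing imbalance stats (hypothetical helper)."""
    if series.empty:
        raise ValueError("label column is empty")
    if series.isna().any():
        raise ValueError(f"{series.isna().sum()} null labels found -- clean before analysis")
    counts = series.value_counts()  # sorted descending: majority first
    if len(counts) < 2:
        raise ValueError("only one class present -- no imbalance ratio is defined")
    return {"ratio": counts.iloc[0] / counts.iloc[-1], "n_classes": len(counts)}

print(check_class_balance(pd.Series([0] * 99 + [1])))  # {'ratio': 99.0, 'n_classes': 2}
```

Raising early on an empty, null-ridden, or single-class column turns silent nonsense (a crash or a meaningless ratio) into a clear, actionable error.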