Handling Imbalanced Data — SMOTE & Class Weights
Two main strategies for imbalanced classification: (1) Adjust the algorithm — use class_weight='balanced' so that misclassifying the minority class is penalized more heavily (weights are set inversely proportional to class frequencies); no data modification needed, simple and fast. (2) Resample the data — SMOTE (Synthetic Minority Oversampling Technique) creates synthetic minority-class samples by interpolating in feature space between existing minority examples and their nearest minority neighbors. SMOTE is often the stronger choice for severe imbalance, but it should always be compared against the simpler class-weight baseline.
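To see what class_weight='balanced' actually does, the weights can be computed by hand. A minimal sketch using scikit-learn's compute_class_weight, which implements the formula n_samples / (n_classes * bincount(y)); the 95:5 toy labels here are illustrative:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # 95:5 imbalance, mirroring the dataset below

# sklearn's 'balanced' heuristic: n_samples / (n_classes * bincount(y))
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))

print(weights)  # minority class gets the larger weight (10.0 vs ~0.53)
print(manual)   # identical to the manual formula
```

The rarer the class, the larger its weight, so every minority error contributes proportionally more to the loss.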
Class Weights and SMOTE Oversampling
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
np.random.seed(42)
# IMBALANCED DATASET: 5% positive class (realistic churn scenario)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    random_state=42,
)
print(f"Class distribution: {np.bincount(y)} (ratio {np.bincount(y)[0]/np.bincount(y)[1]:.0f}:1)")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# APPROACH 1: NO ADJUSTMENT (wrong -- shows the problem)
model_naive = LogisticRegression(max_iter=1000, random_state=42)
model_naive.fit(X_train, y_train)
print("\nNaive model (no imbalance handling):")
print(classification_report(y_test, model_naive.predict(X_test), target_names=["no churn", "churn"], digits=3))
# APPROACH 2: CLASS WEIGHTS
model_weighted = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)
model_weighted.fit(X_train, y_train)
print("\nClass-weighted model:")
print(classification_report(y_test, model_weighted.predict(X_test), target_names=["no churn", "churn"], digits=3))
# APPROACH 3: SMOTE -- synthetic minority oversampling
# IMPORTANT: SMOTE must be applied ONLY to training data
smote = SMOTE(sampling_strategy=0.3, random_state=42) # make minority 30% of majority
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
print(f"\nAfter SMOTE: {np.bincount(y_train_resampled)} (was {np.bincount(y_train)})")
model_smote = LogisticRegression(max_iter=1000, random_state=42)
model_smote.fit(X_train_resampled, y_train_resampled)
print("\nSMOTE model:")
print(classification_report(y_test, model_smote.predict(X_test), target_names=["no churn", "churn"], digits=3))
# COMPARING AUC-ROC (better metric than accuracy for imbalanced data)
print("\nROC-AUC comparison (higher = better at distinguishing classes):")
for name, model in [
    ("Naive", model_naive),
    ("Class weights", model_weighted),
    ("SMOTE", model_smote),
]:
    roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    recall_pos = classification_report(y_test, model.predict(X_test), output_dict=True)["1"]["recall"]
    print(f"  {name:15s}: AUC={roc:.3f} | Minority recall={recall_pos:.3f}")
print("\nKey insight: Accuracy is misleading. Use AUC-ROC or F1 for imbalanced datasets.")
Tip
Practice class weights and SMOTE in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
80% of ML work is data preparation — garbage in = garbage out
Practice Task — (1) Write a working example of class weights and SMOTE from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with class weights and SMOTE is skipping edge-case testing — empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.