Handling Missing Values — Strategies & Imputation
Missing data appears in virtually every real dataset. The right strategy depends on WHY the data are missing: MCAR (Missing Completely At Random: missingness is unrelated to anything, so dropping or simple imputation is safe), MAR (Missing At Random: missingness depends on other observed features, so impute from those features), and MNAR (Missing Not At Random: the fact that a value is missing is itself informative, e.g., patients who drop out of a trial tend to have worse outcomes). Never blindly drop or fill values without understanding the cause.
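To make the three mechanisms concrete, here is a small standalone simulation (a sketch with made-up columns, separate from the dataset used below): MCAR deletes values uniformly at random, MAR ties missingness to an observed column, and MNAR ties it to the value that goes missing.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.normal(40, 10, 1000),
                    "income": rng.exponential(50000, 1000)})
# MCAR: every value equally likely to vanish
mcar = toy["income"].mask(rng.random(1000) < 0.1)
# MAR: missingness driven by another OBSERVED column (age)
mar = toy["income"].mask((toy["age"] < 30) & (rng.random(1000) < 0.4))
# MNAR: missingness driven by the unobserved value itself (high incomes hidden)
mnar = toy["income"].mask((toy["income"] > 100_000) & (rng.random(1000) < 0.6))
print(mcar.isnull().mean(), mar.isnull().mean(), mnar.isnull().mean())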
Missing Value Analysis and Imputation
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
np.random.seed(42)
n = 500
df = pd.DataFrame({
    "age": np.random.normal(40, 10, n).clip(18, 80),
    "income": np.random.exponential(55000, n).clip(15000, 250000),
    "credit_score": np.random.normal(680, 80, n).clip(300, 850),
    "employment": np.random.choice(["full-time", "part-time", "unemployed"], n, p=[0.7, 0.2, 0.1]),
    "default": np.random.choice([0, 1], n, p=[0.85, 0.15]),
})
# Inject missing values at random (MCAR by construction; real patterns are rarely this clean)
df.loc[df.sample(50, random_state=1).index, "income"] = np.nan # 10% missing
df.loc[df.sample(30, random_state=2).index, "credit_score"] = np.nan # 6% missing
df.loc[df.sample(15, random_state=3).index, "employment"] = np.nan # 3% missing
df.loc[df.sample(10, random_state=4).index, "age"] = np.nan # 2% missing
# STEP 1: ANALYZE MISSING PATTERNS
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
print("Missing value analysis:")
print(pd.DataFrame({"Count": missing, "Percentage": missing_pct}))
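# Heuristic MAR check: do other features differ between rows where income
# is missing vs. observed? Clear gaps argue against MCAR. (Here the gaps
# should be small, since we injected missingness completely at random.)
print(df.groupby(df["income"].isnull())[["age", "credit_score"]].mean())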
# Rule of thumb: >50% missing -> usually drop the column
# MNAR: missingness is related to the value itself -> "missing" as a feature
# STEP 2: ADD MISSINGNESS INDICATORS (when MNAR suspected)
df["income_missing"] = df["income"].isnull().astype(int)
df["credit_score_missing"] = df["credit_score"].isnull().astype(int)
# STEP 3: IMPUTATION STRATEGIES
numeric_cols = ["age", "income", "credit_score"]
# A) Mean / Median imputation (fast, assumes MCAR)
median_imputer = SimpleImputer(strategy="median")
df_median = df.copy()
df_median[numeric_cols] = median_imputer.fit_transform(df[numeric_cols])
print("\nMedian imputation -- income null:", df_median["income"].isnull().sum())
# B) KNN Imputation (uses similar rows to impute -- better for MAR)
knn_imputer = KNNImputer(n_neighbors=5)
df_knn = df.copy()
df_knn[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
print("KNN imputation -- income null:", df_knn["income"].isnull().sum())
# C) Iterative imputation (MICE-style -- models each feature as a function of the others)
iter_imputer = IterativeImputer(random_state=42, max_iter=10)
df_iter = df.copy()
df_iter[numeric_cols] = iter_imputer.fit_transform(df[numeric_cols])
print("Iterative impute -- income null:", df_iter["income"].isnull().sum())
# D) Categorical: mode imputation
mode_imputer = SimpleImputer(strategy="most_frequent")
df["employment"] = mode_imputer.fit_transform(df[["employment"]]).ravel()
# COMPARE STRATEGIES
print("\nIncome distribution after imputation:")
strategies = {"Original (no nulls)": df["income"].dropna(),
"Median": df_median["income"], "KNN": df_knn["income"], "Iterative": df_iter["income"]}
for name, series in strategies.items():
print(f" {name:22s}: mean={series.mean():,.0f}, std={series.std():,.0f}")Tip
Tip
Practice missing-value handling and imputation in small, isolated examples before integrating them into larger projects; small experiments build genuine understanding faster than reading alone. And remember the adage: roughly 80% of ML work is data preparation, so garbage in means garbage out.
Practice Task
(1) Write a working missing-value imputation example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, an all-NaN column, or an unexpected dtype). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with missing-value imputation is skipping edge-case testing: empty inputs, all-NaN columns, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
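One boundary case worth testing explicitly: a column that is entirely NaN. By default SimpleImputer drops such columns from its output (recent scikit-learn versions expose keep_empty_features=True to retain them), which can silently change your feature count. A quick check:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [2.0, np.nan], [3.0, np.nan]])
out = SimpleImputer(strategy="median").fit_transform(X)
print(out.shape)  # (3, 1): the all-NaN column was silently dropped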