Mini Project: Cleaning a Real-World Messy Dataset
Apply everything from this module to clean a realistic messy customer dataset with missing values, wrong types, duplicates, inconsistent formatting, and outliers. Produce a clean, ML-ready DataFrame with a documented cleaning pipeline that can be rerun on new data.
End-to-End Data Cleaning Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

def generate_messy_dataset(n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Simulate a messy customer dataset: mixed types, sentinels, blanks, duplicates."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame({
        "CustomerID": range(1001, 1001 + n),
        # Ages stored as strings, with "unknown"/"N/A" sentinels mixed in
        "Age": rng.choice([str(x) for x in rng.randint(18, 70, n)] + ["unknown", "N/A"], n),
        # Currency-formatted strings; note the whitespace in the column name
        " Income ": [f"${rng.randint(20, 150) * 1000:,}" if rng.random() > 0.1 else "" for _ in range(n)],
        # Inconsistent casing plus real nulls
        "Education": rng.choice(["Bachelor", "BACHELOR", "bachelor", "Master", "MASTER", "PhD", "High School", None], n),
        # 9999 is a sentinel for "unknown"; a few true nulls as well
        "Credit_Score": list(rng.normal(680, 80, n - 20).clip(300, 850).astype(int)) + [9999] * 10 + [None] * 10,
        "Loan_Amount": rng.exponential(15000, n).clip(1000, 80000).round(0),
        # Target encoded inconsistently: ints, strings, and nulls
        "Default": rng.choice([0, 1, "yes", "no", None], n, p=[0.45, 0.1, 0.1, 0.3, 0.05]),
    })
    # Add duplicates (10% of rows)
    dupes = df.sample(int(n * 0.1), random_state=seed)
    return pd.concat([df, dupes], ignore_index=True)
# GENERATE AND CLEAN
raw = generate_messy_dataset(n=1000)
print(f"Raw data: {raw.shape}")
print(f"Duplicates: {raw.duplicated().sum()}")
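Before writing any fixes, it helps to audit the mess first. The sketch below profiles a tiny stand-in frame (the column values here are hypothetical, not drawn from the generator above) with `dtypes` and `value_counts`:

```python
import pandas as pd

# Small stand-in for the raw dataset (hypothetical values)
raw = pd.DataFrame({
    "Age": ["34", "unknown", "51", "N/A"],
    " Income ": ["$45,000", "", "$120,000", "$88,000"],
    "Default": [0, "yes", None, "no"],
})

# Every column arrives as object dtype, so dtypes alone flags trouble
print(raw.dtypes)

# value_counts(dropna=False) surfaces sentinels and nulls per column
for col in raw.columns:
    print(raw[col].value_counts(dropna=False))

# Count values that fail numeric parsing in a supposedly numeric column
bad_ages = pd.to_numeric(raw["Age"], errors="coerce").isna().sum()
print(f"Unparseable ages: {bad_ages}")  # 2 of 4 rows
```

This audit is what justifies each numbered step in the pipeline below: every sentinel or format quirk you find becomes one cleaning rule.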
def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Run the full cleaning pipeline; safe to rerun on new raw data."""
    df = df.copy()

    # 1. Standardize column names (" Income " -> "income")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # 2. Remove duplicates by primary key
    df = df.drop_duplicates(subset=["customerid"])

    # 3. Clean age: coerce sentinels ("unknown", "N/A") to NaN, drop impossible values
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df.loc[(df["age"] < 18) | (df["age"] > 100), "age"] = np.nan

    # 4. Clean income: strip "$" and thousands separators, blank -> NaN
    df["income"] = (df["income"].astype(str).str.strip()
                    .str.replace(r"[\$,]", "", regex=True)
                    .replace({"": np.nan}))
    df["income"] = pd.to_numeric(df["income"], errors="coerce")

    # 5. Standardize education casing; nulls become an explicit "missing" category
    #    (without the fillna, None values would survive to the final frame)
    edu_map = {"bachelor": "bachelor", "master": "master", "phd": "phd", "high school": "high_school"}
    df["education"] = df["education"].str.lower().str.strip().map(edu_map).fillna("missing")

    # 6. Fix credit score (9999 is a sentinel for unknown)
    df["credit_score"] = pd.to_numeric(df["credit_score"], errors="coerce")
    df.loc[df["credit_score"] > 850, "credit_score"] = np.nan

    # 7. Standardize target variable to nullable integer 0/1
    df["default"] = df["default"].replace({"yes": 1, "no": 0})
    df["default"] = pd.to_numeric(df["default"], errors="coerce").astype("Int64")

    # 8. Add missingness indicators before imputation (the pattern may be predictive)
    for col in ["income", "credit_score"]:
        df[f"{col}_missing"] = df[col].isnull().astype(int)

    # 9. Impute remaining numeric nulls with the median
    num_cols = ["age", "income", "credit_score"]
    imputer = SimpleImputer(strategy="median")
    df[num_cols] = imputer.fit_transform(df[num_cols])

    # 10. Drop rows with missing target (can't train on unlabeled rows)
    df = df.dropna(subset=["default"])
    return df
clean = clean_dataset(raw)
print(f"\nClean data: {clean.shape}")
print(f"Missing values: {clean.isnull().sum().sum()}")
print(f"Dtypes:\n{clean.dtypes}")
print(f"\nDefault rate: {clean['default'].mean():.1%}")
print("\nReady for ML!")
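Because clean_dataset is a pure function of its input, it can be rerun on every new batch of raw data; pairing it with an invariant check catches regressions early. A minimal sketch, where validate_clean and the toy frame are assumptions layered on top of the pipeline, not part of it:

```python
import pandas as pd

def validate_clean(df: pd.DataFrame) -> list[str]:
    """Return the list of violated invariants; an empty list means the frame passes."""
    problems = []
    if df["customerid"].duplicated().any():
        problems.append("duplicate customer ids")
    if df[["age", "income", "credit_score"]].isnull().any().any():
        problems.append("nulls in imputed numeric columns")
    if not df["age"].between(18, 100).all():
        problems.append("age out of range")
    if not df["credit_score"].between(300, 850).all():
        problems.append("credit score out of range")
    if not df["default"].isin([0, 1]).all():
        problems.append("non-binary target")
    return problems

# Toy frame shaped like clean_dataset's output (hypothetical values)
clean = pd.DataFrame({
    "customerid": [1001, 1002],
    "age": [34.0, 51.0],
    "income": [45000.0, 120000.0],
    "credit_score": [680.0, 710.0],
    "default": pd.array([0, 1], dtype="Int64"),
})
print(validate_clean(clean))  # []
```

Running the validator after every rerun turns the documented cleaning rules into executable checks, which is what makes the pipeline trustworthy on data you have not inspected by hand.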
Tip
Practice each step of this mini project in small, isolated examples before integrating it into larger projects. Breaking the pipeline into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working version of this cleaning pipeline from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or unexpected data type). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with a cleaning pipeline like this one is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
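One concrete way to test those boundaries: isolate a single step, such as the income parsing in step 4 above, and push empty, all-null, and junk inputs through it. A sketch, where parse_income is a hypothetical helper mirroring that step, not a function defined in the pipeline:

```python
import pandas as pd
import numpy as np

def parse_income(s: pd.Series) -> pd.Series:
    """Strip '$' and thousands separators, then coerce to float (hypothetical helper)."""
    out = (s.astype(str).str.strip()
           .str.replace(r"[\$,]", "", regex=True)
           .replace({"": np.nan, "None": np.nan, "nan": np.nan}))
    return pd.to_numeric(out, errors="coerce")

# Edge case 1: empty input should come back empty, not raise
assert parse_income(pd.Series([], dtype=object)).empty

# Edge case 2: all-null input should stay all-null
assert parse_income(pd.Series([None, None])).isna().all()

# Edge case 3: mixed junk -- only the parseable value survives
mixed = parse_income(pd.Series(["$1,000", "", "oops", None]))
assert mixed.iloc[0] == 1000.0
assert mixed.isna().sum() == 3
```

Note the extra `"None"` and `"nan"` entries in the replace map: `astype(str)` turns real nulls into those literal strings, a boundary condition the full pipeline never hits only because its income column contains no true nulls.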