Mini Project: Cleaning a Real-World Messy Dataset
Apply everything from this module to clean a realistic messy customer dataset with missing values, wrong types, duplicates, inconsistent formatting, and outliers. Produce a clean, ML-ready DataFrame with a documented cleaning pipeline that can be rerun on new data.
End-to-End Data Cleaning Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

def generate_messy_dataset(n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Simulate a messy customer dataset: mixed types, sentinels, blanks, duplicates."""
    rng = np.random.RandomState(seed)
    df = pd.DataFrame({
        "CustomerID": range(1001, 1001 + n),
        # Ages stored as strings, with "unknown"/"N/A" sentinels mixed in
        "Age": rng.choice([str(x) for x in rng.randint(18, 70, n)] + ["unknown", "N/A"], n),
        # Currency-formatted strings; note the whitespace in the column name
        " Income ": [f"${rng.randint(20, 150) * 1000:,}" if rng.random() > 0.1 else "" for _ in range(n)],
        # Inconsistent casing plus real nulls
        "Education": rng.choice(["Bachelor", "BACHELOR", "bachelor", "Master", "MASTER", "PhD", "High School", None], n),
        # 9999 is a sentinel for "unknown"; a few true nulls as well
        "Credit_Score": list(rng.normal(680, 80, n - 20).clip(300, 850).astype(int)) + [9999] * 10 + [None] * 10,
        "Loan_Amount": rng.exponential(15000, n).clip(1000, 80000).round(0),
        # Target encoded inconsistently: ints, strings, and nulls
        "Default": rng.choice([0, 1, "yes", "no", None], n, p=[0.45, 0.1, 0.1, 0.3, 0.05]),
    })
    # Add duplicates (10% of rows)
    dupes = df.sample(int(n * 0.1), random_state=seed)
    return pd.concat([df, dupes], ignore_index=True)
# GENERATE AND CLEAN
raw = generate_messy_dataset(n=1000)
print(f"Raw data: {raw.shape}")
print(f"Duplicates: {raw.duplicated().sum()}")
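Before writing any fixes, it helps to audit the mess first. The sketch below profiles a tiny stand-in frame (the column values here are hypothetical, not drawn from the generator above) with `dtypes` and `value_counts`:

```python
import pandas as pd

# Small stand-in for the raw dataset (hypothetical values)
raw = pd.DataFrame({
    "Age": ["34", "unknown", "51", "N/A"],
    " Income ": ["$45,000", "", "$120,000", "$88,000"],
    "Default": [0, "yes", None, "no"],
})

# Every column arrives as object dtype, so dtypes alone flags trouble
print(raw.dtypes)

# value_counts(dropna=False) surfaces sentinels and nulls per column
for col in raw.columns:
    print(raw[col].value_counts(dropna=False))

# Count values that fail numeric parsing in a supposedly numeric column
bad_ages = pd.to_numeric(raw["Age"], errors="coerce").isna().sum()
print(f"Unparseable ages: {bad_ages}")  # 2 of 4 rows
```

This audit is what justifies each numbered step in the pipeline below: every sentinel or format quirk you find becomes one cleaning rule.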
def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Run the full cleaning pipeline; safe to rerun on new raw data."""
    df = df.copy()

    # 1. Standardize column names (" Income " -> "income")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # 2. Remove duplicates by primary key
    df = df.drop_duplicates(subset=["customerid"])

    # 3. Clean age: coerce sentinels ("unknown", "N/A") to NaN, drop impossible values
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df.loc[(df["age"] < 18) | (df["age"] > 100), "age"] = np.nan

    # 4. Clean income: strip "$" and thousands separators, blank -> NaN
    df["income"] = (df["income"].astype(str).str.strip()
                    .str.replace(r"[\$,]", "", regex=True)
                    .replace({"": np.nan}))
    df["income"] = pd.to_numeric(df["income"], errors="coerce")

    # 5. Standardize education casing; nulls become an explicit "missing" category
    #    (without the fillna, None values would survive to the final frame)
    edu_map = {"bachelor": "bachelor", "master": "master", "phd": "phd", "high school": "high_school"}
    df["education"] = df["education"].str.lower().str.strip().map(edu_map).fillna("missing")

    # 6. Fix credit score (9999 is a sentinel for unknown)
    df["credit_score"] = pd.to_numeric(df["credit_score"], errors="coerce")
    df.loc[df["credit_score"] > 850, "credit_score"] = np.nan

    # 7. Standardize target variable to nullable integer 0/1
    df["default"] = df["default"].replace({"yes": 1, "no": 0})
    df["default"] = pd.to_numeric(df["default"], errors="coerce").astype("Int64")

    # 8. Add missingness indicators before imputation (the pattern may be predictive)
    for col in ["income", "credit_score"]:
        df[f"{col}_missing"] = df[col].isnull().astype(int)

    # 9. Impute remaining numeric nulls with the median
    num_cols = ["age", "income", "credit_score"]
    imputer = SimpleImputer(strategy="median")
    df[num_cols] = imputer.fit_transform(df[num_cols])

    # 10. Drop rows with missing target (can't train on unlabeled rows)
    df = df.dropna(subset=["default"])
    return df
clean = clean_dataset(raw)
print(f"\nClean data: {clean.shape}")
print(f"Missing values: {clean.isnull().sum().sum()}")
print(f"Dtypes:\n{clean.dtypes}")
print(f"\nDefault rate: {clean['default'].mean():.1%}")
print("\nReady for ML!")
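Because clean_dataset is a pure function of its input, it can be rerun on every new batch of raw data; pairing it with an invariant check catches regressions early. A minimal sketch, where validate_clean and the toy frame are assumptions layered on top of the pipeline, not part of it:

```python
import pandas as pd

def validate_clean(df: pd.DataFrame) -> list[str]:
    """Return the list of violated invariants; an empty list means the frame passes."""
    problems = []
    if df["customerid"].duplicated().any():
        problems.append("duplicate customer ids")
    if df[["age", "income", "credit_score"]].isnull().any().any():
        problems.append("nulls in imputed numeric columns")
    if not df["age"].between(18, 100).all():
        problems.append("age out of range")
    if not df["credit_score"].between(300, 850).all():
        problems.append("credit score out of range")
    if not df["default"].isin([0, 1]).all():
        problems.append("non-binary target")
    return problems

# Toy frame shaped like clean_dataset's output (hypothetical values)
clean = pd.DataFrame({
    "customerid": [1001, 1002],
    "age": [34.0, 51.0],
    "income": [45000.0, 120000.0],
    "credit_score": [680.0, 710.0],
    "default": pd.array([0, 1], dtype="Int64"),
})
print(validate_clean(clean))  # []
```

Running the validator after every rerun turns the documented cleaning rules into executable checks, which is what makes the pipeline trustworthy on data you have not inspected by hand.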
Tip
Practice each step of this mini project in small, isolated examples before integrating it into larger projects. Breaking the pipeline into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working version of this cleaning pipeline from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or unexpected data type). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with a cleaning pipeline like this one is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
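One concrete way to test those boundaries: isolate a single step, such as the income parsing in step 4 above, and push empty, all-null, and junk inputs through it. A sketch, where parse_income is a hypothetical helper mirroring that step, not a function defined in the pipeline:

```python
import pandas as pd
import numpy as np

def parse_income(s: pd.Series) -> pd.Series:
    """Strip '$' and thousands separators, then coerce to float (hypothetical helper)."""
    out = (s.astype(str).str.strip()
           .str.replace(r"[\$,]", "", regex=True)
           .replace({"": np.nan, "None": np.nan, "nan": np.nan}))
    return pd.to_numeric(out, errors="coerce")

# Edge case 1: empty input should come back empty, not raise
assert parse_income(pd.Series([], dtype=object)).empty

# Edge case 2: all-null input should stay all-null
assert parse_income(pd.Series([None, None])).isna().all()

# Edge case 3: mixed junk -- only the parseable value survives
mixed = parse_income(pd.Series(["$1,000", "", "oops", None]))
assert mixed.iloc[0] == 1000.0
assert mixed.isna().sum() == 3
```

Note the extra `"None"` and `"nan"` entries in the replace map: `astype(str)` turns real nulls into those literal strings, a boundary condition the full pipeline never hits only because its income column contains no true nulls.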