Cleaning Dirty Data — Real-World Messiness
Real datasets are never clean. Expect wrong data types, inconsistent formatting, trailing whitespace, mixed case, duplicate rows, impossible values (negative ages, 999% discounts), and outliers. Data cleaning is typically the most time-consuming part of any ML project, and also the most important: garbage in, garbage out.
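Before cleaning anything, it helps to profile the damage. A minimal sketch (using a small made-up frame, not the lesson's dataset): an `object` dtype on a column that should be numeric is usually the first red flag, and `pd.to_numeric` with `errors="coerce"` counts how many values would fail conversion.

```python
import pandas as pd

# Hypothetical mini-frame for quick profiling before any cleaning
df = pd.DataFrame({"age": [28, "unknown", 150],
                   "income": ["45,000", "N/A", "90000"]})

# Mixed int/str columns show up as dtype "object" -- numbers stored as strings
dtypes = df.dtypes.astype(str).to_dict()

# Count values per column that fail numeric conversion
bad_age = pd.to_numeric(df["age"], errors="coerce").isna().sum()
```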
Systematic Data Cleaning
import pandas as pd
import numpy as np
# SIMULATE DIRTY REAL-WORLD DATA
data = {
"customer_id": [101, 102, 103, 103, 104, 105, 106], # 103 duplicated!
"age": [28, -5, 35, 35, "unknown", 42, 150], # negative, string, impossible
"income": [" 45,000", "82000", "55000$", "55000$", "N/A", "75000", "90,000"],
"email": ["Alice@gmail.com", "bob@YAHOO.COM", " carol@hotmail.com ", None, "dave@gmail.com", "EVE@gmail.com", "frank@"],
"city": ["New York", "new york", "NEW YORK", "Los Angeles", "LA", "Chicago", " Chicago "],
"score": [750, 680, 720, 720, 600, None, 810],
}
df = pd.DataFrame(data)
print("RAW DIRTY DATA:")
print(df)
# STEP 1: FIX COLUMN NAMES
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]
# STEP 2: REMOVE DUPLICATES (key on customer_id — the two 103 rows differ in the
# email field, so full-row df.duplicated() would report 0 and miss them)
print(f"\nDuplicate IDs before: {df.duplicated(subset=['customer_id']).sum()}")
df = df.drop_duplicates(subset=["customer_id"], keep="first")
print(f"Duplicate IDs after: {df.duplicated(subset=['customer_id']).sum()}")
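The distinction matters enough to isolate. A small sketch (with a made-up three-row frame): `duplicated()` with no arguments only flags rows that are identical in every column, so two records for the same ID that differ in a single field slip through; `subset=` keys the check on the identifier instead.

```python
import pandas as pd

# Two records for the same id that differ in one field
df = pd.DataFrame({"customer_id": [103, 103, 104],
                   "email": ["carol@hotmail.com", None, "dave@gmail.com"]})

full_row_dupes = df.duplicated().sum()                    # rows are not identical
key_dupes = df.duplicated(subset=["customer_id"]).sum()   # same id appears twice

# keep="first" retains the first record seen for each id
deduped = df.drop_duplicates(subset=["customer_id"], keep="first")
```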
# STEP 3: CLEAN INCOME (strip symbols, convert to numeric)
df["income"] = (df["income"]
.astype(str)
.str.strip()
.str.replace(",", "", regex=False)
.str.replace("$", "", regex=False)
.str.replace("N/A", "", regex=False)
)
df["income"] = pd.to_numeric(df["income"], errors="coerce") # invalid -> NaN
# STEP 4: CLEAN AGE (handle impossible values)
df["age"] = pd.to_numeric(df["age"], errors="coerce") # "unknown" -> NaN
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = np.nan # invalid range -> NaN
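An equivalent, arguably more readable way to express the range check is `Series.between` combined with `where`, which keeps values only where the condition holds. A sketch on a standalone series (the 0-120 bounds are the same plausibility assumption as above):

```python
import pandas as pd

age = pd.to_numeric(pd.Series([28, -5, "unknown", 150]), errors="coerce")
# between(0, 120) is False for out-of-range values and for NaN;
# where() replaces everything outside the mask with NaN
age = age.where(age.between(0, 120))
```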
# STEP 5: STANDARDIZE TEXT FIELDS
df["email"] = df["email"].str.strip().str.lower()
df["city"] = df["city"].str.strip().str.lower()
# Normalize city aliases
city_map = {"ny": "new york", "la": "los angeles"}  # only true aliases needed: case and whitespace are already normalized
df["city"] = df["city"].replace(city_map)
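The ordering here is deliberate: strip/lower first, then map aliases. A keyed entry like `" Chicago "` or `"NEW YORK"` in the map would be dead weight, because those spellings no longer exist after normalization. A small standalone sketch:

```python
import pandas as pd

city = pd.Series(["New York", "new york", "LA", " Chicago "]).str.strip().str.lower()
# After normalizing case/whitespace, only genuine aliases remain to map
alias_map = {"ny": "new york", "la": "los angeles"}
city = city.replace(alias_map)
```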
# STEP 6: VALIDATE EMAIL FORMAT
import re
email_pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
df["email_valid"] = df["email"].fillna("").apply(lambda x: bool(re.match(email_pattern, x)))
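The `fillna("") + apply` approach works, but pandas also offers a vectorized alternative: `str.match` applies the same anchored-at-start semantics as `re.match`, and `na=False` maps missing emails straight to invalid without a fill step. A sketch:

```python
import pandas as pd

emails = pd.Series(["alice@gmail.com", "frank@", None])
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
# Vectorized equivalent of re.match; na=False treats NaN as not-a-match
valid = emails.str.match(pattern, na=False)
```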
print("\nCLEANED DATA:")
print(df)
print("\nMissing values after cleaning:")
print(df.isnull().sum())

Tip
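Cleaning deliberately converts bad values into NaN rather than guessing, which is why the missing-value count can go up. What to do with those NaNs is the next decision; one common, outlier-robust default for numeric columns is median imputation. A minimal sketch on a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"score": [750, 680, None, 810]})
# Median is robust to outliers, unlike the mean; a simple default for numeric gaps
df["score"] = df["score"].fillna(df["score"].median())
```

Whether to impute, drop, or flag missing values depends on the downstream model and how much data you can afford to lose.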
Practice cleaning dirty data in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working data-cleaning example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or an unparseable string). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake when cleaning data is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
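One lightweight way to enforce those boundary conditions is a block of post-cleaning assertions that fails loudly before bad data reaches a model. A sketch, assuming an already-cleaned frame with `age` and `income` columns:

```python
import pandas as pd

df = pd.DataFrame({"age": [28.0, 42.0], "income": [45000.0, 75000.0]})

# Post-cleaning sanity checks: crash early instead of training on bad data
assert df["age"].between(0, 120).all(), "age out of plausible range"
assert (df["income"] >= 0).all(), "negative income after cleaning"
checks_passed = True
```

In production pipelines the same idea scales up via dedicated validation tools, but plain assertions are a fine starting point.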