Mini Project: EDA Report on a Real Dataset
Build a complete, shareable EDA report on the California Housing dataset — a real-world regression problem used by researchers and practitioners. The report covers: dataset overview, distribution analysis, correlation study, geographic visualization, feature-target relationships, and a summary of ML-relevant findings that will guide preprocessing and model selection in subsequent modules.
Complete EDA on California Housing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# LOAD DATA
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseVal"] = housing.target # median house value in $100k
print("=== CALIFORNIA HOUSING EDA REPORT ===")
print(f"\nDataset: {df.shape[0]} census block groups, {df.shape[1] - 1} features + 1 target")
print("\n1. DATA OVERVIEW")
print(df.describe().round(2))
print("\n2. MISSING VALUES:", df.isnull().sum().sum(), "(none expected for this dataset)")
# KEY FEATURE DESCRIPTIONS
feature_desc = {
    "MedInc": "Median income (in $10,000s)",
    "HouseAge": "Median house age (years)",
    "AveRooms": "Average rooms per household",
    "AveBedrms": "Average bedrooms per household",
    "Population": "Block population",
    "AveOccup": "Average household occupants",
    "Latitude": "Geographic latitude",
    "Longitude": "Geographic longitude",
    "MedHouseVal": "TARGET: Median house value (in $100,000s)",
}
print("\n3. FEATURE DESCRIPTIONS:")
for feat, desc in feature_desc.items():
    print(f" {feat:12s}: {desc}")
# DISTRIBUTION ANALYSIS
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()
for i, col in enumerate(df.columns):
    axes[i].hist(df[col], bins=40, color="steelblue", edgecolor="white", alpha=0.8)
    axes[i].set_title(col)
    skew = df[col].skew()
    # A log transform only helps strong right (positive) skew, so hint only in that case
    axes[i].set_xlabel(f"skew={skew:.2f}" + (" (consider log)" if skew > 1 else ""))
plt.suptitle("Feature Distributions — California Housing", y=1.02, fontsize=14)
plt.tight_layout()
plt.savefig("housing_distributions.png", dpi=100, bbox_inches="tight")
plt.show()
# CORRELATION WITH TARGET
print("\n4. CORRELATION WITH TARGET (MedHouseVal):")
corr_target = df.corr()["MedHouseVal"].drop("MedHouseVal").sort_values(ascending=False)
for feat, corr in corr_target.items():
    bar = ("+" if corr > 0 else "-") * int(abs(corr) * 20)
    print(f" {feat:12s}: {corr:+.3f} {bar}")
# GEOGRAPHIC VISUALIZATION
fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(
    df["Longitude"], df["Latitude"],
    c=df["MedHouseVal"], cmap="RdYlGn",
    s=df["Population"] / 200, alpha=0.5,
)
plt.colorbar(scatter, ax=ax, label="Median House Value ($100k)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("California Housing: Price & Population by Location\n(dot size = population)")
plt.tight_layout()
plt.savefig("housing_map.png", dpi=100, bbox_inches="tight")
plt.show()
# EDA CONCLUSIONS -- what to do in preprocessing
print("\n5. EDA CONCLUSIONS (for preprocessing):")
conclusions = [
    "MedInc: highest correlation (+0.69) with target -- strongest single linear predictor",
    "Population and AveOccup: right-skewed, consider log transform",
    "AveRooms/AveBedrms: some very large values (outliers -- gated communities?)",
    "Latitude/Longitude: geographic features, consider interaction term",
    "No missing values -- skip imputation",
    "All features numeric -- no encoding needed",
    "Target capped at 5.0 (500k) -- be aware of truncation",
]
for c in conclusions:
    print(f" -> {c}")
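The conclusions in the listing are hard-coded strings. Two of them (right skew that a log transform might reduce, and a capped target) can be checked numerically before acting on them. The following is a minimal sketch, not part of the original report: the helper names `skew_before_after_log` and `share_at_max` are hypothetical, and the demo data is synthetic so that the block runs without downloading the dataset.

```python
import numpy as np
import pandas as pd


def skew_before_after_log(frame, columns):
    """Return raw and log1p skewness for each named column."""
    rows = []
    for col in columns:
        values = frame[col]
        if (values < 0).any():
            raise ValueError(f"{col} has negative values; log1p does not apply")
        rows.append({
            "feature": col,
            "skew_raw": values.skew(),
            "skew_log1p": np.log1p(values).skew(),
        })
    return pd.DataFrame(rows).set_index("feature")


def share_at_max(series):
    """Fraction of rows sitting exactly at the column maximum (a sign of capping)."""
    return float((series == series.max()).mean())


# Synthetic stand-in data (NOT the housing data), so the sketch runs offline.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Population": rng.lognormal(mean=7.0, sigma=1.0, size=5000),
    "MedHouseVal": np.minimum(rng.lognormal(mean=0.6, sigma=0.6, size=5000), 5.0),
})

skew_report = skew_before_after_log(demo, ["Population"])
capped_share = share_at_max(demo["MedHouseVal"])
print(skew_report.round(2))
print(f"Share of rows at the target maximum: {capped_share:.1%}")
```

With the `df` built in the listing, the equivalent calls would be `skew_before_after_log(df, ["Population", "AveOccup"])` and `share_at_max(df["MedHouseVal"])`. A noticeable share of rows at the maximum suggests the target was clipped, which matters when choosing a model and an error metric.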
Tip
Practice each step of this EDA report in small, isolated examples before combining them into a larger project. Small experiments build understanding faster than reading alone.
Practice Task
(1) Write a working EDA report like the one above from scratch, without looking at your notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
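As a starting point for step (2), here is a hedged sketch of one way to guard an EDA script against empty input, null values, and non-numeric columns. The function name `validate_for_eda` and the toy frame are hypothetical and only illustrate the idea; they are not part of the original report.

```python
import pandas as pd


def validate_for_eda(frame):
    """Check a DataFrame before running EDA on it; return a dict of findings.

    Raises ValueError for empty input, since no summary or plot makes sense then.
    """
    if frame is None or frame.empty:
        raise ValueError("EDA input is empty: no rows or no columns to analyse")

    null_counts = frame.isnull().sum()
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    return {
        "n_rows": len(frame),
        "columns_with_nulls": null_counts[null_counts > 0].to_dict(),
        "non_numeric_columns": non_numeric,
    }


# Hypothetical toy frame with one null value and one text column
toy = pd.DataFrame({"MedInc": [3.2, None, 8.1], "Region": ["N", "S", "S"]})
findings = validate_for_eda(toy)
print(findings)

try:
    validate_for_eda(pd.DataFrame())
except ValueError as err:
    print("Rejected:", err)
```

Calling a check like this right after loading the data lets the report fail early with a clear message, rather than producing empty histograms or NaN correlations later on.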
Common Mistake
A common mistake in an EDA report is skipping edge-case checks: empty inputs, null values, and unexpected data types. Always validate these boundary conditions so the analysis code stays robust when it is reused in production ML work.