Categorical Feature Analysis
Categorical features need their own EDA: count plots for the distribution, grouped bar charts for the relationship with the target, a chi-square test to quantify statistical association, and a cardinality check (the number of unique values, which drives the encoding choice). A categorical feature with 1,000 unique values (city names) needs different treatment than one with 3 values (education level).
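To see why cardinality matters before diving in, here is a minimal, self-contained sketch (the column names and data are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Low-cardinality column: one-hot encoding is cheap (one column per level)
edu = pd.Series(rng.choice(["high_school", "bachelor", "master"], n), name="education")
# High-cardinality column: one-hot encoding explodes the feature space
city = pd.Series([f"city_{i}" for i in rng.integers(0, 1000, n)], name="city")

print(pd.get_dummies(edu).shape)   # (1000, 3)
print(pd.get_dummies(city).shape)  # one column per distinct city -- hundreds of columns
```

Hundreds of near-empty indicator columns is why high-cardinality features usually get target encoding or hashing instead of one-hot.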
Categorical Feature EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
np.random.seed(42)
n = 2000
df = pd.DataFrame({
"income": np.random.normal(60000, 20000, n).clip(15000, 200000),
"education": np.random.choice(["high_school", "bachelor", "master", "phd"], n, p=[0.3, 0.4, 0.2, 0.1]),
"employment": np.random.choice(["full-time", "part-time", "unemployed", "self-employed"], n, p=[0.6, 0.15, 0.1, 0.15]),
"region": np.random.choice(["north", "south", "east", "west"], n),
"default": np.random.choice([0, 1], n, p=[0.82, 0.18]),
})
# Make defaults more likely for unemployed and high_school
df.loc[df["employment"] == "unemployed", "default"] = np.random.choice([0, 1], (df["employment"] == "unemployed").sum(), p=[0.5, 0.5])
df.loc[df["education"] == "high_school", "default"] = np.random.choice([0, 1], (df["education"] == "high_school").sum(), p=[0.65, 0.35])
# STEP 1: CARDINALITY CHECK
print("Categorical feature cardinality:")
cat_cols = ["education", "employment", "region"]
for col in cat_cols:
    n_unique = df[col].nunique()
    print(f" {col:15s}: {n_unique} unique values -> ", end="")
    if n_unique <= 5:
        print("Low cardinality -> OneHotEncode")
    elif n_unique <= 20:
        print("Medium cardinality -> OneHot or OrdinalEncode")
    else:
        print("High cardinality -> TargetEncode or hash")
# STEP 2: DISTRIBUTION + TARGET RELATIONSHIP
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for i, col in enumerate(cat_cols):
    # Top row: overall distribution (value_counts already sorts by frequency)
    df[col].value_counts().plot(kind="bar", ax=axes[0, i], rot=30, color="steelblue")
    axes[0, i].set_title(f"Distribution: {col}")
    # Bottom row: default rate by category
    default_rate = df.groupby(col)["default"].mean().sort_values(ascending=False)
    colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(default_rate)))
    default_rate.plot(kind="bar", ax=axes[1, i], rot=30, color=colors)
    axes[1, i].set_title(f"Default Rate by {col}")
    axes[1, i].set_ylabel("Default Rate")
    axes[1, i].axhline(df["default"].mean(), color="red", linestyle="--", linewidth=1, label="overall avg")
    axes[1, i].legend(fontsize=8)
plt.tight_layout()
plt.savefig("categorical_eda.png", dpi=100, bbox_inches="tight")
plt.show()
# STEP 3: CHI-SQUARE TEST -- is the association statistically significant?
print("\nChi-square test: is categorical feature associated with target?")
for col in cat_cols:
    contingency = pd.crosstab(df[col], df["default"])
    chi2, p_value, dof, _ = chi2_contingency(contingency)
    significance = "SIGNIFICANT" if p_value < 0.05 else "not significant"
    print(f" {col:15s}: chi2={chi2:.1f}, p={p_value:.4f} -> {significance}")
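A chi-square p-value only says an association exists; with 2,000 rows almost any real association comes out significant. A common follow-up (not part of the script above) is Cramér's V, which rescales the chi-square statistic into a 0-to-1 effect size. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Effect size for a chi-square association, scaled to [0, 1]."""
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.to_numpy().sum()
    r, k = contingency.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Toy example: one pair that is perfectly associated, one that is pure noise
rng = np.random.default_rng(0)
group = pd.Series(rng.choice(["a", "b"], 1000))
related = group.map({"a": 0, "b": 1})        # fully determined by group -> V near 1
noise = pd.Series(rng.choice([0, 1], 1000))  # independent of group -> V near 0

print(f"related: {cramers_v(group, related):.2f}")
print(f"noise:   {cramers_v(group, noise):.2f}")
```

Reporting V alongside the p-value separates "statistically detectable" from "practically meaningful".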
Tip
Practice Categorical Feature Analysis in small, isolated examples before integrating into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Good feature engineering is often the single biggest driver of ML success.
Practice Task
(1) Write a working example of Categorical Feature Analysis from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake with Categorical Feature Analysis is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
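As a concrete illustration of that advice, here is one way to make a per-category statistic robust to empty frames and nulls (the helper name and the "(missing)" label are made up for this sketch):

```python
import pandas as pd

def default_rate_by_category(df: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Per-category target mean that survives empty frames and null values."""
    if df.empty or col not in df.columns:
        return pd.Series(dtype=float)  # empty input -> empty result, not a crash
    # Surface nulls as an explicit category instead of silently dropping rows:
    # a plain groupby(col) would exclude them from the rates entirely.
    filled = df[col].fillna("(missing)")
    return df.groupby(filled)[target].mean()

df = pd.DataFrame({
    "education": ["bachelor", None, "master", "bachelor"],
    "default": [0, 1, 0, 1],
})
print(default_rate_by_category(df, "education", "default"))
print(default_rate_by_category(pd.DataFrame(), "education", "default"))  # empty Series
```

The null-row check matters because `groupby` drops NaN keys by default, which can hide exactly the segment with the most unusual target rate.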