Categorical Feature Analysis
Categorical features need their own EDA: count plots for the distribution, grouped bar charts for the relationship with the target, a chi-square test to quantify statistical association, and a cardinality check (the number of unique values, which drives the encoding choice). A categorical feature with 1,000 unique values (city names) needs different treatment than one with 3 values (education level).
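To see why cardinality matters before diving in, here is a minimal, self-contained sketch (the column names and data are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Low-cardinality column: one-hot encoding is cheap (one column per level)
edu = pd.Series(rng.choice(["high_school", "bachelor", "master"], n), name="education")
# High-cardinality column: one-hot encoding explodes the feature space
city = pd.Series([f"city_{i}" for i in rng.integers(0, 1000, n)], name="city")

print(pd.get_dummies(edu).shape)   # (1000, 3)
print(pd.get_dummies(city).shape)  # one column per distinct city -- hundreds of columns
```

Hundreds of near-empty indicator columns is why high-cardinality features usually get target encoding or hashing instead of one-hot.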
Categorical Feature EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
np.random.seed(42)
n = 2000
df = pd.DataFrame({
"income": np.random.normal(60000, 20000, n).clip(15000, 200000),
"education": np.random.choice(["high_school", "bachelor", "master", "phd"], n, p=[0.3, 0.4, 0.2, 0.1]),
"employment": np.random.choice(["full-time", "part-time", "unemployed", "self-employed"], n, p=[0.6, 0.15, 0.1, 0.15]),
"region": np.random.choice(["north", "south", "east", "west"], n),
"default": np.random.choice([0, 1], n, p=[0.82, 0.18]),
})
# Make defaults more likely for unemployed and high_school
df.loc[df["employment"] == "unemployed", "default"] = np.random.choice([0, 1], (df["employment"] == "unemployed").sum(), p=[0.5, 0.5])
df.loc[df["education"] == "high_school", "default"] = np.random.choice([0, 1], (df["education"] == "high_school").sum(), p=[0.65, 0.35])
# STEP 1: CARDINALITY CHECK
print("Categorical feature cardinality:")
cat_cols = ["education", "employment", "region"]
for col in cat_cols:
    n_unique = df[col].nunique()
    print(f" {col:15s}: {n_unique} unique values -> ", end="")
    if n_unique <= 5:
        print("Low cardinality -> OneHotEncode")
    elif n_unique <= 20:
        print("Medium cardinality -> OneHot or OrdinalEncode")
    else:
        print("High cardinality -> TargetEncode or hash")
# STEP 2: DISTRIBUTION + TARGET RELATIONSHIP
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for i, col in enumerate(cat_cols):
    # Top row: overall distribution (value_counts already sorts by frequency)
    df[col].value_counts().plot(kind="bar", ax=axes[0, i], rot=30, color="steelblue")
    axes[0, i].set_title(f"Distribution: {col}")
    # Bottom row: default rate by category
    default_rate = df.groupby(col)["default"].mean().sort_values(ascending=False)
    colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(default_rate)))
    default_rate.plot(kind="bar", ax=axes[1, i], rot=30, color=colors)
    axes[1, i].set_title(f"Default Rate by {col}")
    axes[1, i].set_ylabel("Default Rate")
    axes[1, i].axhline(df["default"].mean(), color="red", linestyle="--", linewidth=1, label="overall avg")
    axes[1, i].legend(fontsize=8)
plt.tight_layout()
plt.savefig("categorical_eda.png", dpi=100, bbox_inches="tight")
plt.show()
# STEP 3: CHI-SQUARE TEST -- is the association statistically significant?
print("\nChi-square test: is categorical feature associated with target?")
for col in cat_cols:
    contingency = pd.crosstab(df[col], df["default"])
    chi2, p_value, dof, _ = chi2_contingency(contingency)
    significance = "SIGNIFICANT" if p_value < 0.05 else "not significant"
    print(f" {col:15s}: chi2={chi2:.1f}, p={p_value:.4f} -> {significance}")
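A chi-square p-value only says an association exists; with 2,000 rows almost any real association comes out significant. A common follow-up (not part of the script above) is Cramér's V, which rescales the chi-square statistic into a 0-to-1 effect size. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Effect size for a chi-square association, scaled to [0, 1]."""
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.to_numpy().sum()
    r, k = contingency.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Toy example: one pair that is perfectly associated, one that is pure noise
rng = np.random.default_rng(0)
group = pd.Series(rng.choice(["a", "b"], 1000))
related = group.map({"a": 0, "b": 1})        # fully determined by group -> V near 1
noise = pd.Series(rng.choice([0, 1], 1000))  # independent of group -> V near 0

print(f"related: {cramers_v(group, related):.2f}")
print(f"noise:   {cramers_v(group, noise):.2f}")
```

Reporting V alongside the p-value separates "statistically detectable" from "practically meaningful".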
Tip
Practice Categorical Feature Analysis in small, isolated examples before integrating into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Good feature engineering is often the single biggest driver of ML success.
Practice Task
(1) Write a working example of Categorical Feature Analysis from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake with Categorical Feature Analysis is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
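As a concrete illustration of that advice, here is one way to make a per-category statistic robust to empty frames and nulls (the helper name and the "(missing)" label are made up for this sketch):

```python
import pandas as pd

def default_rate_by_category(df: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Per-category target mean that survives empty frames and null values."""
    if df.empty or col not in df.columns:
        return pd.Series(dtype=float)  # empty input -> empty result, not a crash
    # Surface nulls as an explicit category instead of silently dropping rows:
    # a plain groupby(col) would exclude them from the rates entirely.
    filled = df[col].fillna("(missing)")
    return df.groupby(filled)[target].mean()

df = pd.DataFrame({
    "education": ["bachelor", None, "master", "bachelor"],
    "default": [0, 1, 0, 1],
})
print(default_rate_by_category(df, "education", "default"))
print(default_rate_by_category(pd.DataFrame(), "education", "default"))  # empty Series
```

The null-row check matters because `groupby` drops NaN keys by default, which can hide exactly the segment with the most unusual target rate.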