Target Encoding & High-Cardinality Categoricals
High-cardinality categorical features (city, product_id, user_id) are a poor fit for one-hot encoding, which would explode them into hundreds or thousands of sparse columns. Target encoding replaces each category with the mean target value for that category, capturing the predictive signal in a single column. The critical risk is target leakage: encoded naively, the feature carries each row's own target information and validation scores become inflated. Use cross-fold target encoding to prevent this.
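As a minimal illustration of the core idea (a toy frame, separate from the credit-default example below), each category is simply mapped to its mean target:
import pandas as pd
toy = pd.DataFrame({"city": ["a", "a", "b", "b", "b"], "bought": [1, 0, 0, 0, 1]})
toy["city_te"] = toy["city"].map(toy.groupby("city")["bought"].mean())
print(toy)  # "a" -> 0.50, "b" -> 0.33 -- this is the naive (leaky) form; the fix follows below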
Target Encoding with Cross-Validation Anti-Leakage
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
np.random.seed(42)
N = 3000
# HIGH-CARDINALITY: 200 cities
cities = [f"city_{i}" for i in range(200)]
df = pd.DataFrame({
"city": np.random.choice(cities, N),
"age": np.random.normal(38, 12, N).clip(18, 75),
"income": np.random.exponential(55000, N).clip(15000, 200000),
"default": np.random.choice([0, 1], N, p=[0.83, 0.17]),
})
# Make some cities genuinely high-risk
high_risk_cities = cities[:30] # first 30 cities have higher default rates
df.loc[df["city"].isin(high_risk_cities), "default"] = np.random.choice([0, 1], (df["city"].isin(high_risk_cities)).sum(), p=[0.55, 0.45])
print(f"Dataset: {df.shape} | {df['city'].nunique()} unique cities | Default rate: {df['default'].mean():.1%}")
X = df.drop("default", axis=1)
y = df["default"]
# METHOD 1: ONE-HOT (fails for high cardinality)
# Would create 200 binary columns -- sparse, slow, bad for many models
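# A quick check of the blow-up (illustration only; not needed for the rest of the script):
ohe = OneHotEncoder(handle_unknown="ignore")
print(f"One-hot shape: {ohe.fit_transform(df[['city']]).shape}")  # ~ (3000, 200) sparse matrix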
# METHOD 2: NAIVE TARGET ENCODING (causes leakage!)
global_mean = y.mean()
city_target = y.groupby(df["city"]).mean() # computed on ALL data -- LEAKY!
df["city_naive_te"] = df["city"].map(city_target)
X_naive = df[["age", "income", "city_naive_te"]]
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
naive_auc = cross_val_score(model, X_naive, y, cv=5, scoring="roc_auc").mean()
print(f"\nNaive target encoding AUC: {naive_auc:.4f} (SUSPICIOUS -- likely leaked!)")
# METHOD 3: CROSS-FOLD TARGET ENCODING (correct!)
def cross_fold_target_encode(X: pd.DataFrame, y: pd.Series, col: str, n_folds: int = 5, smoothing: float = 10.0) -> pd.Series:
"""Target encode with cross-fold to prevent leakage."""
global_mean = y.mean()
encoded = pd.Series(index=X.index, dtype=float)
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X):
X_tr, y_tr = X.iloc[train_idx], y.iloc[train_idx]
X_val = X.iloc[val_idx]
# Compute target means from training fold only
fold_means = y_tr.groupby(X_tr[col]).mean()
fold_counts = y_tr.groupby(X_tr[col]).count()
# Smoothing: blend category mean with global mean (prevents overfitting on rare categories)
smoothed = (fold_means * fold_counts + global_mean * smoothing) / (fold_counts + smoothing)
# Encode validation fold using training fold statistics
encoded.iloc[val_idx] = X_val[col].map(smoothed).fillna(global_mean)
return encoded
df["city_cv_te"] = cross_fold_target_encode(df, y, "city", n_folds=5, smoothing=10)
X_cv_te = df[["age", "income", "city_cv_te"]]
cv_te_auc = cross_val_score(model, X_cv_te, y, cv=5, scoring="roc_auc").mean()
print(f"CV target encoding AUC: {cv_te_auc:.4f} (honest estimate)")
# ALSO: sklearn's TargetEncoder (sklearn >= 1.3)
from sklearn.preprocessing import TargetEncoder
te = TargetEncoder(target_type="binary", cv=5, smooth="auto")
X_sklearn_te = df[["age", "income"]].copy()
X_sklearn_te["city"] = df["city"]
X_sklearn_te["city_te"] = te.fit_transform(X_sklearn_te[["city"]], y)[:, 0]
X_for_model = X_sklearn_te[["age", "income", "city_te"]]
sk_auc = cross_val_score(model, X_for_model, y, cv=5, scoring="roc_auc").mean()
print(f"sklearn TargetEncoder AUC: {sk_auc:.4f} (uses CV internally)")
print("\nSmoothing in target encoding:")
print(" category has 3 samples -> blend heavily with global mean (avoid overfitting)")
print(" category has 500 samples -> mostly use category mean (reliable estimate)")Tip