Categorical Encoding — OneHot, Ordinal, Target
ML models require numeric input, but most real-world data contains categorical features. The wrong encoding can distort what the model learns. One-hot encoding creates binary dummy variables and is the right choice for nominal categories, which have no order. Ordinal encoding maps categories to ordered integers and is correct only when a real order exists. Target encoding replaces each category with the mean target value observed for it; it is powerful for high-cardinality features but must be fitted with careful cross-validation to avoid target leakage.
Encoding Strategies for Categorical Features
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    "education": np.random.choice(["high_school", "bachelor", "master", "phd"], n, p=[0.3, 0.4, 0.2, 0.1]),
    "city": np.random.choice(["New York", "Chicago", "LA", "Houston", "Phoenix"], n),
    "size": np.random.choice(["small", "medium", "large"], n),
    "salary": np.random.normal(65000, 20000, n).clip(20000, 200000),
})
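# Copy the splits so encoded columns can be added later without SettingWithCopyWarning
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
X_train, X_test = train_df.copy(), test_df.copy()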
# 1. LABEL ENCODING -- use ONLY for binary features or when passing to tree models
# Warning: for multiclass nominal features it implies a false ordering.
# LabelEncoder sorts classes alphabetically, so Chicago=0 < Houston=1 < LA=2 means nothing.
# Note: scikit-learn documents LabelEncoder as an encoder for the target y, not for input features.
le = LabelEncoder()
train_city_label = le.fit_transform(X_train["city"])  # integers 0-4, but the ordering is meaningless
print("LabelEncoded city (first 5):", train_city_label[:5])
print(" WARNING: Only use for binary features or tree-based models")
# 2. ONE-HOT ENCODING -- correct for nominal categories (no order)
# drop="first": avoids the dummy variable trap (multicollinearity in linear models)
# sparse_output requires scikit-learn >= 1.2 (older releases call the parameter sparse)
# With drop="first", an unknown test category encodes as all zeros, the same as the dropped category
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore", drop="first")
X_train_ohe = ohe.fit_transform(X_train[["city", "education"]])
X_test_ohe = ohe.transform(X_test[["city", "education"]])  # transform only, never refit on test data
print("\nOneHot feature names:", ohe.get_feature_names_out().tolist())
print(f"Shape: {X_train[['city', 'education']].shape} -> {X_train_ohe.shape} (2 categorical columns -> OHE columns)")
print("\nOneHot example (first row):", X_train_ohe[0].round(1).tolist())
# 3. ORDINAL ENCODING -- correct for ordered categories
# List the categories explicitly, lowest to highest; the default order is alphabetical
ordinal_enc = OrdinalEncoder(categories=[["high_school", "bachelor", "master", "phd"]])
# fit_transform returns an (n, 1) array; ravel() flattens it to fit a single column
X_train["education_ord"] = ordinal_enc.fit_transform(X_train[["education"]]).ravel()
X_test["education_ord"] = ordinal_enc.transform(X_test[["education"]]).ravel()
print("\nOrdinal encoded education (first 5):", X_train["education_ord"].values[:5])
# 4. TARGET ENCODING -- replace category with mean target value
# Category means must come from training data only, never from test rows.
# Caveat: this simple version maps each training row to a mean that includes its own target,
# so a real pipeline should compute the training encodings out-of-fold (cross-fitting).
print("\nTarget encoding (manual implementation with train-only fit):")
target_means = X_train.groupby("city")["salary"].mean()
X_train["city_target_enc"] = X_train["city"].map(target_means)
# For test: map using train city means (unknown cities -> global mean)
global_mean = X_train["salary"].mean()
X_test["city_target_enc"] = X_test["city"].map(target_means).fillna(global_mean)
print(target_means.round(0))
print(" Use category_encoders or sklearn TargetEncoder (scikit-learn >= 1.3) for production")
# CARDINALITY AND ENCODING DECISION
print("\nEncoding Decision Guide:")
decisions = {
    "Binary (yes/no)": "LabelEncoder or map {'yes':1, 'no':0}",
    "Nominal <= 10 categories": "OneHotEncoder (most linear and tree models)",
    "Ordinal (ordered)": "OrdinalEncoder with explicit category order",
    "High cardinality (>15)": "TargetEncoder or hash encoding (with CV to prevent leakage)",
    "Very high card (cities)": "TargetEncoder in Pipeline with cross-validation",
}
for case, solution in decisions.items():
    print(f" {case:30s}: {solution}")
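The manual target encoding in the script above fits category means on the full training set and maps them back onto the same rows, so each training row's encoding includes its own target. One common way to reduce that leakage is out-of-fold encoding, where every row is encoded using means computed on the other folds. The following is a minimal sketch rather than a production implementation: the helper name oof_target_encode, the toy DataFrame, and the choice of 5 folds are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train, col, target, n_splits=5, seed=42):
    """Out-of-fold mean target encoding for one categorical column.

    Hypothetical helper: each row is encoded with category means computed
    on the other folds, so its own target never feeds its encoded value.
    """
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        fit_part = train.iloc[fit_idx]
        fold_means = fit_part.groupby(col)[target].mean()
        fold_global = fit_part[target].mean()  # fallback for categories absent from this fold
        values = train.iloc[enc_idx][col].map(fold_means).fillna(fold_global)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Hypothetical toy data, similar in spirit to the lesson's example.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "city": rng.choice(["New York", "Chicago", "LA"], 300),
    "salary": rng.normal(65000, 20000, 300),
})

# Out-of-fold encoding: the value depends on the fold a row falls in.
toy["city_oof"] = oof_target_encode(toy, "city", "salary")

# Naive encoding for comparison: full-train means mapped onto the same rows.
toy["city_naive"] = toy["city"].map(toy.groupby("city")["salary"].mean())

print(toy[["city", "city_naive", "city_oof"]].head())
print("distinct naive values:", toy["city_naive"].nunique())
print("distinct out-of-fold values:", toy["city_oof"].nunique())
```

The naive column holds one value per city, while the out-of-fold column holds up to one value per city per fold. Test rows would still be encoded with means fitted on the whole training set, as in the script above.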
Tip
Practice one-hot, ordinal, and target encoding in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Note
Practice Task: (1) Write a working example of one-hot, ordinal, and target encoding from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
Warning
A common mistake with categorical encoding is skipping edge-case testing: empty inputs, null values, unseen categories, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.