ColumnTransformer — Apply Different Preprocessing per Column
Real datasets mix numeric and categorical features that need different preprocessing. ColumnTransformer applies a different transformer to each column group (for example, StandardScaler to numeric columns and OneHotEncoder to categorical ones) and concatenates the outputs into a single feature matrix. This is the standard way to preprocess mixed-type data in scikit-learn.
ColumnTransformer for Mixed-Type Data
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
np.random.seed(42)
n = 800
df = pd.DataFrame({
    "age": np.random.normal(38, 12, n).clip(18, 70),
    "income": np.random.exponential(55000, n).clip(15000, 200000),
    "credit": np.random.normal(680, 80, n).clip(300, 850),
    "education": np.random.choice(["high_school", "bachelor", "master", "phd"], n, p=[0.3, 0.4, 0.2, 0.1]),
    "employment": np.random.choice(["full-time", "part-time", "unemployed"], n, p=[0.65, 0.2, 0.15]),
    "default": np.random.choice([0, 1], n, p=[0.82, 0.18]),
})
# Inject some missing values
df.loc[df.sample(50, random_state=1).index, "income"] = np.nan
df.loc[df.sample(30, random_state=2).index, "credit"] = np.nan
df.loc[df.sample(20, random_state=3).index, "education"] = np.nan
X = df.drop("default", axis=1)
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# DEFINE COLUMN GROUPS
numeric_features = ["age", "income", "credit"]
categorical_features = ["education", "employment"]
# BUILD PREPROCESSING PIPELINES FOR EACH GROUP
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing with median
    ("scaler", RobustScaler()),                     # scale (robust to outliers)
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill missing with mode
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")),
])
# COMBINE WITH COLUMNTRANSFORMER
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ],
    remainder="drop",                # drop any unlisted columns (safety net)
    verbose_feature_names_out=True,  # prefix output names with transformer name
)
# FIT ON TRAINING DATA ONLY -- transform both
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test) # no fit!
# SEE THE OUTPUT
feature_names = preprocessor.get_feature_names_out()
print("Output features:", feature_names.tolist())
print(f"\nInput shape: {X_train.shape}")
print(f"Output shape: {X_train_processed.shape}")
print(f"\nFirst row (processed):")
for name, val in zip(feature_names, X_train_processed[0]):
    print(f"  {name:40s}: {val:.3f}")
# USE MAKE_COLUMN_SELECTOR -- automatically detect by dtype
auto_preprocessor = ColumnTransformer([
    ("num", StandardScaler(),
     make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include=object)),
])
print("\nauto_preprocessor uses dtype detection -- works for any dataset with consistent types")
Tip
Practice ColumnTransformer in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working ColumnTransformer example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with ColumnTransformer is skipping edge-case testing: empty inputs, null values, unexpected data types, and categories that appear at prediction time but were never seen during fit. Always validate boundary conditions to write robust, production-ready ML code.