Building a Full Preprocessing Pipeline
A scikit-learn Pipeline chains preprocessing and model training into a single object. This prevents data leakage (every fit uses only training data, including inside each cross-validation fold), enables cross-validation of the full pipeline, and turns deployment into a single-step operation. The Pipeline is the single most important scikit-learn pattern for production ML.
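The leakage point is worth seeing concretely. Below is a minimal sketch (on synthetic data, not the loan dataset used later) contrasting the leaky pattern of scaling before cross-validation with the pipeline pattern, which refits the scaler on each training fold:

```python
# Minimal sketch of why pipelines prevent leakage during cross-validation.
# Scaling the full dataset before CV lets validation-fold statistics leak
# into training; wrapping the scaler in a Pipeline refits it per fold.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# LEAKY: the scaler sees all 200 rows, including future validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky_auc = cross_val_score(LogisticRegression(), X_leaky, y, cv=5, scoring="roc_auc").mean()

# SAFE: the pipeline refits the scaler on each training fold only
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
safe_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(f"leaky CV AUC: {leaky_auc:.3f}  pipeline CV AUC: {safe_auc:.3f}")
```

The gap is small with plain scaling, but the same leaky pattern with feature selection or target encoding can inflate CV scores badly.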
End-to-End Pipeline: Preprocessing + Model
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    "age": np.random.normal(38, 12, n).clip(18, 70),
    "income": np.random.exponential(55000, n).clip(15000, 200000),
    "credit": np.random.normal(680, 80, n).clip(300, 850),
    "education": np.random.choice(["high_school", "bachelor", "master", "phd"], n, p=[0.3, 0.4, 0.2, 0.1]),
    "employment": np.random.choice(["full-time", "part-time", "unemployed"], n, p=[0.65, 0.2, 0.15]),
    "default": np.random.choice([0, 1], n, p=[0.82, 0.18]),
})
df.loc[df.sample(60, random_state=1).index, "income"] = np.nan
df.loc[df.sample(30, random_state=2).index, "education"] = np.nan
X = df.drop("default", axis=1)
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# PREPROCESSOR (same as previous topic)
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["age", "income", "credit"]),
    ("cat", categorical_pipeline, ["education", "employment"]),
])
# FULL PIPELINE: PREPROCESSOR + MODEL
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)),
])
# TRAIN -- one call handles all preprocessing + training
full_pipeline.fit(X_train, y_train)
# CROSS-VALIDATE on raw (unprocessed) X_train -- preprocessing is refit inside each fold, so no leakage
cv_scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# PREDICT on raw test data -- pipeline preprocesses automatically
y_pred = full_pipeline.predict(X_test)
print("\nTest set results:")
print(classification_report(y_test, y_pred, target_names=["No Default", "Default"]))
# SWAP MODELS EASILY -- same pipeline, different final step
lr_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)),
])
lr_pipeline.fit(X_train, y_train)
print(f"\nLogistic Regression Pipeline CV AUC: {cross_val_score(lr_pipeline, X_train, y_train, cv=5, scoring='roc_auc').mean():.3f}")
# SAVE THE ENTIRE PIPELINE
joblib.dump(full_pipeline, "default_prediction_pipeline.joblib")
# LOAD AND PREDICT ON NEW DATA (in production)
loaded_pipeline = joblib.load("default_prediction_pipeline.joblib")
new_customer = pd.DataFrame([{
    "age": 32, "income": 55000, "credit": 680,
    "education": "bachelor", "employment": "full-time",
}])
prob_default = loaded_pipeline.predict_proba(new_customer)[0, 1]
print(f"\nNew customer default probability: {prob_default:.1%}")
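Because the model is just another pipeline step, hyperparameters can also be tuned through the pipeline using the step__param double-underscore naming convention. A self-contained sketch on synthetic data (the grid values here are illustrative, not tuned recommendations):

```python
# Sketch: grid search over a pipeline; double underscores route each
# parameter to the named step ("classifier" here).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)  # preprocessing is refit inside every fold of every candidate
print(search.best_params_, round(search.best_score_, 3))
```

Preprocessing parameters can be tuned the same way, e.g. a key like "preprocessor__num__imputer__strategy" against the full pipeline defined above.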
Tip
Practice building a full preprocessing pipeline in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
80% of ML work is data preparation — garbage in = garbage out
Note
Practice Task: (1) Write a working preprocessing pipeline from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake when building a full preprocessing pipeline is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
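The pipeline above already guards against two of these edge cases: the imputers absorb missing values and OneHotEncoder(handle_unknown="ignore") encodes unseen categories as an all-zero row. A self-contained probe (simplified columns, not the full loan schema) showing both at prediction time:

```python
# Sketch: probe a fitted pipeline with edge-case input -- an unseen
# category plus a missing numeric value. SimpleImputer fills the NaN and
# handle_unknown="ignore" one-hot encodes the unknown level as all zeros,
# so prediction still succeeds instead of raising.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({
    "income": [30000.0, 60000.0, 45000.0, 80000.0],
    "employment": ["full-time", "part-time", "full-time", "part-time"],
})
y = [1, 0, 1, 0]
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["employment"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())]).fit(X, y)

# NaN income and an employment level never seen in training
edge = pd.DataFrame({"income": [np.nan], "employment": ["contractor"]})
print(pipe.predict_proba(edge))  # runs without error
```

Writing a tiny probe like this against the saved pipeline is a cheap regression test before deployment.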