Building a Full Preprocessing Pipeline
A scikit-learn Pipeline chains preprocessing and model training into a single object. This prevents data leakage (every fit uses only training data, including inside each cross-validation fold), enables cross-validation of the full pipeline, and turns deployment into a single-step operation. The Pipeline is the single most important scikit-learn pattern for production ML.
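The leakage point is worth seeing concretely. Below is a minimal sketch (on synthetic data, not the loan dataset used later) contrasting the leaky pattern of scaling before cross-validation with the pipeline pattern, which refits the scaler on each training fold:

```python
# Minimal sketch of why pipelines prevent leakage during cross-validation.
# Scaling the full dataset before CV lets validation-fold statistics leak
# into training; wrapping the scaler in a Pipeline refits it per fold.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# LEAKY: the scaler sees all 200 rows, including future validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky_auc = cross_val_score(LogisticRegression(), X_leaky, y, cv=5, scoring="roc_auc").mean()

# SAFE: the pipeline refits the scaler on each training fold only
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
safe_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(f"leaky CV AUC: {leaky_auc:.3f}  pipeline CV AUC: {safe_auc:.3f}")
```

The gap is small with plain scaling, but the same leaky pattern with feature selection or target encoding can inflate CV scores badly.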
End-to-End Pipeline: Preprocessing + Model
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    "age": np.random.normal(38, 12, n).clip(18, 70),
    "income": np.random.exponential(55000, n).clip(15000, 200000),
    "credit": np.random.normal(680, 80, n).clip(300, 850),
    "education": np.random.choice(["high_school", "bachelor", "master", "phd"], n, p=[0.3, 0.4, 0.2, 0.1]),
    "employment": np.random.choice(["full-time", "part-time", "unemployed"], n, p=[0.65, 0.2, 0.15]),
    "default": np.random.choice([0, 1], n, p=[0.82, 0.18]),
})
df.loc[df.sample(60, random_state=1).index, "income"] = np.nan
df.loc[df.sample(30, random_state=2).index, "education"] = np.nan
X = df.drop("default", axis=1)
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# PREPROCESSOR (same as previous topic)
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["age", "income", "credit"]),
    ("cat", categorical_pipeline, ["education", "employment"]),
])
# FULL PIPELINE: PREPROCESSOR + MODEL
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)),
])
# TRAIN -- one call handles all preprocessing + training
full_pipeline.fit(X_train, y_train)
# CROSS-VALIDATE on raw (unprocessed) X_train -- preprocessing is refit inside each fold, so no leakage
cv_scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# PREDICT on raw test data -- pipeline preprocesses automatically
y_pred = full_pipeline.predict(X_test)
print("\nTest set results:")
print(classification_report(y_test, y_pred, target_names=["No Default", "Default"]))
# SWAP MODELS EASILY -- same pipeline, different final step
lr_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)),
])
lr_pipeline.fit(X_train, y_train)
print(f"\nLogistic Regression Pipeline CV AUC: {cross_val_score(lr_pipeline, X_train, y_train, cv=5, scoring='roc_auc').mean():.3f}")
# SAVE THE ENTIRE PIPELINE
joblib.dump(full_pipeline, "default_prediction_pipeline.joblib")
# LOAD AND PREDICT ON NEW DATA (in production)
loaded_pipeline = joblib.load("default_prediction_pipeline.joblib")
new_customer = pd.DataFrame([{
    "age": 32, "income": 55000, "credit": 680,
    "education": "bachelor", "employment": "full-time",
}])
prob_default = loaded_pipeline.predict_proba(new_customer)[0, 1]
print(f"\nNew customer default probability: {prob_default:.1%}")
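Because the model is just another pipeline step, hyperparameters can also be tuned through the pipeline using the step__param double-underscore naming convention. A self-contained sketch on synthetic data (the grid values here are illustrative, not tuned recommendations):

```python
# Sketch: grid search over a pipeline; double underscores route each
# parameter to the named step ("classifier" here).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)  # preprocessing is refit inside every fold of every candidate
print(search.best_params_, round(search.best_score_, 3))
```

Preprocessing parameters can be tuned the same way, e.g. a key like "preprocessor__num__imputer__strategy" against the full pipeline defined above.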
Tip
Practice building a full preprocessing pipeline in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
80% of ML work is data preparation — garbage in = garbage out
Note
Practice Task: (1) Write a working preprocessing pipeline from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake when building a full preprocessing pipeline is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
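The pipeline above already guards against two of these edge cases: the imputers absorb missing values and OneHotEncoder(handle_unknown="ignore") encodes unseen categories as an all-zero row. A self-contained probe (simplified columns, not the full loan schema) showing both at prediction time:

```python
# Sketch: probe a fitted pipeline with edge-case input -- an unseen
# category plus a missing numeric value. SimpleImputer fills the NaN and
# handle_unknown="ignore" one-hot encodes the unknown level as all zeros,
# so prediction still succeeds instead of raising.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({
    "income": [30000.0, 60000.0, 45000.0, 80000.0],
    "employment": ["full-time", "part-time", "full-time", "part-time"],
})
y = [1, 0, 1, 0]
pre = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["employment"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())]).fit(X, y)

# NaN income and an employment level never seen in training
edge = pd.DataFrame({"income": [np.nan], "employment": ["contractor"]})
print(pipe.predict_proba(edge))  # runs without error
```

Writing a tiny probe like this against the saved pipeline is a cheap regression test before deployment.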