Mini Project: Production-Ready Preprocessing Pipeline
Build a production-ready preprocessing pipeline for a loan default prediction dataset: handle missing values, encode categoricals, scale numerics, balance classes, cross-validate the full pipeline, and package it for deployment. This is the standard pattern used by ML teams across industry.
Complete Preprocessing Pipeline for Loan Default
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, roc_auc_score, f1_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
import joblib
np.random.seed(42)
N = 3000
# REALISTIC LOAN DATASET
raw_df = pd.DataFrame({
    "age": np.random.normal(40, 12, N).clip(18, 75),
    "annual_income": np.random.exponential(55000, N).clip(15000, 300000),
    "credit_score": np.random.normal(680, 85, N).clip(300, 850),
    "loan_amount": np.random.exponential(18000, N).clip(1000, 100000),
    "loan_term": np.random.choice([12, 24, 36, 48, 60], N),
    "employment": np.random.choice(["full-time", "part-time", "self-employed", "unemployed"], N, p=[0.6, 0.15, 0.15, 0.1]),
    "purpose": np.random.choice(["home", "car", "education", "medical", "personal"], N),
    "default": np.random.choice([0, 1], N, p=[0.85, 0.15]),
})
# Inject missing values
raw_df.loc[raw_df.sample(200, random_state=1).index, "annual_income"] = np.nan
raw_df.loc[raw_df.sample(100, random_state=2).index, "credit_score"] = np.nan
raw_df.loc[raw_df.sample(80, random_state=3).index, "employment"] = np.nan
X = raw_df.drop("default", axis=1)
y = raw_df["default"]
# DEFINE COLUMN GROUPS
numeric_cols = ["age", "annual_income", "credit_score", "loan_amount", "loan_term"]
categorical_cols = ["employment", "purpose"]
# BUILD PIPELINES
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # No drop="first" here: combined with handle_unknown="ignore", dropping a
    # category would make unseen values indistinguishable from the dropped one
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, numeric_cols),
    ("cat", cat_pipe, categorical_cols),
])
# FULL IMBALANCED-AWARE PIPELINE
full_pipe = ImbPipeline([
    ("preprocessor", preprocessor),
    ("smote", SMOTE(sampling_strategy=0.4, random_state=42)),
    ("classifier", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=4, random_state=42)),
])
# CROSS-VALIDATE
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorers = {
    # needs_proba was deprecated in scikit-learn 1.4; response_method is the
    # current spelling (use needs_proba=True on older versions)
    "roc_auc": make_scorer(roc_auc_score, response_method="predict_proba"),
    "f1": make_scorer(f1_score, pos_label=1),
}
cv_results = cross_validate(full_pipe, X, y, cv=cv, scoring=scorers, return_train_score=True)
print("Production Pipeline Cross-Validation Results:")
print(f" AUC-ROC: {cv_results['test_roc_auc'].mean():.3f} +/- {cv_results['test_roc_auc'].std():.3f}")
print(f" F1: {cv_results['test_f1'].mean():.3f} +/- {cv_results['test_f1'].std():.3f}")
print(f" Train AUC: {cv_results['train_roc_auc'].mean():.3f} (vs test: check for overfitting)")
# SAVE PRODUCTION PIPELINE
full_pipe.fit(X, y)
joblib.dump(full_pipe, "loan_default_pipeline_v1.joblib")
print("\nPipeline saved to loan_default_pipeline_v1.joblib")
# PRODUCTION INFERENCE
loaded = joblib.load("loan_default_pipeline_v1.joblib")
new_applicant = pd.DataFrame([{
    "age": 29, "annual_income": 48000, "credit_score": 640,
    "loan_amount": 15000, "loan_term": 36,
    "employment": "full-time", "purpose": "car",
}])
default_prob = loaded.predict_proba(new_applicant)[0, 1]
decision = "APPROVE" if default_prob < 0.20 else "REVIEW" if default_prob < 0.40 else "DECLINE"
print(f"\nApplicant default risk: {default_prob:.1%} -> Decision: {decision}")
Tip
Practice building this production-ready preprocessing pipeline in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
80% of ML work is data preparation: garbage in, garbage out.
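For instance, the median-imputation step can be studied on its own in a few lines before it ever touches the full pipeline (a sketch assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a missing value and an outlier (100.0)
X = np.array([[1.0], [np.nan], [3.0], [100.0]])

imp = SimpleImputer(strategy="median")
out = imp.fit_transform(X)

# Median of the observed values [1, 3, 100] is 3, so the NaN becomes 3.0 --
# unlike the mean, the outlier does not drag the fill value upward
print(out.ravel())  # [  1.   3.   3. 100.]
```

The same isolation trick works for the encoder, the scaler, or SMOTE: fit one component on a toy input and inspect its output directly.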
Practice Task
(1) Write a working example of this production-ready preprocessing pipeline from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with preprocessing pipelines is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
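A cheap guard against such boundary conditions is validating inputs before they reach `predict_proba`. The helper below is hypothetical (the name `validate_applicant` and the hard-coded column list are illustrative, not part of any library), but the fail-fast pattern is standard:

```python
import pandas as pd

# Columns the fitted pipeline expects, in training order (illustrative list)
REQUIRED_COLS = ["age", "annual_income", "credit_score",
                 "loan_amount", "loan_term", "employment", "purpose"]

def validate_applicant(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical guard: fail fast on missing columns or an empty frame."""
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df.empty:
        raise ValueError("empty input: no rows to score")
    return df[REQUIRED_COLS]  # drop extra columns, fix column order

applicant = pd.DataFrame([{c: 1 for c in REQUIRED_COLS}])
print(validate_applicant(applicant).shape)  # (1, 7)
```

Raising a clear `ValueError` at the boundary beats letting a malformed frame surface as an opaque shape error deep inside the pipeline.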