Mini Project: Production-Ready Preprocessing Pipeline
Build a production-ready preprocessing pipeline for a loan default prediction dataset: handle missing values, encode categoricals, scale numerics, balance classes, cross-validate the full pipeline, and package it for deployment. This is the standard pattern used by ML teams across industry.
Complete Preprocessing Pipeline for Loan Default
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, roc_auc_score, f1_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
import joblib
np.random.seed(42)
N = 3000
# REALISTIC LOAN DATASET
raw_df = pd.DataFrame({
    "age": np.random.normal(40, 12, N).clip(18, 75),
    "annual_income": np.random.exponential(55000, N).clip(15000, 300000),
    "credit_score": np.random.normal(680, 85, N).clip(300, 850),
    "loan_amount": np.random.exponential(18000, N).clip(1000, 100000),
    "loan_term": np.random.choice([12, 24, 36, 48, 60], N),
    "employment": np.random.choice(["full-time", "part-time", "self-employed", "unemployed"], N, p=[0.6, 0.15, 0.15, 0.1]),
    "purpose": np.random.choice(["home", "car", "education", "medical", "personal"], N),
    "default": np.random.choice([0, 1], N, p=[0.85, 0.15]),
})
# Inject missing values
raw_df.loc[raw_df.sample(200, random_state=1).index, "annual_income"] = np.nan
raw_df.loc[raw_df.sample(100, random_state=2).index, "credit_score"] = np.nan
raw_df.loc[raw_df.sample(80, random_state=3).index, "employment"] = np.nan
X = raw_df.drop("default", axis=1)
y = raw_df["default"]
# DEFINE COLUMN GROUPS
numeric_cols = ["age", "annual_income", "credit_score", "loan_amount", "loan_term"]
categorical_cols = ["employment", "purpose"]
# BUILD PIPELINES
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # No drop="first" here: combined with handle_unknown="ignore", dropping a
    # category would make unseen values indistinguishable from the dropped one
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, numeric_cols),
    ("cat", cat_pipe, categorical_cols),
])
# FULL IMBALANCED-AWARE PIPELINE
full_pipe = ImbPipeline([
    ("preprocessor", preprocessor),
    ("smote", SMOTE(sampling_strategy=0.4, random_state=42)),
    ("classifier", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=4, random_state=42)),
])
# CROSS-VALIDATE
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorers = {
    # needs_proba was deprecated in scikit-learn 1.4; response_method is the
    # current spelling (use needs_proba=True on older versions)
    "roc_auc": make_scorer(roc_auc_score, response_method="predict_proba"),
    "f1": make_scorer(f1_score, pos_label=1),
}
cv_results = cross_validate(full_pipe, X, y, cv=cv, scoring=scorers, return_train_score=True)
print("Production Pipeline Cross-Validation Results:")
print(f" AUC-ROC: {cv_results['test_roc_auc'].mean():.3f} +/- {cv_results['test_roc_auc'].std():.3f}")
print(f" F1: {cv_results['test_f1'].mean():.3f} +/- {cv_results['test_f1'].std():.3f}")
print(f" Train AUC: {cv_results['train_roc_auc'].mean():.3f} (vs test: check for overfitting)")
# SAVE PRODUCTION PIPELINE
full_pipe.fit(X, y)
joblib.dump(full_pipe, "loan_default_pipeline_v1.joblib")
print("\nPipeline saved to loan_default_pipeline_v1.joblib")
# PRODUCTION INFERENCE
loaded = joblib.load("loan_default_pipeline_v1.joblib")
new_applicant = pd.DataFrame([{
    "age": 29, "annual_income": 48000, "credit_score": 640,
    "loan_amount": 15000, "loan_term": 36,
    "employment": "full-time", "purpose": "car",
}])
default_prob = loaded.predict_proba(new_applicant)[0, 1]
decision = "APPROVE" if default_prob < 0.20 else "REVIEW" if default_prob < 0.40 else "DECLINE"
print(f"\nApplicant default risk: {default_prob:.1%} -> Decision: {decision}")
Tip
Practice building this production-ready preprocessing pipeline in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
80% of ML work is data preparation: garbage in, garbage out.
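For instance, the median-imputation step can be studied on its own in a few lines before it ever touches the full pipeline (a sketch assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a missing value and an outlier (100.0)
X = np.array([[1.0], [np.nan], [3.0], [100.0]])

imp = SimpleImputer(strategy="median")
out = imp.fit_transform(X)

# Median of the observed values [1, 3, 100] is 3, so the NaN becomes 3.0 --
# unlike the mean, the outlier does not drag the fill value upward
print(out.ravel())  # [  1.   3.   3. 100.]
```

The same isolation trick works for the encoder, the scaler, or SMOTE: fit one component on a toy input and inspect its output directly.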
Practice Task
(1) Write a working example of this production-ready preprocessing pipeline from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with preprocessing pipelines is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
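A cheap guard against such boundary conditions is validating inputs before they reach `predict_proba`. The helper below is hypothetical (the name `validate_applicant` and the hard-coded column list are illustrative, not part of any library), but the fail-fast pattern is standard:

```python
import pandas as pd

# Columns the fitted pipeline expects, in training order (illustrative list)
REQUIRED_COLS = ["age", "annual_income", "credit_score",
                 "loan_amount", "loan_term", "employment", "purpose"]

def validate_applicant(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical guard: fail fast on missing columns or an empty frame."""
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df.empty:
        raise ValueError("empty input: no rows to score")
    return df[REQUIRED_COLS]  # drop extra columns, fix column order

applicant = pd.DataFrame([{c: 1 for c in REQUIRED_COLS}])
print(validate_applicant(applicant).shape)  # (1, 7)
```

Raising a clear `ValueError` at the boundary beats letting a malformed frame surface as an opaque shape error deep inside the pipeline.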