Model Serialization & Versioning
Before deploying, you must serialize (save) the trained model and all of its preprocessing steps as a single unit. Joblib is the de facto standard for scikit-learn pipelines. In production, every model version should carry metadata: training date, dataset version, evaluation metrics, and feature names. This metadata enables rollback, debugging, and compliance audits. Never serialize just the model — always serialize the full pipeline.
Joblib Serialization, Versioning, and Model Cards
import os
import json
import datetime
import hashlib
import joblib
import sklearn
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score, classification_report
np.random.seed(42)
N = 2000
df = pd.DataFrame({
"age": np.random.normal(38, 12, N).clip(18, 75),
"income": np.random.exponential(55000, N).clip(15000, 300000),
"credit": np.random.normal(680, 80, N).clip(300, 850),
"loan_amt": np.random.exponential(18000, N).clip(1000, 80000),
"educ": np.random.choice(["hs","bachelor","master","phd"], N, p=[0.3,0.4,0.2,0.1]),
"default": np.random.choice([0,1], N, p=[0.83,0.17]),
})
X = df.drop("default", axis=1)
y = df["default"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# BUILD AND TRAIN PIPELINE
num_cols = ["age", "income", "credit", "loan_amt"]
cat_cols = ["educ"]
preprocessor = ColumnTransformer([
("num", Pipeline([("imp", SimpleImputer(strategy="median")), ("pt", PowerTransformer())]), num_cols),
("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")), ("ohe", OneHotEncoder(drop="first", sparse_output=False, handle_unknown="ignore"))]), cat_cols),
])
pipeline = Pipeline([
("prep", preprocessor),
("model", GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42)),
])
pipeline.fit(X_train, y_train)
y_prob = pipeline.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_prob)
cv_auc = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="roc_auc").mean()
# CREATE A MODEL CARD (metadata for governance)
model_card = {
"model_name": "credit_default_classifier",
"version": "1.0.0",
"algorithm": "GradientBoostingClassifier",
"created_at": datetime.datetime.now().isoformat(),
"train_samples": len(X_train),
"test_samples": len(X_test),
"features": num_cols + cat_cols,
"target": "default",
"metrics": {
"cv_auc_5fold": round(cv_auc, 4),
"test_auc": round(test_auc, 4),
},
"sklearn_version": "1.2+",
"preprocessing": "PowerTransformer (numeric), OneHotEncoder (categorical)",
"author": "ML Team",
"intended_use": "Predict loan default probability for credit scoring",
"limitations": "Trained on simulated data. Retrain on real data before production use.",
}
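# Optionally capture the post-preprocessing feature names as well (a sketch; assumes
# sklearn >= 1.1 so the fitted transformers expose get_feature_names_out)
model_card["transformed_features"] = list(pipeline.named_steps["prep"].get_feature_names_out())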
# SERIALIZE MODEL + CARD
os.makedirs("models", exist_ok=True)
model_path = "models/credit_classifier_v1.0.0.joblib"
card_path = "models/credit_classifier_v1.0.0.json"
joblib.dump(pipeline, model_path)
# CREATE A CHECKSUM for integrity verification (added to the card before it is written)
with open(model_path, "rb") as f:
    model_bytes = f.read()
sha256_hash = hashlib.sha256(model_bytes).hexdigest()
model_card["sha256"] = sha256_hash
with open(card_path, "w") as f:
    json.dump(model_card, f, indent=2)
print(f"Model saved | SHA256: {sha256_hash[:16]}...")
print(f"Test AUC: {test_auc:.4f}")
# LOAD AND VERIFY
loaded_pipeline = joblib.load(model_path)
loaded_auc = roc_auc_score(y_test, loaded_pipeline.predict_proba(X_test)[:, 1])
print(f"Loaded model AUC: {loaded_auc:.4f} (should match {test_auc:.4f})")
# MODEL REGISTRY PATTERN (without MLflow)
registry_entry = {
"versions": [
{"version": "1.0.0", "status": "production", "metrics": model_card["metrics"], "path": model_path},
# Future versions:
# {"version": "1.1.0", "status": "staging", "metrics": {...}, "path": "..."},
]
}
print("\nModel registry pattern:")
print(json.dumps(registry_entry, indent=2))
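To make the registry pattern actionable without MLflow, it helps to persist the registry next to the artifacts and give serving code a single place to resolve (or roll back to) the production version. A minimal sketch, continuing the script above and assuming the registry is stored at models/registry.json:

def save_registry(registry, path="models/registry.json"):
    with open(path, "w") as f:
        json.dump(registry, f, indent=2)

def production_model_path(path="models/registry.json"):
    # Return the artifact path of the version currently marked "production"
    with open(path) as f:
        registry = json.load(f)
    for entry in registry["versions"]:
        if entry["status"] == "production":
            return entry["path"]
    raise LookupError("No production version registered")

save_registry(registry_entry)
print("Production model:", production_model_path())

Rolling back is then a matter of editing the status fields and reloading, rather than redeploying code.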
Tip
Practice model serialization and versioning in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of model serialization and versioning from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with model serialization and versioning is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
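As a quick boundary-condition check against the pipeline built above, you can probe the loaded artifact with a few awkward rows before promoting it. A minimal sketch (the values below are made up for illustration; the imputers absorb the NaNs and handle_unknown="ignore" encodes the unseen category as all zeros):

edge_cases = pd.DataFrame({
    "age": [np.nan, 25],
    "income": [52000, np.nan],
    "credit": [700, 650],
    "loan_amt": [12000, 5000],
    "educ": ["vocational", np.nan],  # unseen category and missing category
})
print(loaded_pipeline.predict_proba(edge_cases)[:, 1])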