Pipeline Versioning & Experiment Tracking
As ML projects mature, you'll run 50+ experiments with different models, features, and hyperparameters. Manually tracking results in spreadsheets breaks down quickly. MLflow is one of the most widely used tools for experiment tracking: log parameters, metrics, and artefacts (models, plots), then compare runs in a web UI. Most production ML teams rely on some form of experiment tracking.
MLflow Experiment Tracking
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, f1_score, classification_report
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# SET UP MLFLOW EXPERIMENT
mlflow.set_experiment("breast_cancer_classification")
def run_experiment(model_name: str, model, params: dict) -> dict:
    """Train model, log to MLflow, return metrics."""
    with mlflow.start_run(run_name=model_name):
        # LOG PARAMETERS
        mlflow.log_params(params)
        mlflow.log_param("n_train", len(X_train))
        mlflow.log_param("n_features", X_train.shape[1])

        # TRAIN + EVALUATE
        pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
        cv_auc = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc").mean()
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        y_prob = pipe.predict_proba(X_test)[:, 1]
        test_auc = roc_auc_score(y_test, y_prob)
        test_f1 = f1_score(y_test, y_pred)

        # LOG METRICS
        mlflow.log_metric("cv_auc", cv_auc)
        mlflow.log_metric("test_auc", test_auc)
        mlflow.log_metric("test_f1", test_f1)

        # LOG MODEL
        mlflow.sklearn.log_model(pipe, "model")

        # LOG ARTIFACTS (plots, etc.)
        # mlflow.log_artifact("confusion_matrix.png")
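        # A minimal sketch of actually logging a plot (not part of the original
        # example): build a matplotlib figure from the held-out predictions and
        # attach it to the run with mlflow.log_figure. The histogram of predicted
        # probabilities is just one illustrative choice of plot.
        fig, ax = plt.subplots()
        ax.hist(y_prob[y_test == 1], bins=20, alpha=0.6, label="y_test = 1")
        ax.hist(y_prob[y_test == 0], bins=20, alpha=0.6, label="y_test = 0")
        ax.set_xlabel("Predicted probability")
        ax.set_ylabel("Count")
        ax.legend()
        mlflow.log_figure(fig, "probability_histogram.png")
        plt.close(fig)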
print(f" {model_name:25s}: CV AUC={cv_auc:.4f} | Test AUC={test_auc:.4f}")
return {"model_name": model_name, "cv_auc": cv_auc, "test_auc": test_auc}
# RUN MULTIPLE EXPERIMENTS
experiments = [
    ("Logistic Regression", LogisticRegression(C=1.0, max_iter=1000, random_state=42), {"C": 1.0}),
    ("Logistic Reg (C=0.1)", LogisticRegression(C=0.1, max_iter=1000, random_state=42), {"C": 0.1}),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42), {"n_estimators": 100}),
    ("GBM (lr=0.1)", GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42), {"n_estimators": 200, "lr": 0.1}),
    ("GBM (lr=0.05)", GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, random_state=42), {"n_estimators": 300, "lr": 0.05}),
]
print("Running experiments:")
results = []
for name, model, params in experiments:
    result = run_experiment(name, model, params)
    results.append(result)
results_df = pd.DataFrame(results).sort_values("test_auc", ascending=False)
print("\nAll experiments ranked by test AUC:")
print(results_df.round(4).to_string(index=False))
print("\nMLflow server: run 'mlflow ui' in terminal, then open http://localhost:5000")
print("Alternatively: mlflow.set_tracking_uri('sqlite:///mlflow.db') for persistent storage")
# WITHOUT MLFLOW: manual experiment tracking pattern
print("\nManual tracking (no mlflow) -- use when MLflow not available:")
print("""
experiment_log = []
for params in param_grid:
model = build_model(**params)
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
experiment_log.append({**params, 'cv_score': cv_score, 'timestamp': datetime.now()})
df_log = pd.DataFrame(experiment_log)
df_log.to_csv('experiments.csv', index=False)
""")Tip
Tip
Practice pipeline versioning and experiment tracking in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of pipeline versioning and experiment tracking from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, a null value, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with pipeline versioning and experiment tracking is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.