Pipeline Versioning & Experiment Tracking

As ML projects mature, you'll run 50+ experiments with different models, features, and hyperparameters. Manually tracking results in spreadsheets fails fast. MLflow is the industry standard for experiment tracking: log parameters, metrics, artefacts (models, plots), and compare runs in a web UI. Every production ML team uses some form of experiment tracking.

15 min•By Priygop Team•Updated 2026

MLflow Experiment Tracking

import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, f1_score, classification_report
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# SET UP MLFLOW EXPERIMENT
mlflow.set_experiment("breast_cancer_classification")

def run_experiment(model_name: str, model, params: dict) -> dict:
    """Train model, log to MLflow, return metrics."""
    with mlflow.start_run(run_name=model_name):
        # LOG PARAMETERS
        mlflow.log_params(params)
        mlflow.log_param("n_train", len(X_train))
        mlflow.log_param("n_features", X_train.shape[1])

        # TRAIN + EVALUATE
        pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
        cv_auc = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc").mean()

        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        y_prob = pipe.predict_proba(X_test)[:, 1]

        test_auc = roc_auc_score(y_test, y_prob)
        test_f1  = f1_score(y_test, y_pred)

        # LOG METRICS
        mlflow.log_metric("cv_auc", cv_auc)
        mlflow.log_metric("test_auc", test_auc)
        mlflow.log_metric("test_f1", test_f1)

        # LOG MODEL
        mlflow.sklearn.log_model(pipe, "model")

        # LOG ARTIFACTS (plots, etc.)
        # mlflow.log_artifact("confusion_matrix.png")

        print(f"  {model_name:25s}: CV AUC={cv_auc:.4f} | Test AUC={test_auc:.4f}")
        return {"model_name": model_name, "cv_auc": cv_auc, "test_auc": test_auc}

# RUN MULTIPLE EXPERIMENTS
experiments = [
    ("Logistic Regression",  LogisticRegression(C=1.0, max_iter=1000, random_state=42),     {"C": 1.0}),
    ("Logistic Reg (C=0.1)", LogisticRegression(C=0.1, max_iter=1000, random_state=42),     {"C": 0.1}),
    ("Random Forest",        RandomForestClassifier(n_estimators=100, random_state=42),      {"n_estimators": 100}),
    ("GBM (lr=0.1)",         GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42), {"n_estimators": 200, "lr": 0.1}),
    ("GBM (lr=0.05)",        GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, random_state=42),{"n_estimators": 300, "lr": 0.05}),
]

print("Running experiments:")
results = []
for name, model, params in experiments:
    result = run_experiment(name, model, params)
    results.append(result)

results_df = pd.DataFrame(results).sort_values("test_auc", ascending=False)
print("\nAll experiments ranked by test AUC:")
print(results_df.round(4).to_string(index=False))

print("\nMLflow server: run 'mlflow ui' in terminal, then open http://localhost:5000")
print("Alternatively: mlflow.set_tracking_uri('sqlite:///mlflow.db') for persistent storage")

# WITHOUT MLFLOW: manual experiment tracking pattern
print("\nManual tracking (no mlflow) -- use when MLflow not available:")
print("""
experiment_log = []
for params in param_grid:
    model = build_model(**params)
    cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
    experiment_log.append({**params, 'cv_score': cv_score, 'timestamp': datetime.now()})

df_log = pd.DataFrame(experiment_log)
df_log.to_csv('experiments.csv', index=False)
""")

Tip

Practice Pipeline Versioning Experiment Tracking in small, isolated examples before integrating into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.

Diagram

Loading diagram…

80% of ML work is data preparation — garbage in = garbage out

Practice Task

Note

Practice Task — (1) Write a working example of Pipeline Versioning Experiment Tracking from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.

Quick Quiz

Common Mistake

Warning

A common mistake with Pipeline Versioning Experiment Tracking is skipping edge case testing — empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ml code.

Topics in This Module