The ML Workflow — From Problem to Prediction
Every ML project follows the same basic workflow regardless of domain, and skipping or rushing any step tends to produce models that fail in production. The most underestimated step is problem framing: defining the right target variable and success metric before any code is written. Data preparation typically consumes 60-80% of total project time.
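Why metric choice belongs in problem framing can be made concrete with a bit of arithmetic. On the roughly 0.1% real-world fraud rate described in the framing step below, a model that always predicts "legitimate" scores 99.9% accuracy while catching zero fraud. A minimal sketch:

```python
# Why accuracy is misleading for rare-event problems: a trivial
# "always predict legitimate" baseline on a 0.1% fraud rate.
import numpy as np

y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1                      # 10 fraud cases -> 0.1% fraud rate
y_baseline = np.zeros_like(y_true)   # always predict 0 (legitimate)

accuracy = (y_baseline == y_true).mean()
recall = y_baseline[y_true == 1].mean()  # fraction of fraud caught

print(f"Accuracy: {accuracy:.1%} | Fraud caught: {recall:.0%}")
# Accuracy: 99.9% | Fraud caught: 0%
```

This is why the framing step pins success to recall and precision rather than accuracy.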
Complete ML Workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
# THE 7-STEP ML WORKFLOW
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# STEP 1: DEFINE THE PROBLEM
problem_framing = {
    "business_problem": "Reduce credit card fraud losses",
    "ml_task": "Binary classification (fraud=1 / legitimate=0)",
    "target_variable": "is_fraud",
    "success_metric": "Recall >= 0.90 (catch 90% of fraud) at precision >= 0.70",
    "baseline": "Naive baseline: ~0.1% fraud rate -> always predict 0 = 99.9% accuracy but 0% fraud caught",
}
# STEP 2: COLLECT & LOAD DATA
# In real projects: SQL query, CSV, API, database
# Using synthetic data for demonstration
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    "amount": np.random.exponential(100, n),
    "hour": np.random.randint(0, 24, n),
    "merchant_type": np.random.choice(["retail", "online", "atm"], n),
    "distance_km": np.random.exponential(50, n),
    "is_fraud": np.random.choice([0, 1], n, p=[0.98, 0.02]),
})
print(f"Dataset: {df.shape} | Fraud rate: {df['is_fraud'].mean():.1%}")
# STEP 3: EXPLORE THE DATA (EDA)
print("\nData overview:")
print(df.groupby("is_fraud")[["amount", "distance_km"]].mean().round(2))
# STEP 4: PREPROCESS
X = pd.get_dummies(df.drop("is_fraud", axis=1), columns=["merchant_type"])
y = df["is_fraud"]
# STEP 5: SPLIT DATA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"\nTrain: {len(X_train)} | Test: {len(X_test)}")
# Scale numerical features (fit on train only to avoid data leakage)
# Note: tree ensembles like RandomForest don't require scaling; it's
# included here because linear models and kNN do.
scaler = StandardScaler()
cols_to_scale = ["amount", "hour", "distance_km"]
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])
# STEP 6: TRAIN MODEL
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
# STEP 7: EVALUATE
y_pred = model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Legitimate", "Fraud"]))
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
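Step 1 set a success metric of recall >= 0.90, and the default 0.5 decision threshold used by `predict` rarely lands on a specific recall target. The sketch below, a hedged example rather than part of the original listing, shows how to tune the threshold on predicted probabilities; it is self-contained on synthetic imbalanced data, whereas in the workflow above you would reuse the fitted `model` with `X_test`/`y_test`:

```python
# Sketch: choosing a decision threshold that meets the recall target
# from step 1 (recall >= 0.90) instead of the default 0.5 cutoff.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]          # P(fraud) per transaction

precision, recall, thresholds = precision_recall_curve(y_te, proba)
# recall[:-1] aligns with `thresholds`; take the highest threshold that
# still reaches the recall target (this maximizes precision there)
candidates = thresholds[recall[:-1] >= 0.90]
t = candidates.max() if candidates.size else 0.5
preds = (proba >= t).astype(int)
print(f"threshold={t:.2f}  recall={recall_score(y_te, preds):.2f}  "
      f"precision={precision_score(y_te, preds):.2f}")
```

Picking the highest qualifying threshold is a deliberate design choice: recall falls as the threshold rises, so the largest threshold that still satisfies the recall constraint gives the best precision you can get at that recall.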
Tip
Practice the ML workflow in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
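As one such small experiment, you can check what the `stratify` argument in the split step actually buys you on imbalanced data. This is a hedged sketch on synthetic labels, not part of the original listing:

```python
# Small experiment: test-set class balance with and without stratification.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.02).astype(int)   # ~2% positive class
X = rng.random((2000, 3))

for strat in (None, y):
    _, _, _, y_te = train_test_split(X, y, test_size=0.2,
                                     random_state=1, stratify=strat)
    label = "yes" if strat is not None else "no"
    print(f"stratify={label}: test fraud rate = {y_te.mean():.2%}")
```

Without stratification the rare class can be over- or under-represented in the test set by chance; with `stratify=y` the test fraud rate matches the overall rate almost exactly.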
Machine Learning follows a structured pipeline from data to deployment
Practice Task: (1) Write a working example of the full ML workflow from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake in the ML workflow is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
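A minimal sketch of the kind of input validation this warning describes, assuming a hypothetical `predict_fraud` wrapper around a fitted model (the function name and `REQUIRED` column list are illustrative, not from the listing above):

```python
# Illustrative validation wrapper: guard against empty input, missing
# columns, and nulls before handing data to a fitted model.
import numpy as np
import pandas as pd

REQUIRED = ["amount", "hour", "distance_km"]

def predict_fraud(model, df: pd.DataFrame) -> np.ndarray:
    """Validate inputs, then call model.predict (hypothetical wrapper)."""
    if df.empty:
        return np.array([], dtype=int)          # empty input: no predictions
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df[REQUIRED].isnull().any().any():
        raise ValueError("null values in numeric features")
    return model.predict(df[REQUIRED])
```

Failing fast with a clear error message is usually preferable to letting a model silently produce predictions from malformed rows.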