The ML Workflow — From Problem to Prediction
Every ML project follows the same basic workflow regardless of domain, and skipping or rushing any step tends to produce models that fail in production. The most underestimated step is problem framing: defining the right target variable and success metric before any code is written. Data preparation typically consumes 60-80% of total project time.
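Why metric choice belongs in problem framing can be made concrete with a bit of arithmetic. On the roughly 0.1% real-world fraud rate described in the framing step below, a model that always predicts "legitimate" scores 99.9% accuracy while catching zero fraud. A minimal sketch:

```python
# Why accuracy is misleading for rare-event problems: a trivial
# "always predict legitimate" baseline on a 0.1% fraud rate.
import numpy as np

y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1                      # 10 fraud cases -> 0.1% fraud rate
y_baseline = np.zeros_like(y_true)   # always predict 0 (legitimate)

accuracy = (y_baseline == y_true).mean()
recall = y_baseline[y_true == 1].mean()  # fraction of fraud caught

print(f"Accuracy: {accuracy:.1%} | Fraud caught: {recall:.0%}")
# Accuracy: 99.9% | Fraud caught: 0%
```

This is why the framing step pins success to recall and precision rather than accuracy.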
Complete ML Workflow
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
# THE 7-STEP ML WORKFLOW
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# STEP 1: DEFINE THE PROBLEM
problem_framing = {
    "business_problem": "Reduce credit card fraud losses",
    "ml_task": "Binary classification (fraud=1 / legitimate=0)",
    "target_variable": "is_fraud",
    "success_metric": "Recall >= 0.90 (catch 90% of fraud) at precision >= 0.70",
    "baseline": "Naive baseline: ~0.1% fraud rate -> always predict 0 = 99.9% accuracy but 0% fraud caught",
}
# STEP 2: COLLECT & LOAD DATA
# In real projects: SQL query, CSV, API, database
# Using synthetic data for demonstration
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    "amount": np.random.exponential(100, n),
    "hour": np.random.randint(0, 24, n),
    "merchant_type": np.random.choice(["retail", "online", "atm"], n),
    "distance_km": np.random.exponential(50, n),
    "is_fraud": np.random.choice([0, 1], n, p=[0.98, 0.02]),
})
print(f"Dataset: {df.shape} | Fraud rate: {df['is_fraud'].mean():.1%}")
# STEP 3: EXPLORE THE DATA (EDA)
print("\nData overview:")
print(df.groupby("is_fraud")[["amount", "distance_km"]].mean().round(2))
# STEP 4: PREPROCESS
X = pd.get_dummies(df.drop("is_fraud", axis=1), columns=["merchant_type"])
y = df["is_fraud"]
# STEP 5: SPLIT DATA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"\nTrain: {len(X_train)} | Test: {len(X_test)}")
# Scale numerical features (fit on train only to avoid data leakage)
# Note: tree ensembles like RandomForest don't require scaling; it's
# included here because linear models and kNN do.
scaler = StandardScaler()
cols_to_scale = ["amount", "hour", "distance_km"]
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])
# STEP 6: TRAIN MODEL
model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
# STEP 7: EVALUATE
y_pred = model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Legitimate", "Fraud"]))
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
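Step 1 set a success metric of recall >= 0.90, and the default 0.5 decision threshold used by `predict` rarely lands on a specific recall target. The sketch below, a hedged example rather than part of the original listing, shows how to tune the threshold on predicted probabilities; it is self-contained on synthetic imbalanced data, whereas in the workflow above you would reuse the fitted `model` with `X_test`/`y_test`:

```python
# Sketch: choosing a decision threshold that meets the recall target
# from step 1 (recall >= 0.90) instead of the default 0.5 cutoff.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]          # P(fraud) per transaction

precision, recall, thresholds = precision_recall_curve(y_te, proba)
# recall[:-1] aligns with `thresholds`; take the highest threshold that
# still reaches the recall target (this maximizes precision there)
candidates = thresholds[recall[:-1] >= 0.90]
t = candidates.max() if candidates.size else 0.5
preds = (proba >= t).astype(int)
print(f"threshold={t:.2f}  recall={recall_score(y_te, preds):.2f}  "
      f"precision={precision_score(y_te, preds):.2f}")
```

Picking the highest qualifying threshold is a deliberate design choice: recall falls as the threshold rises, so the largest threshold that still satisfies the recall constraint gives the best precision you can get at that recall.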
Tip
Practice the ML workflow in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
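As one such small experiment, you can check what the `stratify` argument in the split step actually buys you on imbalanced data. This is a hedged sketch on synthetic labels, not part of the original listing:

```python
# Small experiment: test-set class balance with and without stratification.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.02).astype(int)   # ~2% positive class
X = rng.random((2000, 3))

for strat in (None, y):
    _, _, _, y_te = train_test_split(X, y, test_size=0.2,
                                     random_state=1, stratify=strat)
    label = "yes" if strat is not None else "no"
    print(f"stratify={label}: test fraud rate = {y_te.mean():.2%}")
```

Without stratification the rare class can be over- or under-represented in the test set by chance; with `stratify=y` the test fraud rate matches the overall rate almost exactly.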
Machine Learning follows a structured pipeline from data to deployment
Practice Task: (1) Write a working example of the full ML workflow from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake in the ML workflow is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
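A minimal sketch of the kind of input validation this warning describes, assuming a hypothetical `predict_fraud` wrapper around a fitted model (the function name and `REQUIRED` column list are illustrative, not from the listing above):

```python
# Illustrative validation wrapper: guard against empty input, missing
# columns, and nulls before handing data to a fitted model.
import numpy as np
import pandas as pd

REQUIRED = ["amount", "hour", "distance_km"]

def predict_fraud(model, df: pd.DataFrame) -> np.ndarray:
    """Validate inputs, then call model.predict (hypothetical wrapper)."""
    if df.empty:
        return np.array([], dtype=int)          # empty input: no predictions
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df[REQUIRED].isnull().any().any():
        raise ValueError("null values in numeric features")
    return model.predict(df[REQUIRED])
```

Failing fast with a clear error message is usually preferable to letting a model silently produce predictions from malformed rows.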