Train-Test Split & Avoiding Data Leakage
The train-test split is the most fundamental evaluation technique in ML. Training data teaches the model; test data stands in for new, unseen data and gives your honest estimate of real-world performance. Never let test data influence any step of training (preprocessing, feature selection, model tuning). Letting it do so is data leakage, one of the most common ways ML models fail silently in production.
Correct Train-Test Splitting
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# BASIC SPLIT
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,    # 20% for testing, 80% for training
    random_state=42,   # reproducibility: same split every run
    shuffle=True,      # shuffle before splitting (default=True)
)
# For classification with imbalanced classes: add stratify=y
# train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {X_train.shape} | Test: {X_test.shape}")
print(f"Train target mean: {y_train.mean():.3f} | Test target mean: {y_test.mean():.3f}")
# THREE SETS: Train / Validation / Test
# Use when tuning hyperparameters to avoid test set contamination
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42)
# 0.176 of 85% ~= 15% of total -> 70% train / 15% val / 15% test
print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
# COMMON DATA LEAKAGE PATTERNS
leakage_examples = {
    "Scaling before split": "fit_transform on whole dataset -> test stats leak into scaler",
    "Feature selection on all data": "selecting features based on full-dataset correlation -> test labels influence features",
    "Target encoding without CV": "encode categories using target mean on full dataset -> target leaks into features",
    "Temporal data shuffled": "shuffling time-series -> future data leaks into past training window",
    "Duplicate rows split apart": "same sample in both train and test -> inflated accuracy",
}
print("\nData Leakage Types:")
for mistake, explanation in leakage_examples.items():
    print(f"  LEAK: {mistake}")
    print(f"        {explanation}\n")
# CORRECT APPROACH:
scaler = StandardScaler()
# FIT ONLY on training data
X_train_sc = scaler.fit_transform(X_train)
# APPLY (no fit) to validation and test
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)
model = Ridge(alpha=1.0)
model.fit(X_train_sc, y_train)
print(f"Validation R2: {model.score(X_val_sc, y_val):.4f}")
print(f"Test R2: {model.score(X_test_sc, y_test):.4f}")Tip
Tip
Practice train-test splitting and leakage avoidance in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Use a library such as Faker to generate synthetic practice data, and mask any PII when experimenting with real datasets.
Practice Task
(1) Write a working example of a train-test split that avoids data leakage, from scratch and without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake when practicing train-test splits and leakage avoidance is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
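As a minimal illustration of such boundary checks (a sketch only, reusing the variables from the example above):
# Basic sanity checks on the split (illustrative):
assert len(X_train) > 0 and len(X_test) > 0, "one side of the split is empty"
assert not X_train.isna().any().any(), "NaNs in training features"
# Guard against identical rows leaking across the split (see "Duplicate rows split apart")
overlap = pd.merge(X_train, X_test, how="inner")
print(f"Rows appearing in both train and test: {len(overlap)}")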