Train-Test Split & Avoiding Data Leakage
The train-test split is the most fundamental evaluation technique in ML. Training data teaches the model; test data stands in for new, unseen data and gives your honest estimate of real-world performance. Never let test data influence any step of training (preprocessing, feature selection, model tuning). Letting it do so is data leakage, one of the most common ways ML models fail silently in production.
Correct Train-Test Splitting
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# BASIC SPLIT
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,    # 20% for testing, 80% for training
    random_state=42,   # reproducibility: same split every run
    shuffle=True,      # shuffle before splitting (default=True)
)
# For classification with imbalanced classes: add stratify=y
# train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {X_train.shape} | Test: {X_test.shape}")
print(f"Train target mean: {y_train.mean():.3f} | Test target mean: {y_test.mean():.3f}")
# THREE SETS: Train / Validation / Test
# Use when tuning hyperparameters to avoid test set contamination
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.176, random_state=42)
# 0.176 of 85% ~= 15% of total -> 70% train / 15% val / 15% test
print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
# COMMON DATA LEAKAGE PATTERNS
leakage_examples = {
    "Scaling before split": "fit_transform on whole dataset -> test stats leak into scaler",
    "Feature selection on all data": "selecting features based on full-dataset correlation -> test labels influence features",
    "Target encoding without CV": "encode categories using target mean on full dataset -> target leaks into features",
    "Temporal data shuffled": "shuffling time-series -> future data leaks into past training window",
    "Duplicate rows split apart": "same sample in both train and test -> inflated accuracy",
}
print("\nData Leakage Types:")
for mistake, explanation in leakage_examples.items():
    print(f"  LEAK: {mistake}")
    print(f"        {explanation}\n")
# CORRECT APPROACH:
scaler = StandardScaler()
# FIT ONLY on training data
X_train_sc = scaler.fit_transform(X_train)
# APPLY (no fit) to validation and test
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)
model = Ridge(alpha=1.0)
model.fit(X_train_sc, y_train)
print(f"Validation R2: {model.score(X_val_sc, y_val):.4f}")
print(f"Test R2: {model.score(X_test_sc, y_test):.4f}")Tip
Tip
Practice train-test splitting and leakage avoidance in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Use a library such as Faker to generate synthetic practice data, and mask any PII when experimenting with real datasets.
Practice Task
(1) Write a working example of a train-test split that avoids data leakage, from scratch and without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake when practicing train-test splits and leakage avoidance is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
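As a minimal illustration of such boundary checks (a sketch only, reusing the variables from the example above):
# Basic sanity checks on the split (illustrative):
assert len(X_train) > 0 and len(X_test) > 0, "one side of the split is empty"
assert not X_train.isna().any().any(), "NaNs in training features"
# Guard against identical rows leaking across the split (see "Duplicate rows split apart")
overlap = pd.merge(X_train, X_test, how="inner")
print(f"Rows appearing in both train and test: {len(overlap)}")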