Handling Skewed Targets in Regression
Features are not the only data that can be skewed: regression targets often are too (house prices, salaries, transaction amounts). Training on the log-transformed target, then reversing the transformation at inference time, can dramatically improve model performance on right-skewed targets. Box-Cox is more systematic but requires strictly positive targets; Yeo-Johnson also handles zeros and negatives.
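For a quick feel for the two power transforms mentioned above, here is a minimal sketch on a synthetic log-normal sample (the variable names are ours, not from the housing example below):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # strictly positive, right-skewed

# Box-Cox: only valid for strictly positive targets
y_bc = PowerTransformer(method="box-cox").fit_transform(y.reshape(-1, 1)).ravel()

# Yeo-Johnson: also handles zeros and negative values
y_shifted = (y - y.mean()).reshape(-1, 1)  # now contains negatives
y_yj = PowerTransformer(method="yeo-johnson").fit_transform(y_shifted).ravel()

print(f"raw skew:         {pd.Series(y).skew():.2f}")
print(f"box-cox skew:     {pd.Series(y_bc).skew():.2f}")
print(f"yeo-johnson skew: {pd.Series(y_yj).skew():.2f}")
```

Both transformers estimate their lambda parameter by maximum likelihood during fit, so the degree of correction adapts to the data rather than being fixed at "take the log".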
Target Transformation for Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
housing = fetch_california_housing()
X, y = housing.data, housing.target
print(f"Target skewness: {pd.Series(y).skew():.3f}")
print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")
print(f"Mean: {y.mean():.2f}, Median: {np.median(y):.2f}")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# COMPARE: predict raw vs log-transformed target
ridge_raw = Pipeline([("sc", StandardScaler()), ("m", Ridge())])
ridge_raw.fit(X_train, y_train)
r2_raw = r2_score(y_test, ridge_raw.predict(X_test))
mae_raw = mean_absolute_error(y_test, ridge_raw.predict(X_test))
# LOG TRANSFORM THE TARGET
y_train_log = np.log1p(y_train)  # log(1 + y): safe even if y contains zeros
ridge_log = Pipeline([("sc", StandardScaler()), ("m", Ridge())])
ridge_log.fit(X_train, y_train_log)
y_pred_log = ridge_log.predict(X_test)
y_pred_back = np.expm1(y_pred_log) # reverse: expm1 is exp(x)-1
r2_log = r2_score(y_test, y_pred_back)
mae_log = mean_absolute_error(y_test, y_pred_back)
print(f"\nRidge on raw target: R2={r2_raw:.4f} | MAE={mae_raw:.4f}")
print(f"Ridge on log(target): R2={r2_log:.4f} | MAE={mae_log:.4f} (+{r2_log-r2_raw:.4f})")
# TRANSFORMED TARGET REGRESSOR (sklearn >= 0.20) -- handles the transformation inside the estimator
from sklearn.compose import TransformedTargetRegressor
ridge_ttr = TransformedTargetRegressor(
    regressor=Pipeline([("sc", StandardScaler()), ("m", Ridge())]),
    func=np.log1p,
    inverse_func=np.expm1,
)
ridge_ttr.fit(X_train, y_train)
r2_ttr = r2_score(y_test, ridge_ttr.predict(X_test))
print(f"TransformedTargetRegressor: R2={r2_ttr:.4f}")
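TransformedTargetRegressor also accepts a transformer object in place of the func/inverse_func pair, which is convenient for Box-Cox or Yeo-Johnson; and because the wrapper handles the inverse internally, it can be passed straight to cross_val_score (imported above). A sketch on synthetic skewed data, not the housing set, with variable names of our choosing:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.exp(X @ (rng.normal(size=5) * 0.3) + rng.normal(scale=0.3, size=1000))  # right-skewed

ridge_pt = TransformedTargetRegressor(
    regressor=Pipeline([("sc", StandardScaler()), ("m", Ridge())]),
    transformer=PowerTransformer(method="yeo-johnson"),  # fitted on y inside fit()
)
# scores are computed on back-transformed predictions, i.e. in original units
scores = cross_val_score(ridge_pt, X, y, cv=5, scoring="r2")
print(f"CV R2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The transformer is refit on each training fold's targets, so there is no leakage of test-fold target statistics into the transformation.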
# GBM: less sensitive to target skewness
gb_raw = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42)
gb_raw.fit(X_train, y_train)
gb_log = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42)
gb_log.fit(X_train, y_train_log)
print(f"\nGBM on raw target: R2={r2_score(y_test, gb_raw.predict(X_test)):.4f}")
print(f"GBM on log(target): R2={r2_score(y_test, np.expm1(gb_log.predict(X_test))):.4f}")
print("Note: Tree models are less sensitive to target skewness than linear models")
# WHICH LOSS FUNCTION FOR SKEWED TARGETS
loss_functions = {
    "MSE (default)": "Squares errors -> large errors dominate -> bad for skewed targets",
    "MAE / Huber": "Absolute (or clipped-quadratic) errors -> robust to outlier targets",
    "MSLE": "Mean Squared Log Error -- penalizes relative errors equally",
    "Quantile loss": "Predict a specific quantile (e.g., 90th percentile for safety margins)",
}
print("\nLoss function guide for skewed regression targets:")
for loss, when in loss_functions.items():
    print(f" {loss:20s}: {when}")
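For the quantile-loss entry above, GradientBoostingRegressor supports it directly via loss="quantile" and the alpha parameter. A minimal sketch on synthetic data (the data and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 1))
y = X.ravel() + rng.lognormal(sigma=0.8, size=2000)  # right-skewed noise

# predict the 90th percentile instead of the mean -> a safety margin
gb_q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0)
gb_q90.fit(X, y)

coverage = np.mean(y <= gb_q90.predict(X))
print(f"fraction of training targets below the q90 prediction: {coverage:.2f}")
```

If the fit is reasonable, roughly 90% of observed targets should fall below the predicted 90th-percentile curve; fitting two quantiles (e.g., 0.1 and 0.9) gives a crude prediction interval.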
Tip
Practice handling skewed targets in small, isolated examples before integrating the technique into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
Practice Task — (1) Write a working example of Handling Skewed Targets in Regression from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with handling skewed targets in regression is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.