Residual Analysis — Diagnosing Linear Regression
Residuals (actual minus predicted values) reveal whether the assumptions behind linear regression hold. Patterns in a residual plot point to specific model problems: a fan shape signals heteroscedasticity (the error variance changes with the fitted value), a curve signals non-linearity (consider adding polynomial features), and clusters of systematic outliers usually indicate data-quality issues. Careful residual analysis keeps you from deploying a model with hidden flaws.
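Before the full walkthrough below, here is a minimal, self-contained sketch (synthetic data, illustrative only) of what those two patterns look like in the residuals themselves: a fan from noise whose variance grows with x, and a curve from fitting a straight line to a quadratic signal.

# Synthetic sketch of the two classic residual patterns (illustrative only)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)

# Heteroscedastic: noise scale grows with x -> fan-shaped residuals
y_fan = 2 * x.ravel() + rng.normal(0, 0.2 * x.ravel())
res_fan = y_fan - LinearRegression().fit(x, y_fan).predict(x)

# Non-linear: quadratic signal fit with a line -> curved residuals
y_curve = 0.5 * x.ravel() ** 2 + rng.normal(0, 1, 300)
res_curve = y_curve - LinearRegression().fit(x, y_curve).predict(x)

# Residual spread growing with x reveals the fan; the sign pattern
# (positive, negative, positive) across x reveals the curve
print(f"fan:   residual std, low x vs high x: {res_fan[:150].std():.2f} vs {res_fan[150:].std():.2f}")
print(f"curve: residual mean, low/mid/high x: {res_curve[:100].mean():.2f} / "
      f"{res_curve[100:200].mean():.2f} / {res_curve[200:].mean():.2f}")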
Residual Plots and Diagnostics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import stats
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
model = LinearRegression()
model.fit(scaler.fit_transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))
residuals = y_test - y_pred
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# PLOT 1: Residuals vs Fitted Values
axes[0, 0].scatter(y_pred, residuals, alpha=0.3, color="steelblue", s=20)
axes[0, 0].axhline(0, color="red", linestyle="--", linewidth=2)
axes[0, 0].set_xlabel("Fitted Values")
axes[0, 0].set_ylabel("Residuals")
axes[0, 0].set_title("Residuals vs Fitted\n(should be random scatter around 0)")
# PLOT 2: Distribution of residuals (should be roughly normal)
axes[0, 1].hist(residuals, bins=50, color="steelblue", edgecolor="white", density=True)
x_range = np.linspace(residuals.min(), residuals.max(), 100)
axes[0, 1].plot(x_range, stats.norm.pdf(x_range, residuals.mean(), residuals.std()),
color="red", linewidth=2, label="Normal fit")
axes[0, 1].set_title("Residual Distribution\n(should be bell-shaped)")
axes[0, 1].legend()
# PLOT 3: Q-Q plot (normality check)
stats.probplot(residuals, plot=axes[1, 0])
axes[1, 0].set_title("Q-Q Plot\n(points on diagonal = normal residuals)")
# PLOT 4: Actual vs Predicted
axes[1, 1].scatter(y_test, y_pred, alpha=0.3, color="steelblue", s=20)
min_val, max_val = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
axes[1, 1].plot([min_val, max_val], [min_val, max_val], "r--", linewidth=2, label="Perfect prediction")
axes[1, 1].set_xlabel("Actual Values")
axes[1, 1].set_ylabel("Predicted Values")
axes[1, 1].set_title("Actual vs Predicted\n(diagonal = perfect model)")
axes[1, 1].legend()
plt.tight_layout()
plt.savefig("residual_plots.png", dpi=100, bbox_inches="tight")
plt.show()
# RESIDUAL STATISTICS
print("Residual diagnostics:")
print(f" Mean: {residuals.mean():.4f} (should be ~0)")
print(f" Std: {residuals.std():.4f}")
print(f" Skewness: {pd.Series(residuals).skew():.3f} (|<0.5| = good, >1.0 = problematic)")
print(f" Max error: {abs(residuals).max():.3f} (worst single prediction)")
print(f" Within 0.5: {(abs(residuals) < 0.5).mean():.1%} (predictions within $50k)")
# HETEROSCEDASTICITY DETECTION -- Breusch-Pagan-like test
corr_w_residuals = np.corrcoef(y_pred, residuals**2)[0, 1]
print(f"\nResiduals^2 vs fitted correlation: {corr_w_residuals:.3f}")
print(" (near 0 = homoscedastic, large value = heteroscedastic -> consider log(y))")Tip
Tip
Practice residual analysis in small, isolated examples before integrating it into larger projects. Breaking a concept into small experiments builds genuine understanding faster than reading alone.
Recall the model under diagnosis: linear regression, the simplest ML model, fits y = mx + b by minimizing the mean squared error (MSE). A small from-scratch experiment in that spirit follows.
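A quick sketch of such an experiment: fit y = mx + b with np.polyfit (which minimizes MSE for a degree-1 polynomial) and confirm the basic residual property that OLS residuals average to numerically zero.

# Tiny isolated experiment: least-squares line fit and a residual sanity check
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 50)
y = 3 * x + 2 + rng.normal(0, 0.5, 50)

# np.polyfit(x, y, 1) minimizes MSE and returns [slope, intercept]
m, b = np.polyfit(x, y, 1)
res = y - (m * x + b)
print(f"slope={m:.2f}, intercept={b:.2f}, residual mean={res.mean():.2e}")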
Practice Task
(1) Rewrite the residual-analysis example above from scratch without looking at notes. (2) Modify it to handle an edge case: an empty input, a null value, or an error state. (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake in residual analysis is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
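A hedged sketch of what that validation can look like, using a hypothetical helper residual_summary (the name and exact checks are illustrative, not a standard API):

# Hypothetical helper illustrating the boundary checks above;
# the name and checks are illustrative, not a standard API
import numpy as np

def residual_summary(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: {y_true.shape} vs {y_pred.shape}")
    if y_true.size == 0:
        raise ValueError("empty input: no residuals to analyze")
    res = y_true - y_pred
    if np.isnan(res).any():
        raise ValueError("NaN residuals: check for nulls in the inputs")
    return {"mean": res.mean(), "std": res.std(), "max_abs": np.abs(res).max()}

print(residual_summary([3.0, 2.5, 4.1], [2.8, 2.7, 4.0]))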