Correlation Analysis — Feature Relationships
Correlation measures the strength of the linear (Pearson) or monotonic (Spearman) relationship between two variables. High correlation between two features means they carry largely redundant information: keeping both rarely helps the model and can cause problems such as multicollinearity in linear models. High correlation between a feature and the target suggests the feature is predictive, so prioritize it. Use Pearson for linear relationships and Spearman for monotonic but possibly non-linear ones.
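The Pearson/Spearman distinction is easy to see on synthetic data. A minimal sketch (the exponential relationship here is an illustrative choice, not from the dataset below): Spearman only cares about rank order, so any strictly monotonic relationship scores a perfect 1.0, while Pearson is pulled down by the non-linearity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(0, 5, 200))
y = np.exp(x)  # strictly increasing in x, but strongly non-linear

# Spearman ranks both series first, so a monotonic map gives exactly 1.0;
# Pearson measures linear fit and lands well below 1 here.
print(f"Pearson:  {x.corr(y, method='pearson'):.3f}")
print(f"Spearman: {x.corr(y, method='spearman'):.3f}")  # 1.000
```

If the two coefficients disagree sharply like this, the relationship is monotonic but non-linear, which is itself useful diagnostic information.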
Correlation Heatmaps and Feature-Target Correlation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target
# Select first 10 features for clarity
features_10 = list(cancer.feature_names[:10]) + ["target"]
df_10 = df[features_10]
# PEARSON CORRELATION MATRIX
corr_matrix = df_10.corr(method="pearson")
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
# Full heatmap
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) # upper triangle mask
sns.heatmap(
    corr_matrix, ax=axes[0], mask=mask,
    annot=True, fmt=".2f", cmap="RdYlGn", center=0,
    vmin=-1, vmax=1, linewidths=0.5,
    cbar_kws={"label": "Pearson Correlation"},
)
axes[0].set_title("Feature Correlation Matrix\n(lower triangle only)")
# Feature-target correlation bar chart
target_corr = corr_matrix["target"].drop("target").sort_values()
colors = ["tomato" if v < 0 else "steelblue" for v in target_corr]
target_corr.plot(kind="barh", ax=axes[1], color=colors)
axes[1].axvline(0, color="black", linewidth=0.8)
axes[1].set_title("Correlation with Target")
axes[1].set_xlabel("Pearson Correlation")
plt.tight_layout()
plt.savefig("correlations.png", dpi=100, bbox_inches="tight")
plt.show()
# IDENTIFY HIGHLY CORRELATED FEATURE PAIRS (potential redundancy)
print("Highly correlated feature pairs (|r| > 0.85):")
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        r = corr_matrix.iloc[i, j]
        # skip the target column so only feature-feature pairs are reported
        if abs(r) > 0.85 and "target" not in (corr_matrix.columns[i], corr_matrix.columns[j]):
            high_corr.append((corr_matrix.columns[i], corr_matrix.columns[j], round(r, 3)))
for feat1, feat2, r in sorted(high_corr, key=lambda x: abs(x[2]), reverse=True):
    print(f"  {feat1:20s} <-> {feat2:20s}: r = {r:+.3f}")
# SPEARMAN CORRELATION -- handles non-linear monotonic relationships
spearman_corr = df_10.corr(method="spearman")
target_spearman = spearman_corr["target"].drop("target").sort_values()
print("\nTop 5 features by Spearman correlation with target:")
print(target_spearman.abs().sort_values(ascending=False).head(5).round(3))
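The pair listing above flags redundancy but does not act on it. A common pruning pattern, sketched here as one illustrative option (the 0.85 threshold is carried over from the listing, not a universal constant), is to scan the upper triangle and drop any column that correlates too strongly with an earlier one, so exactly one copy of each redundant signal survives.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Absolute correlations, upper triangle only (k=1 excludes the diagonal)
corr = df.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column whose correlation with an earlier column exceeds 0.85
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
df_reduced = df.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} of {df.shape[1]} features")
```

Because each column is compared only against earlier ones, every surviving pair of columns has |r| at or below the threshold. Which member of a pair to keep is a modeling decision; this greedy order-based rule is just the simplest deterministic choice.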
Tip
Practice correlation analysis in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
In practice, good feature engineering often contributes more to model performance than the choice of algorithm.
Practice Task
(1) Write a working correlation-analysis example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with correlation analysis is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
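Null values are a concrete example of such an edge case for correlation itself. A small sketch (the toy DataFrame is invented for illustration): pandas `.corr()` silently drops rows that have a NaN in either column of a pair, so a correlation can rest on far fewer observations than the DataFrame length suggests; `min_periods` makes that failure mode explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, np.nan, 10.0],
})

# Only 3 rows are complete for the (a, b) pair, and on those rows b == 2*a,
# so the default call reports a perfect correlation without any warning.
print(df.corr())

# min_periods returns NaN when fewer complete pairs than required remain,
# surfacing the thin-data problem instead of hiding it.
print(df.corr(min_periods=4))
```

Checking `df.isna().sum()` alongside the correlation matrix is a cheap way to catch coefficients computed on a handful of rows.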