PCA — Dimensionality Reduction
Principal Component Analysis (PCA) finds the orthogonal directions of maximum variance in the data (the principal components) and projects the data onto the subspace they span, reducing the number of features while preserving as much variance as possible. Applications: removing multicollinearity before linear models, visualizing high-dimensional data in 2D, speeding up training, and compressing data. Always standardize before PCA: because it maximizes variance, features measured on larger scales dominate the components unless all features are put on a common scale.
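To see why, here is a minimal sketch on synthetic data (the three features and their scales are made up for illustration): the large-scale feature hijacks PC1 until everything is standardized.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 500)                            # spread ~10
weight_kg = 0.9 * (height_cm - 170) + rng.normal(70, 5, 500)    # correlated with height
income_usd = rng.normal(50_000, 15_000, 500)                    # spread ~15,000
X_toy = np.column_stack([height_cm, weight_kg, income_usd])
# Unscaled: PC1 is essentially the income axis, which has by far the largest raw variance
print("PC1, unscaled:", PCA(n_components=1).fit(X_toy).components_.round(3))
# Scaled: the correlated height/weight pair drives PC1 instead
print("PC1, scaled:  ", PCA(n_components=1).fit(StandardScaler().fit_transform(X_toy)).components_.round(3))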
PCA — Explained Variance, Visualization, and Feature Reduction
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
cancer = load_breast_cancer()
X = cancer.data # 30 features
y = cancer.target
# ALWAYS SCALE BEFORE PCA
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
# FIT PCA ON ALL COMPONENTS
pca_full = PCA()
pca_full.fit(X_sc)
# EXPLAINED VARIANCE RATIO
explained_var = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained_var)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Scree plot
axes[0].bar(range(1, len(explained_var)+1), explained_var, color="steelblue", alpha=0.8)
axes[0].plot(range(1, len(explained_var)+1), cumulative, "ro-", linewidth=2)
axes[0].axhline(0.95, color="green", linestyle="--", label="95% variance")
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Scree Plot -- how many PCs to keep?")
axes[0].legend()
n_95 = np.argmax(cumulative >= 0.95) + 1
n_99 = np.argmax(cumulative >= 0.99) + 1
print(f"Original features: {X.shape[1]}")
print(f"PCs to explain 95% variance: {n_95} (from {X.shape[1]})")
print(f"PCs to explain 99% variance: {n_99}")
# 2D VISUALIZATION (for understanding, not for modeling)
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_sc)
for label, color, name in [(0, "tomato", "Malignant"), (1, "steelblue", "Benign")]:
    mask = y == label
    axes[1].scatter(X_2d[mask, 0], X_2d[mask, 1], c=color, label=name, alpha=0.6, s=25)
axes[1].set_xlabel(f"PC1 ({explained_var[0]:.1%} variance)")
axes[1].set_ylabel(f"PC2 ({explained_var[1]:.1%} variance)")
axes[1].set_title("First 2 PCA Components\n(breast cancer data)")
axes[1].legend()
# LOADINGS: WHAT DOES PC1 MEAN?
loadings = pd.DataFrame(pca_full.components_[:3].T,
                        index=cancer.feature_names,
                        columns=["PC1", "PC2", "PC3"])
loadings["PC1_abs"] = loadings["PC1"].abs()
top_features = loadings.nlargest(10, "PC1_abs")[["PC1", "PC2"]].round(3)
top_features.plot(kind="bar", ax=axes[2], color=["steelblue", "coral"])
axes[2].set_title("PC1 & PC2 Feature Loadings\n(which features drive each PC)")
axes[2].set_xticklabels(axes[2].get_xticklabels(), rotation=30, ha="right", fontsize=8)
plt.tight_layout()
plt.savefig("pca_analysis.png", dpi=100, bbox_inches="tight")
plt.show()
# EFFECT OF PCA ON MODEL PERFORMANCE
models_to_compare = {
    # Random forests are scale-invariant, so the all-features baseline skips scaling
    "RF on all 30 features": Pipeline([("model", RandomForestClassifier(n_estimators=100, random_state=42))]),
    "RF on 10 PCA components": Pipeline([("sc", StandardScaler()), ("pca", PCA(n_components=10)), ("model", RandomForestClassifier(n_estimators=100, random_state=42))]),
    "RF on 5 PCA components": Pipeline([("sc", StandardScaler()), ("pca", PCA(n_components=5)), ("model", RandomForestClassifier(n_estimators=100, random_state=42))]),
}
print("\nModel performance with PCA reduction:")
for name, pipeline in models_to_compare.items():
    cv_auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc").mean()
    print(f" {name:35s}: AUC = {cv_auc:.4f}")
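The application list above also mentions compression. A minimal sketch of that use, reusing X_sc from the example (measuring reconstruction quality with mean squared error is this sketch's choice, not something the original computes):
# Keep 10 of 30 components, then map back to the original feature space
pca_10 = PCA(n_components=10).fit(X_sc)
X_compressed = pca_10.transform(X_sc)                # shape (n_samples, 10)
X_restored = pca_10.inverse_transform(X_compressed)  # shape (n_samples, 30)
mse = np.mean((X_sc - X_restored) ** 2)
print(f"Stored values: {X_compressed.size} vs {X_sc.size}, reconstruction MSE = {mse:.4f}")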
Tip
Practice PCA dimensionality reduction in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Rule of thumb: use PCA for preprocessing features that feed a model; use UMAP for 2D visualization, where its nonlinear embedding preserves cluster structure better than PCA's linear projection.
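For the visualization half of that rule of thumb, a minimal sketch with the umap-learn package (assuming it is installed; X_sc and y come from the example above):
import umap  # pip install umap-learn
umap_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(X_sc)
plt.scatter(umap_2d[:, 0], umap_2d[:, 1], c=y, cmap="coolwarm", s=25, alpha=0.6)
plt.title("UMAP embedding of the breast cancer data (visualization only)")
plt.show()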
Practice Task
(1) Write a working example of PCA dimensionality reduction from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, a null value, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with PCA dimensionality reduction is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
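As a minimal sketch of that kind of validation (safe_pca_transform is a hypothetical helper written for this note, not a library function):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
def safe_pca_transform(X, n_components=2):
    # Hypothetical helper: validate boundary conditions before fitting PCA
    X = np.asarray(X, dtype=float)  # unexpected dtypes fail loudly here
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError(f"expected a non-empty 2D array, got shape {X.shape}")
    if np.isnan(X).any():
        raise ValueError("input contains NaNs; impute or drop them before PCA")
    if n_components > min(X.shape):
        raise ValueError(f"n_components={n_components} exceeds min(n_samples, n_features)")
    X_sc = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(X_sc)
print(safe_pca_transform(np.random.rand(10, 5)).shape)  # (10, 2); an empty array raises instead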