t-SNE & UMAP — High-Dimensional Visualization
PCA is a linear projection — it misses complex non-linear structure. t-SNE (t-Distributed Stochastic Neighbor Embedding) preserves local neighbourhood structure in 2D or 3D, revealing clusters, manifolds, and class separation invisible to PCA. Used for visualizing embeddings, NLP sentence representations, and image features. Key: t-SNE is for visualization only — do NOT use t-SNE features as model inputs.
t-SNE vs PCA for High-Dimensional Visualization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
# DIGITS DATASET: 8x8 images = 64 features, 10 classes (0-9)
digits = load_digits()
X, y = digits.data, digits.target
print(f"Digits dataset: {X.shape} features -> visualizing in 2D")
# SCALE BEFORE TSNE (recommended)
X_sc = StandardScaler().fit_transform(X)
# PCA first to speed up t-SNE (recommended for large datasets)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(X_sc)
# t-SNE
tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42,
            learning_rate="auto", init="pca")  # note: n_iter is renamed max_iter in scikit-learn >= 1.5
X_tsne = tsne.fit_transform(X_pca50)
# PCA 2D for comparison
pca = PCA(n_components=2, random_state=42)
X_pca2d = pca.fit_transform(X_sc)
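To see why the PCA plot looks crowded compared to t-SNE, it helps to check how much variance the two components actually retain. A small self-contained sketch on the same scaled digits data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# How much of the 64-dimensional variance survives a 2D linear projection?
X = StandardScaler().fit_transform(load_digits().data)
pca2 = PCA(n_components=2, random_state=42).fit(X)
kept = pca2.explained_variance_ratio_.sum()
print(f"Variance kept by 2 PCs: {kept:.1%}")
```

The fraction is well under half, which is why linearly projected digit classes overlap heavily while t-SNE, which optimizes neighbourhood preservation rather than variance, separates them.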
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
colors = plt.cm.tab10(np.linspace(0, 1, 10))
for ax, (X_embed, method) in zip(axes, [(X_pca2d, "PCA"), (X_tsne, "t-SNE")]):
    for digit in range(10):
        mask = y == digit
        ax.scatter(X_embed[mask, 0], X_embed[mask, 1], c=[colors[digit]],
                   label=str(digit), s=20, alpha=0.7)
    ax.set_title(f"{method} Embedding -- Digit Recognition\nEach color = one digit class")
    ax.legend(title="Digit", loc="upper right", fontsize=8, markerscale=1.5)
plt.tight_layout()
plt.savefig("tsne_pca.png", dpi=100, bbox_inches="tight")
plt.show()
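One concrete reason t-SNE coordinates should never feed a downstream model: scikit-learn's `TSNE` has no `transform` method, so new samples cannot be mapped into an existing embedding consistently. A one-line check:

```python
from sklearn.manifold import TSNE

# TSNE offers only fit_transform -- there is no out-of-sample transform,
# so new data points cannot be embedded into a previously fitted layout.
print(hasattr(TSNE(n_components=2), "transform"))  # False
```

By contrast, `PCA` does expose `transform`, which is exactly what makes it usable inside a modeling pipeline.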
# t-SNE PARAMETERS
print("t-SNE critical parameters:")
params = {
    "perplexity": "Controls balance of local vs global structure. Range: 5-50. Try 30 first.",
    "n_iter": "More iterations = better convergence. 1000 minimum; use 2000 for production.",
    "learning_rate": "Use 'auto' (sklearn>=1.2). Manual: n_samples/12 is a common heuristic.",
    "init": "'pca' gives better results than random initialization. Use 'pca'.",
    "random_state": "Fix for reproducibility -- t-SNE is stochastic, different runs look different.",
}
for param, tip in params.items():
    print(f"  {param:16s}: {tip}")
print("\nCRITICAL RULES for t-SNE:")
rules = [
    "Do NOT use t-SNE coordinates as features in models -- distances are not globally meaningful",
    "Same class does NOT mean nearby in t-SNE -- cluster sizes are meaningless",
    "Different runs produce different layouts -- fix random_state",
    "Run PCA first (to ~50 dims) before t-SNE when n_features >> 50",
    "UMAP (pip install umap-learn) is faster, preserves global structure better, more reproducible",
]
for rule in rules:
    print(f"  -> {rule}")
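UMAP, mentioned in the last rule, follows the same `fit_transform` pattern with two main knobs: `n_neighbors` (a rough analogue of perplexity) and `min_dist` (how tightly clusters pack). A minimal sketch, assuming the `umap-learn` package is installed; it degrades gracefully when it is not:

```python
try:
    import umap  # provided by the umap-learn package (assumed installed)
except ImportError:
    umap = None

from sklearn.datasets import load_digits

X = load_digits().data
if umap is not None:
    # Common starting values: n_neighbors=15, min_dist=0.1
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                        random_state=42)
    X_umap = reducer.fit_transform(X)
    print("UMAP embedding:", X_umap.shape)
else:
    print("umap-learn not installed (pip install umap-learn)")
```

Unlike scikit-learn's TSNE, a fitted `umap.UMAP` does expose `transform`, so new points can be projected into an existing embedding, one reason it is often preferred in practice.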