Distribution Analysis — Histograms and Density Plots
Distribution analysis tells you the shape of each feature: is it normally distributed (a symmetric bell curve), right-skewed (most values low, a few very high — like income), left-skewed, bimodal, or uniform? The shape determines which scaler to use, whether a log transform is needed, whether a feature is degenerate (near-zero variance), and whether it discriminates between classes.
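To see why skew matters for the log-transform decision, here is a minimal sketch on synthetic data (the lognormal sample is a made-up stand-in for income-like, right-skewed values, not part of the dataset used below):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: lognormal data mimics income-like features
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=5000))

# A log transform pulls the long right tail in, bringing skew near zero
print(f"raw skew: {income.skew():+.2f}")
print(f"log skew: {np.log1p(income).skew():+.2f}")
```

A strongly positive raw skew that drops to roughly zero after `np.log1p` is the classic signal that the transformed feature will behave better with scale-sensitive models.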
Histograms, KDE, and Distribution Comparison
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["diagnosis"] = cancer.target.astype(bool) # True=benign, False=malignant
# HISTOGRAM + KDE -- see shape of single feature
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(["mean radius", "mean texture", "mean area"]):
    sns.histplot(df[col], kde=True, ax=axes[i], bins=30, color="steelblue", edgecolor="white")
    axes[i].set_title(f"Distribution: {col}")
    axes[i].axvline(df[col].mean(), color="red", linestyle="--", label="mean")
    axes[i].axvline(df[col].median(), color="orange", linestyle="--", label="median")
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png", dpi=100, bbox_inches="tight")
plt.show()
# COMPARE DISTRIBUTION BY CLASS -- does this feature separate classes?
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(["mean radius", "mean concavity", "mean texture"]):
    for label, color in [(True, "steelblue"), (False, "tomato")]:
        subset = df[df["diagnosis"] == label][col]
        name = "benign" if label else "malignant"
        sns.kdeplot(subset, ax=axes[i], fill=True, alpha=0.4, label=name, color=color)
    axes[i].set_title(f"Class Separation: {col}")
    axes[i].legend()
plt.tight_layout()
plt.savefig("class_distributions.png", dpi=100, bbox_inches="tight")
plt.show()
# QUANTIFY DISTRIBUTION SHAPE
print("Distribution shape metrics:")
for col in ["mean radius", "mean texture", "mean area"]:
    skewness = df[col].skew()
    kurtosis = df[col].kurt()
    shape = "right-skewed" if skewness > 0.5 else ("left-skewed" if skewness < -0.5 else "approx normal")
    print(f"  {col:20s}: skew={skewness:+.2f} ({shape}), kurt={kurtosis:+.2f}")
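The intro also mentions degenerate, near-zero-variance features. A simple screen is the coefficient of variation (std divided by mean); this is a sketch with an arbitrary 5% threshold chosen purely for illustration:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Coefficient of variation as a crude degeneracy screen;
# the 0.05 cutoff is an assumption, tune it for your data
cv = df.std() / df.mean().abs()
degenerate = cv[cv < 0.05].index.tolist()
print("near-constant features:", degenerate or "none")
```

Features flagged here vary so little relative to their magnitude that they are unlikely to help a model and can distort distance-based methods.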
# BOXPLOTS -- spot outliers and compare groups
fig, ax = plt.subplots(figsize=(10, 5))
top_features = ["mean radius", "mean texture", "mean perimeter",
                "mean area", "mean smoothness"]
df_plot = df[top_features].copy()
# Normalize for comparable scales
df_plot = (df_plot - df_plot.min()) / (df_plot.max() - df_plot.min())
df_plot["diagnosis"] = df["diagnosis"].map({True: "benign", False: "malignant"})
df_melt = df_plot.melt(id_vars="diagnosis", var_name="feature", value_name="normalized_value")
sns.boxplot(data=df_melt, x="feature", y="normalized_value", hue="diagnosis", ax=ax)
ax.set_title("Feature Distributions by Diagnosis (Normalized)")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.savefig("boxplots.png", dpi=100, bbox_inches="tight")
plt.show()
print("\nBox plot rule: features where boxes DON'T overlap are highly predictive!")
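The box-plot overlap rule can be made quantitative with a standardized mean difference (Cohen's d). This is a sketch added for illustration, not part of the original workflow:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
benign = df[cancer.target == 1]
malignant = df[cancer.target == 0]

# Cohen's d: class mean difference in units of pooled standard deviation.
# Larger |d| means less overlap between the two class distributions.
pooled_std = ((benign.std() ** 2 + malignant.std() ** 2) / 2) ** 0.5
d = ((benign.mean() - malignant.mean()) / pooled_std).abs()
print(d.sort_values(ascending=False).head(5))
```

Features with d well above 1 correspond to the non-overlapping boxes in the plot and are the strongest single-feature discriminators.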
Tip
Practice histograms and density plots in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of distribution analysis with histograms and density plots from scratch, without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with distribution analysis is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
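As a concrete illustration of those edge cases, here is a minimal sketch of how the shape metrics used above behave on degenerate inputs (`safe_skew` is a hypothetical helper, not a pandas function):

```python
import math
import pandas as pd

# Degenerate inputs that silently break shape metrics
empty = pd.Series([], dtype=float)
with_nan = pd.Series([1.0, 2.0, None, 4.0])

print(empty.skew())     # nan: skew is undefined on an empty series
print(with_nan.skew())  # NaNs are skipped by default (skipna=True)

# Guard before relying on the metric
def safe_skew(s: pd.Series) -> float:
    s = s.dropna()
    if len(s) < 3 or s.nunique() == 1:
        return float("nan")  # too few values, or zero variance
    return float(s.skew())

print(safe_skew(pd.Series([5.0] * 10)))  # nan: constant feature, shape undefined
```

Returning NaN explicitly for constant or near-empty columns keeps a downstream "which scaler do I use?" decision from acting on a meaningless number.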