Distribution Analysis — Histograms and Density Plots
Distribution analysis tells you the shape of each feature: is it normally distributed (a symmetric bell curve), right-skewed (most values low, a few very high — like income), left-skewed, bimodal, or uniform? The shape determines which scaler to use, whether a log transform is needed, whether a feature is degenerate (near-zero variance), and whether it discriminates between classes.
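To see why skew matters for the log-transform decision, here is a minimal sketch on synthetic data (the lognormal sample is a made-up stand-in for income-like, right-skewed values, not part of the dataset used below):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: lognormal data mimics income-like features
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=5000))

# A log transform pulls the long right tail in, bringing skew near zero
print(f"raw skew: {income.skew():+.2f}")
print(f"log skew: {np.log1p(income).skew():+.2f}")
```

A strongly positive raw skew that drops to roughly zero after `np.log1p` is the classic signal that the transformed feature will behave better with scale-sensitive models.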
Histograms, KDE, and Distribution Comparison
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["diagnosis"] = cancer.target.astype(bool) # True=benign, False=malignant
# HISTOGRAM + KDE -- see shape of single feature
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(["mean radius", "mean texture", "mean area"]):
    sns.histplot(df[col], kde=True, ax=axes[i], bins=30, color="steelblue", edgecolor="white")
    axes[i].set_title(f"Distribution: {col}")
    axes[i].axvline(df[col].mean(), color="red", linestyle="--", label="mean")
    axes[i].axvline(df[col].median(), color="orange", linestyle="--", label="median")
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png", dpi=100, bbox_inches="tight")
plt.show()
# COMPARE DISTRIBUTION BY CLASS -- does this feature separate classes?
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(["mean radius", "mean concavity", "mean texture"]):
    for label, color in [(True, "steelblue"), (False, "tomato")]:
        subset = df[df["diagnosis"] == label][col]
        name = "benign" if label else "malignant"
        sns.kdeplot(subset, ax=axes[i], fill=True, alpha=0.4, label=name, color=color)
    axes[i].set_title(f"Class Separation: {col}")
    axes[i].legend()
plt.tight_layout()
plt.savefig("class_distributions.png", dpi=100, bbox_inches="tight")
plt.show()
# QUANTIFY DISTRIBUTION SHAPE
print("Distribution shape metrics:")
for col in ["mean radius", "mean texture", "mean area"]:
    skewness = df[col].skew()
    kurtosis = df[col].kurt()
    shape = "right-skewed" if skewness > 0.5 else ("left-skewed" if skewness < -0.5 else "approx normal")
    print(f"  {col:20s}: skew={skewness:+.2f} ({shape}), kurt={kurtosis:+.2f}")
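The intro also mentions degenerate, near-zero-variance features. A simple screen is the coefficient of variation (std divided by mean); this is a sketch with an arbitrary 5% threshold chosen purely for illustration:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Coefficient of variation as a crude degeneracy screen;
# the 0.05 cutoff is an assumption, tune it for your data
cv = df.std() / df.mean().abs()
degenerate = cv[cv < 0.05].index.tolist()
print("near-constant features:", degenerate or "none")
```

Features flagged here vary so little relative to their magnitude that they are unlikely to help a model and can distort distance-based methods.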
# BOXPLOTS -- spot outliers and compare groups
fig, ax = plt.subplots(figsize=(10, 5))
top_features = ["mean radius", "mean texture", "mean perimeter",
                "mean area", "mean smoothness"]
df_plot = df[top_features].copy()
# Normalize for comparable scales
df_plot = (df_plot - df_plot.min()) / (df_plot.max() - df_plot.min())
df_plot["diagnosis"] = df["diagnosis"].map({True: "benign", False: "malignant"})
df_melt = df_plot.melt(id_vars="diagnosis", var_name="feature", value_name="normalized_value")
sns.boxplot(data=df_melt, x="feature", y="normalized_value", hue="diagnosis", ax=ax)
ax.set_title("Feature Distributions by Diagnosis (Normalized)")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.savefig("boxplots.png", dpi=100, bbox_inches="tight")
plt.show()
print("\nBox plot rule: features where boxes DON'T overlap are highly predictive!")
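The box-plot overlap rule can be made quantitative with a standardized mean difference (Cohen's d). This is a sketch added for illustration, not part of the original workflow:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
benign = df[cancer.target == 1]
malignant = df[cancer.target == 0]

# Cohen's d: class mean difference in units of pooled standard deviation.
# Larger |d| means less overlap between the two class distributions.
pooled_std = ((benign.std() ** 2 + malignant.std() ** 2) / 2) ** 0.5
d = ((benign.mean() - malignant.mean()) / pooled_std).abs()
print(d.sort_values(ascending=False).head(5))
```

Features with d well above 1 correspond to the non-overlapping boxes in the plot and are the strongest single-feature discriminators.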
Tip
Practice histograms and density plots in small, isolated examples before integrating them into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of distribution analysis with histograms and density plots from scratch, without looking at notes. (2) Modify it to handle an edge case (empty input, null values, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with distribution analysis is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
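As a concrete illustration of those edge cases, here is a minimal sketch of how the shape metrics used above behave on degenerate inputs (`safe_skew` is a hypothetical helper, not a pandas function):

```python
import math
import pandas as pd

# Degenerate inputs that silently break shape metrics
empty = pd.Series([], dtype=float)
with_nan = pd.Series([1.0, 2.0, None, 4.0])

print(empty.skew())     # nan: skew is undefined on an empty series
print(with_nan.skew())  # NaNs are skipped by default (skipna=True)

# Guard before relying on the metric
def safe_skew(s: pd.Series) -> float:
    s = s.dropna()
    if len(s) < 3 or s.nunique() == 1:
        return float("nan")  # too few values, or zero variance
    return float(s.skew())

print(safe_skew(pd.Series([5.0] * 10)))  # nan: constant feature, shape undefined
```

Returning NaN explicitly for constant or near-empty columns keeps a downstream "which scaler do I use?" decision from acting on a meaningless number.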