Random Forest — Bagging at Scale
Random Forest builds many decision trees on random bootstrap samples and random feature subsets, then aggregates their predictions. The two sources of randomness (data AND features) decorrelate the trees, so their individual errors tend to average out in the ensemble. Random Forest feature importance is widely used for feature selection in industry, but the built-in impurity-based scores have known biases (they favor high-cardinality and correlated features), which is why the walkthrough below also computes permutation importance.
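To make the mechanism concrete, here is a minimal hand-rolled sketch of bagging: bootstrap-resample the rows, fit one tree per resample, and majority-vote the predictions. This is illustrative only; unlike RandomForestClassifier, it skips the per-split feature subsampling, and the tree count of 25 is arbitrary.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
rng = np.random.default_rng(42)
trees = []
for _ in range(25):
    # Bootstrap: draw n rows WITH replacement -- each tree misses ~37% of rows
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))
# Aggregate: majority vote across the 25 trees (odd count, so no ties)
votes = np.stack([t.predict(X_te) for t in trees])
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)
print(f"Hand-rolled bagging accuracy: {accuracy_score(y_te, y_vote):.4f}")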
Random Forest — Key Parameters and Feature Importance
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.inspection import permutation_importance
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# KEY PARAMETER EFFECTS
print("Random Forest parameter sensitivity:")
for n_trees in [10, 50, 100, 200, 500]:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=42, n_jobs=-1)
    cv = cross_val_score(rf, X_train, y_train, cv=5, scoring="roc_auc").mean()
    print(f" n_estimators={n_trees:4d}: AUC={cv:.4f}")
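# (Added sketch, not in the original walkthrough) max_features deserves the
# same sweep: it sets the per-split feature subsampling that decorrelates
# the trees. All values below are standard scikit-learn options.
for mf in ["sqrt", "log2", 0.5, 1.0]:
    rf = RandomForestClassifier(n_estimators=200, max_features=mf, random_state=42, n_jobs=-1)
    cv = cross_val_score(rf, X_train, y_train, cv=5, scoring="roc_auc").mean()
    print(f" max_features={mf!s:>5}: AUC={cv:.4f}")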
# OPTIMAL MODEL
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,  # grow trees fully -- the ensemble, not pruning, controls overfitting
    min_samples_leaf=2,  # require >= 2 samples at each leaf (smooths predictions)
    max_features="sqrt",  # sqrt(n_features) at each split -- the key RF randomness
    class_weight="balanced",
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)
y_prob = rf.predict_proba(X_test)[:, 1]
print(f"\nFinal RF Test AUC: {roc_auc_score(y_test, y_prob):.4f}")
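# classification_report (imported above) adds per-class precision/recall at the
# default 0.5 threshold, complementing the threshold-free AUC
print(classification_report(y_test, rf.predict(X_test), target_names=cancer.target_names))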
# FEATURE IMPORTANCE TYPE 1: Gini impurity decrease (fast, built-in)
gini_importance = pd.DataFrame({
"Feature": cancer.feature_names,
"Gini_Importance": rf.feature_importances_,
}).sort_values("Gini_Importance", ascending=False)
# FEATURE IMPORTANCE TYPE 2: Permutation importance (slower, more reliable)
perm_result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
perm_importance = pd.DataFrame({
"Feature": cancer.feature_names,
"Perm_Mean": perm_result.importances_mean,
"Perm_Std": perm_result.importances_std,
}).sort_values("Perm_Mean", ascending=False)
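# (Added sketch) Put the two rankings side by side; large disagreements often
# flag correlated features, where impurity-based importance is known to be biased
comparison = gini_importance.merge(perm_importance, on="Feature")
comparison["Gini_Rank"] = comparison["Gini_Importance"].rank(ascending=False, method="first").astype(int)
comparison["Perm_Rank"] = comparison["Perm_Mean"].rank(ascending=False, method="first").astype(int)
print(comparison[["Feature", "Gini_Rank", "Perm_Rank"]].head(10))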
# VISUALIZE BOTH
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
gini_importance.head(15).plot(kind="barh", x="Feature", y="Gini_Importance", ax=axes[0], color="steelblue", legend=False)
axes[0].set_title("Gini Feature Importance (fast)")
axes[0].invert_yaxis()
perm_importance.head(15).plot(kind="barh", x="Feature", y="Perm_Mean", ax=axes[1], xerr="Perm_Std", color="coral", legend=False)
axes[1].set_title("Permutation Importance (reliable)")
axes[1].invert_yaxis()
plt.tight_layout()
plt.savefig("rf_feature_importance.png", dpi=100, bbox_inches="tight")
plt.show()
# OUT-OF-BAG ERROR -- free internal validation!
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42, n_jobs=-1)
rf_oob.fit(X, y) # fitting on all data
print(f"\nOOB Score (free estimate without separate val set): {rf_oob.oob_score_:.4f}")
print(" How: each tree predicts on the ~37% of data it wasn't trained on")Tip
Tip
Practice Random Forest bagging in small, isolated examples before integrating it into larger projects; breaking concepts into small experiments builds genuine understanding faster than reading alone. On tabular data, gradient-boosted trees such as XGBoost often take the top spot, but Random Forest remains a strong, low-tuning baseline.
Practice Task
(1) Write a working Random Forest bagging example from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, a null value, or an error state). (3) Share your solution in the Priygop community for feedback.
Warning
A common mistake with Random Forest is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.