SHAP-Based Feature Selection
SHAP values are among the most reliable feature importance methods: they account for feature interactions, satisfy consistency (if a model changes so that a feature's marginal contribution grows, its attribution never decreases), and provide both global and local interpretability. Ranking features by mean |SHAP| generally outperforms Gini (impurity) importance, which is biased toward high-cardinality and correlated features.
Feature Selection with SHAP Values
import numpy as np
import pandas as pd
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# TRAIN XGB MODEL
xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                              eval_metric="logloss", verbosity=0, random_state=42)
xgb_model.fit(X_train, y_train)
# COMPUTE SHAP VALUES
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)
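# NOTE: for binary XGBoost models, TreeExplainer typically returns a single
# (n_samples, n_features) array of log-odds contributions, but some SHAP
# versions return a per-class list instead. Normalize defensively (assumption:
# class 1 is the positive class):
if isinstance(shap_values, list):
    shap_values = shap_values[1]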
# MEAN |SHAP| = GLOBAL FEATURE IMPORTANCE
mean_abs_shap = pd.DataFrame({
    "Feature": cancer.feature_names,
    "Mean |SHAP|": np.abs(shap_values).mean(axis=0),
    "XGB Gini Imp": xgb_model.feature_importances_,
}).sort_values("Mean |SHAP|", ascending=False)
print("Feature ranking comparison: SHAP vs Gini Importance")
print(mean_abs_shap.round(4).to_string(index=False))
# SHAP SUMMARY PLOT (shows direction of influence too)
plt.figure(figsize=(9, 7))
shap.summary_plot(shap_values, X_test, plot_type="dot", show=False)
plt.title("SHAP Summary Plot -- Direction and Magnitude of Feature Impact")
plt.tight_layout()
plt.savefig("shap_selection.png", dpi=100, bbox_inches="tight")
plt.show()
# SELECT TOP-N FEATURES BY SHAP
# Cross-validated AUC on the training data using only the top_n SHAP-ranked features
def shap_select_train_eval(top_n: int, X_tr: pd.DataFrame, y_tr: np.ndarray) -> float:
    top_features = mean_abs_shap.head(top_n)["Feature"].tolist()
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    return cross_val_score(model, X_tr[top_features], y_tr, cv=5, scoring="roc_auc").mean()
print("\nAUC-ROC by number of features (SHAP-ranked):")
for n in [5, 10, 15, 20, 25, 30]:
    auc = shap_select_train_eval(n, X_train, y_train)
    print(f" Top {n:2d} features: AUC = {auc:.4f}")
print("\nConclusion: Beyond top-10 features, gains are marginal on this dataset.")
print("SHAP-based selection: fewer features + same AUC = simpler, faster, more interpretable model")Tip
Tip
Practice SHAP-based feature selection in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
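For example, a minimal isolated experiment (reusing xgb_model, explainer, and shap_values from the listing above) is to verify SHAP's local-accuracy property: the expected value plus a row's per-feature SHAP values should reconstruct the model's raw log-odds output.

# Local-accuracy check: base value + sum of SHAP values ~= raw model output
raw = xgb_model.predict(X_test, output_margin=True)          # log-odds predictions
recon = explainer.expected_value + shap_values.sum(axis=1)   # SHAP reconstruction
print("Max additivity error:", np.abs(raw - recon).max())    # should be near zero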
In tabular ML, feature engineering and selection often matter more than the choice of model.
Practice Task
(1) Write a working example of SHAP-based feature selection from scratch without looking at notes. (2) Modify it to handle an edge case (an empty input, a null value, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with SHAP-based feature selection is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
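As a sketch of that kind of validation, a guard in front of the SHAP step might look like the following (the helper validate_for_shap is hypothetical, not part of the listing above):

import pandas as pd

# Hypothetical input guard before computing SHAP values (illustrative only)
def validate_for_shap(X: pd.DataFrame) -> pd.DataFrame:
    if X.empty:
        raise ValueError("Input frame is empty; nothing to explain.")
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise TypeError(f"Encode non-numeric columns first: {non_numeric}")
    if X.isna().any().any():
        # Tree models tolerate NaNs, but rankings on heavily missing columns are unstable
        print("Warning: NaNs in", X.columns[X.isna().any()].tolist())
    return X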