SHAP-Based Feature Selection
SHAP values are among the most reliable feature importance methods: they account for feature interactions, satisfy consistency (if a model changes so that a feature's marginal contribution grows, its attribution never decreases), and provide both global and local interpretability. Ranking features by mean |SHAP| generally outperforms Gini (impurity) importance, which is biased toward high-cardinality and correlated features.
Feature Selection with SHAP Values
import numpy as np
import pandas as pd
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingClassifier
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# TRAIN XGB MODEL
xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                              eval_metric="logloss", verbosity=0, random_state=42)
xgb_model.fit(X_train, y_train)
# COMPUTE SHAP VALUES
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)
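# NOTE: for binary XGBoost models, TreeExplainer typically returns a single
# (n_samples, n_features) array of log-odds contributions, but some SHAP
# versions return a per-class list instead. Normalize defensively (assumption:
# class 1 is the positive class):
if isinstance(shap_values, list):
    shap_values = shap_values[1]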
# MEAN |SHAP| = GLOBAL FEATURE IMPORTANCE
mean_abs_shap = pd.DataFrame({
    "Feature": cancer.feature_names,
    "Mean |SHAP|": np.abs(shap_values).mean(axis=0),
    "XGB Gini Imp": xgb_model.feature_importances_,
}).sort_values("Mean |SHAP|", ascending=False)
print("Feature ranking comparison: SHAP vs Gini Importance")
print(mean_abs_shap.round(4).to_string(index=False))
# SHAP SUMMARY PLOT (shows direction of influence too)
plt.figure(figsize=(9, 7))
shap.summary_plot(shap_values, X_test, plot_type="dot", show=False)
plt.title("SHAP Summary Plot -- Direction and Magnitude of Feature Impact")
plt.tight_layout()
plt.savefig("shap_selection.png", dpi=100, bbox_inches="tight")
plt.show()
# SELECT TOP-N FEATURES BY SHAP
# Cross-validated AUC on the training data using only the top_n SHAP-ranked features
def shap_select_train_eval(top_n: int, X_tr: pd.DataFrame, y_tr: np.ndarray) -> float:
    top_features = mean_abs_shap.head(top_n)["Feature"].tolist()
    model = GradientBoostingClassifier(n_estimators=100, random_state=42)
    return cross_val_score(model, X_tr[top_features], y_tr, cv=5, scoring="roc_auc").mean()
print("\nAUC-ROC by number of features (SHAP-ranked):")
for n in [5, 10, 15, 20, 25, 30]:
    auc = shap_select_train_eval(n, X_train, y_train)
    print(f" Top {n:2d} features: AUC = {auc:.4f}")
print("\nConclusion: Beyond top-10 features, gains are marginal on this dataset.")
print("SHAP-based selection: fewer features + same AUC = simpler, faster, more interpretable model")Tip
Tip
Practice SHAP-based feature selection in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
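For example, a minimal isolated experiment (reusing xgb_model, explainer, and shap_values from the listing above) is to verify SHAP's local-accuracy property: the expected value plus a row's per-feature SHAP values should reconstruct the model's raw log-odds output.

# Local-accuracy check: base value + sum of SHAP values ~= raw model output
raw = xgb_model.predict(X_test, output_margin=True)          # log-odds predictions
recon = explainer.expected_value + shap_values.sum(axis=1)   # SHAP reconstruction
print("Max additivity error:", np.abs(raw - recon).max())    # should be near zero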
In tabular ML, feature engineering and selection often matter more than the choice of model.
Practice Task
(1) Write a working example of SHAP-based feature selection from scratch without looking at notes. (2) Modify it to handle an edge case (an empty input, a null value, or an error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with SHAP-based feature selection is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
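As a sketch of that kind of validation, a guard in front of the SHAP step might look like the following (the helper validate_for_shap is hypothetical, not part of the listing above):

import pandas as pd

# Hypothetical input guard before computing SHAP values (illustrative only)
def validate_for_shap(X: pd.DataFrame) -> pd.DataFrame:
    if X.empty:
        raise ValueError("Input frame is empty; nothing to explain.")
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise TypeError(f"Encode non-numeric columns first: {non_numeric}")
    if X.isna().any().any():
        # Tree models tolerate NaNs, but rankings on heavily missing columns are unstable
        print("Warning: NaNs in", X.columns[X.isna().any()].tolist())
    return X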