Model Calibration — Are Probabilities Trustworthy?
A model that predicts an 80% probability should be right about 80% of the time. When this isn't true, the model is miscalibrated. Random Forests rarely produce probabilities near 0 or 1, because averaging over many trees pulls predictions toward the middle. SVMs are similarly under-confident, pushing scores away from the extremes, while Naive Bayes produces overly extreme values because of its feature-independence assumption. Calibration matters when probabilities drive business decisions such as insurance premiums, loan rates, or medical risk scores.
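Before reaching for library tooling, it helps to see what a calibration (reliability) check actually computes. The sketch below is illustrative only; the function name and bin count are my own, not part of the script that follows. It bins predicted probabilities and compares each bin's mean prediction against the observed fraction of positives:
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Per bin: (mean predicted probability, observed positive rate, count)."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # assign each prediction to one of n_bins bins
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue  # skip empty bins
        rows.append((y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

# For a well-calibrated model the first two numbers in each row track each other;
# systematic gaps are what the calibration curves below make visible.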
Calibration Curves and Isotonic/Platt Calibration
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
from sklearn.metrics import brier_score_loss
np.random.seed(42)
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# MODELS WITH DIFFERENT CALIBRATION CHARACTERISTICS
models = {
"Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
"Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
"Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
"Naive Bayes": GaussianNB(),
}
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# CALIBRATION CURVES
for name, model in models.items():
    model.fit(X_train, y_train)
    CalibrationDisplay.from_estimator(model, X_test, y_test, n_bins=10, ax=axes[0], name=name)
    # Brier score = mean squared error between the predicted probability and the 0/1 outcome
    bs = brier_score_loss(y_test, model.predict_proba(X_test)[:, 1])
    print(f" {name:22s}: Brier score = {bs:.4f} (lower = better calibration + accuracy)")
axes[0].set_title("Calibration Curves\n(Diagonal = perfectly calibrated)")
# FIX CALIBRATION: Platt Scaling (sigmoid) or Isotonic Regression
rf_uncal = RandomForestClassifier(n_estimators=100, random_state=42)
# Platt scaling fits a sigmoid (logistic) mapping from raw scores to probabilities;
# isotonic regression fits a non-parametric, monotonically increasing mapping.
# With cv=5, CalibratedClassifierCV handles the train/calibration splitting internally.
rf_platt = CalibratedClassifierCV(RandomForestClassifier(n_estimators=100, random_state=42), method="sigmoid", cv=5)
rf_isotonic = CalibratedClassifierCV(RandomForestClassifier(n_estimators=100, random_state=42), method="isotonic", cv=5)
for cal_model, cal_name in [(rf_uncal, "RF Uncalibrated"), (rf_platt, "RF + Platt"), (rf_isotonic, "RF + Isotonic")]:
    cal_model.fit(X_train, y_train)
    CalibrationDisplay.from_estimator(cal_model, X_test, y_test, n_bins=10, ax=axes[1], name=cal_name)
    bs = brier_score_loss(y_test, cal_model.predict_proba(X_test)[:, 1])
    print(f" {cal_name:22s}: Brier score = {bs:.4f}")
axes[1].set_title("Calibration: RF Before/After Calibration")
plt.tight_layout()
plt.savefig("calibration_curves.png", dpi=100, bbox_inches="tight")
plt.show()
print("\nWhen calibration matters:")
use_cases = [
"Insurance pricing: probabilities directly map to premium calculations",
"Medical risk scoring: 30% risk score means different treatment than 60%",
"Ranking/sorting: if only comparing predictions, ROC-AUC matters, not calibration",
"Decision threshold: if just predicting class labels, calibration less critical",
]
for case in use_cases:
    print(f" -> {case}")
Tip
Practice model calibration in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task
(1) Write a working example of model calibration from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with model calibration is skipping edge case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
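As a concrete illustration of that kind of boundary check, the sketch below wraps predict_proba with basic input validation; the helper name and error messages are illustrative, not part of the script above:
import numpy as np

def safe_predict_proba(model, X):
    """Validate the feature matrix before asking a calibrated model for probabilities."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError(f"Expected a non-empty 2D feature array, got shape {X.shape}")
    if not np.isfinite(X).all():
        raise ValueError("Input contains NaN or infinite values; impute or drop them first")
    return model.predict_proba(X)[:, 1]

# Turning silent garbage-in/garbage-out failures into explicit errors makes the
# probability-driven decisions downstream much easier to trust.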