Algorithm Selection Cheat Sheet
Choosing the right classification algorithm in practice: always start with logistic regression as a baseline. Add a tree or random forest to capture non-linearity. Reserve SVM and tuned ensembles for when you need maximum performance. Consider dataset size, interpretability requirements, training speed, and feature types when making the choice.
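This baseline-first workflow can be sketched with scikit-learn. A minimal sketch on synthetic data; the dataset and model settings below are illustrative, not tuned:

```python
# Baseline-first workflow: fit a logistic regression, then check whether a
# random forest's extra capacity actually buys accuracy on held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print(f"LogisticRegression accuracy: {baseline.score(X_test, y_test):.3f}")
print(f"RandomForest accuracy:       {forest.score(X_test, y_test):.3f}")
```

Only escalate to SVMs or tuned ensembles if the held-out gap justifies the added complexity and lost interpretability.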
Classification Algorithm Selection Guide
# CLASSIFICATION ALGORITHM SELECTION GUIDE
# Run this as a decision framework for any new classification project
def recommend_classifier(
    n_samples: int,
    n_features: int,
    interpretable: bool,
    has_missing: bool,
    data_type: str,  # "tabular", "text", "time_series"
    priority: str,   # "speed", "accuracy", "balance"
) -> str:
    """Recommend a classification algorithm given problem characteristics."""
    if data_type == "text":
        return "Naive Bayes (MultinomialNB) or Logistic Regression with TF-IDF"
    if n_samples < 500:
        return "SVM (RBF) -- works well on small datasets; also try LogisticRegression"
    if interpretable:
        if n_features < 20:
            return "LogisticRegression -- coefficients interpretable; or DecisionTree for a flowchart"
        return "LogisticRegression with L1 (Lasso) penalty for feature selection"
    if n_samples > 100_000:
        if priority == "speed":
            return "SGDClassifier with log loss -- mini-batch logistic regression, fast on large data"
        return "LightGBM -- handles large data with built-in missing-value support"
    if has_missing:
        return "XGBoost/LightGBM -- handle missing values natively (scikit-learn's RandomForest only from 1.4)"
    if priority == "accuracy":
        return "GradientBoosting (XGBoost/LightGBM) -- typically highest accuracy on tabular data"
    return "RandomForest -- robust, easy to tune, good baseline for most problems"
# TEST THE RECOMMENDER
test_cases = [
    {"n_samples": 200,    "n_features": 10, "interpretable": False, "has_missing": False, "data_type": "tabular", "priority": "accuracy"},
    {"n_samples": 50000,  "n_features": 25, "interpretable": True,  "has_missing": False, "data_type": "tabular", "priority": "balance"},
    {"n_samples": 5000,   "n_features": 50, "interpretable": False, "has_missing": True,  "data_type": "tabular", "priority": "accuracy"},
    {"n_samples": 10000,  "n_features": 5,  "interpretable": True,  "has_missing": False, "data_type": "text",    "priority": "speed"},
    {"n_samples": 500000, "n_features": 30, "interpretable": False, "has_missing": True,  "data_type": "tabular", "priority": "accuracy"},
]
print("Classification Algorithm Recommendations:")
print("-" * 70)
for i, case in enumerate(test_cases, 1):
    rec = recommend_classifier(**case)
    print(f"\nCase {i}: n={case['n_samples']}, features={case['n_features']}, interp={case['interpretable']},")
    print(f"        missing={case['has_missing']}, type={case['data_type']}, priority={case['priority']}")
    print(f"  -> Recommended: {rec}")
# QUICK COMPARISON TABLE
print("\n\nQuick Reference Table:")
comparison = {
    "LogisticRegression": {"speed": "Fast",    "accuracy": "Good",   "interpret": "High",   "missing": "No",  "scales": ">1M"},
    "DecisionTree":       {"speed": "Fast",    "accuracy": "Medium", "interpret": "High",   "missing": "No",  "scales": ">1M"},
    "RandomForest":       {"speed": "Medium",  "accuracy": "Great",  "interpret": "Medium", "missing": "Yes", "scales": "500k"},
    "GradientBoosting":   {"speed": "Slow",    "accuracy": "Best",   "interpret": "Low",    "missing": "Yes", "scales": "500k"},
    "SVM (RBF)":          {"speed": "Slow",    "accuracy": "Great",  "interpret": "Low",    "missing": "No",  "scales": "50k"},
    "KNN":                {"speed": "Slow*",   "accuracy": "Good",   "interpret": "Low",    "missing": "No",  "scales": "100k"},
    "NaiveBayes":         {"speed": "Fastest", "accuracy": "Medium", "interpret": "Medium", "missing": "No",  "scales": ">1M"},
}
header = f"{'Algorithm':<22} {'Train Speed':>12} {'Accuracy':>10} {'Interpretable':>14} {'Handles NaN':>12} {'Max Rows':>10}"
print(header)
print("-" * 84)
for name, props in comparison.items():
    print(f"{name:<22} {props['speed']:>12} {props['accuracy']:>10} {props['interpret']:>14} {props['missing']:>12} {props['scales']:>10}")
print("* KNN has no real training step; the cost shows up at prediction time (distance to every stored sample).")
Tip
Practice Algorithm Selection Cheat Sheet in small, isolated examples before integrating into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Practice Task — (1) Write a working example of Algorithm Selection Cheat Sheet from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with the Algorithm Selection Cheat Sheet is skipping edge-case testing — empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
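One way to sketch that kind of boundary-condition check for the recommender above. The `validate_classifier_inputs` helper and its error messages are illustrative, not part of any library:

```python
def validate_classifier_inputs(n_samples: int, n_features: int,
                               data_type: str, priority: str) -> None:
    """Raise ValueError on bad inputs before any recommendation logic runs."""
    if n_samples <= 0 or n_features <= 0:
        raise ValueError("n_samples and n_features must be positive")
    if data_type not in {"tabular", "text", "time_series"}:
        raise ValueError(f"unknown data_type: {data_type!r}")
    if priority not in {"speed", "accuracy", "balance"}:
        raise ValueError(f"unknown priority: {priority!r}")

# Boundary conditions that would silently fall through without validation:
bad_inputs = [
    dict(n_samples=0,   n_features=10, data_type="tabular", priority="speed"),
    dict(n_samples=100, n_features=10, data_type="images",  priority="speed"),
    dict(n_samples=100, n_features=10, data_type="tabular", priority="asap"),
]
for bad in bad_inputs:
    try:
        validate_classifier_inputs(**bad)
    except ValueError as err:
        print(f"rejected: {err}")
```

Calling the validator at the top of `recommend_classifier` turns a misleading recommendation (e.g. for `data_type="images"`, which no branch handles) into an immediate, descriptive failure.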