Naive Bayes — Probabilistic Classification
Naive Bayes applies Bayes' theorem with the naive assumption that features are conditionally independent given the class. Despite this strong (usually wrong) assumption, it works surprisingly well for text classification, spam filtering, and medical diagnosis. It's extremely fast to train and update incrementally — making it ideal for streaming data and quick baselines.
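To make the mechanics concrete, here is a minimal sketch (with invented probabilities) of how the naive independence assumption turns Bayes' theorem into a simple product of per-feature likelihoods:

```python
# Toy spam example with made-up numbers: two binary features
# ("free" appears, "click" appears) and assumed class-conditional rates.
p_spam = 0.4                  # prior P(spam)
p_ham = 0.6                   # prior P(ham)
likelihood_spam = 0.7 * 0.6   # P(free|spam) * P(click|spam) -- naive independence
likelihood_ham = 0.1 * 0.2    # P(free|ham) * P(click|ham)

# Bayes' theorem: posterior is proportional to prior * likelihood
evidence = p_spam * likelihood_spam + p_ham * likelihood_ham
posterior_spam = p_spam * likelihood_spam / evidence
print(f"P(spam | 'free' and 'click') = {posterior_spam:.4f}")  # 0.9333
```

The per-feature probabilities here are invented for illustration; in practice the classifier estimates them from training data.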
Naive Bayes Variants and Spam Detection
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
# THREE NAIVE BAYES VARIANTS
nb_guide = {
    "GaussianNB": "Continuous features following a normal distribution (age, weight, measurements)",
    "MultinomialNB": "Count features (word frequencies in text, NLP bag-of-words)",
    "BernoulliNB": "Binary features (word present/absent, yes/no flags)",
}
for variant, use_case in nb_guide.items():
    print(f" {variant:16s}: {use_case}")
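As a quick illustration of why the variant choice matters, here is a small sketch (toy data invented here) fitting BernoulliNB on binary presence/absence features:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary features: each column is a "word present?" flag (invented data)
X_bin = np.array([[1, 1], [1, 1], [1, 0], [0, 0], [0, 1], [0, 0]])
y_bin = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam-like, 0 = ham-like

bnb = BernoulliNB(alpha=1.0)  # Laplace smoothing
bnb.fit(X_bin, y_bin)

# Unlike MultinomialNB, BernoulliNB also penalizes the *absence* of a
# feature, which is why it suits binary word-occurrence data.
print(bnb.predict([[1, 1]]))  # -> [1]
print(bnb.predict([[0, 0]]))  # -> [0]
```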
# GAUSSIAN NB ON MEDICAL DATA
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
gnb = GaussianNB()
cv = cross_val_score(gnb, X, y, cv=5, scoring="accuracy")
print(f"\nGaussianNB on breast cancer: CV={cv.mean():.4f} (trained in milliseconds!)")
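Under the hood, GaussianNB just stores a per-class mean and variance for each feature. A small sketch on synthetic data (attribute names are sklearn's) confirming the fitted means against a manual computation:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))          # synthetic continuous features
y_toy = (X_toy[:, 0] > 0).astype(int)      # synthetic binary labels

gnb_demo = GaussianNB().fit(X_toy, y_toy)

# theta_ holds the per-class feature means used by the Gaussian likelihoods
manual_means = np.vstack([X_toy[y_toy == k].mean(axis=0) for k in (0, 1)])
print(np.allclose(gnb_demo.theta_, manual_means))  # True
```

This is why training is so fast: fitting is just computing class-wise summary statistics in a single pass.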
# MULTINOMIAL NB FOR TEXT (SPAM DETECTION)
spam_emails = [
    "Special offer buy now click here amazing deal free money",
    "Act now limited time offer you are selected win prize",
    "Click here to claim your reward exclusive sale today",
    "Free vacation winner notification click immediately",
    "Hi John, can we schedule a meeting for Tuesday afternoon?",
    "Please review the attached quarterly report documents",
    "Team meeting agenda for next week project updates",
    "Your invoice for March services is attached for review",
    "Lunch at noon? Let me know if that works for the team",
    "Project deadline reminder please submit by Friday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] # 1=spam, 0=ham
X_train, X_test, y_train, y_test = train_test_split(spam_emails, labels, test_size=0.3, random_state=42)
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("classifier", MultinomialNB(alpha=1.0)),  # alpha=1: Laplace smoothing
])
spam_pipeline.fit(X_train, y_train)
y_pred = spam_pipeline.predict(X_test)
y_prob = spam_pipeline.predict_proba(X_test)[:, 1]
print("\nSpam Classifier:")
print(classification_report(y_test, y_pred, target_names=["Ham", "Spam"], zero_division=0))
# TEST NEW EMAILS
test_emails = [
    "Congratulations you won a prize click here",
    "Hi, let's discuss the project proposal tomorrow",
]
for email in test_emails:
    prob = spam_pipeline.predict_proba([email])[0, 1]
    verdict = "SPAM" if prob > 0.5 else "HAM"
    print(f" '{email[:40]}...' -> {verdict} ({prob:.1%})")
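The alpha=1.0 Laplace smoothing used above follows the standard estimate P(w|c) = (count + alpha) / (total + alpha * n_features). A hedged sketch with toy counts (invented here) checking that formula against MultinomialNB's learned probabilities:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: rows = documents, columns = word counts (invented)
X_counts = np.array([[2, 1, 0], [0, 1, 3]])
y_counts = np.array([1, 0])  # one spam doc, one ham doc

mnb = MultinomialNB(alpha=1.0).fit(X_counts, y_counts)

# Manual Laplace-smoothed estimate for the spam class (label 1)
spam_counts = X_counts[y_counts == 1].sum(axis=0)  # word totals: [2, 1, 0]
manual = (spam_counts + 1.0) / (spam_counts.sum() + 1.0 * X_counts.shape[1])
print(np.allclose(np.exp(mnb.feature_log_prob_[1]), manual))  # True
```

Without smoothing (alpha=0), a word never seen in a class would get probability zero and veto that class outright for any email containing it.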
# INCREMENTAL LEARNING -- Naive Bayes supports partial_fit natively
partial_gnb = GaussianNB()
X_arr, y_arr = cancer.data, cancer.target
# Update model with new data batches (streaming scenario)
for batch_start in range(0, len(X_arr), 50):
    partial_gnb.partial_fit(
        X_arr[batch_start:batch_start + 50],
        y_arr[batch_start:batch_start + 50],
        classes=np.unique(y_arr),
    )
print(f"\nStreaming GaussianNB accuracy: {partial_gnb.score(X_arr, y_arr):.4f}")
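The same streaming pattern works for the text variant too: MultinomialNB's partial_fit simply accumulates word counts, so batch updates end up identical to a single full fit. A small sketch on synthetic counts (invented here) demonstrating this:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)
X_stream = rng.integers(0, 5, size=(200, 10))  # toy count features
y_stream = rng.integers(0, 2, size=200)

full = MultinomialNB().fit(X_stream, y_stream)

streamed = MultinomialNB()
for start in range(0, len(X_stream), 50):      # four batches of 50
    streamed.partial_fit(X_stream[start:start + 50],
                         y_stream[start:start + 50],
                         classes=np.array([0, 1]))

# Count accumulation is exact, so the streamed model matches the full fit
print(np.allclose(full.feature_count_, streamed.feature_count_))  # True
```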
Tip
Practice Naive Bayes in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Classification = predict categories. Regression = predict numbers. Some algorithms do both.
Practice Task — (1) Write a working example of Naive Bayes Probabilistic Classification from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with Naive Bayes is skipping edge-case testing — empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
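One hedged way to harden a text pipeline against such edge cases (the helper name and checks here are illustrative, not part of sklearn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def safe_predict(pipeline, emails):
    """Validate inputs before predicting -- illustrative guard."""
    if not emails:
        raise ValueError("No emails provided")
    for i, email in enumerate(emails):
        if not isinstance(email, str):
            raise TypeError(f"Email at index {i} is not a string: {type(email).__name__}")
        if not email.strip():
            raise ValueError(f"Email at index {i} is empty or whitespace-only")
    return pipeline.predict(emails)

# Tiny toy pipeline so the guard can be exercised end to end
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(["free money now", "meeting at noon"], [1, 0])
print(safe_predict(pipe, ["win a free prize"]))  # valid call
```

Empty strings are worth catching explicitly: a vectorizer will happily produce an all-zero row for them, and the classifier will then return whatever the class priors favor.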