K-Nearest Neighbors (KNN)
KNN is the laziest algorithm in ML: it simply stores the training data and predicts by finding the K most similar training examples (the nearest neighbors) and voting on their labels. There is no explicit training step. The key parameter is K: too small (K=1) overfits badly, too large underfits. Feature scaling is mandatory, because distance is meaningless when income has 10,000x the scale of age.
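To make the voting mechanics concrete, here is a minimal from-scratch sketch (plain NumPy, with a hypothetical knn_predict helper; it assumes numeric, already-scaled features and majority voting):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances ("training" was just storing the data)
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0]), k=3))  # expected: 1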
KNN — Distance, K Selection, Feature Scaling Impact
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# EFFECT OF K ON PERFORMANCE
print("Selecting optimal K value:")
k_values = [1, 3, 5, 7, 10, 15, 20, 30, 50]
cv_scores = []
train_scores = []
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv = cross_val_score(knn, X_train_sc, y_train, cv=5).mean()
    knn.fit(X_train_sc, y_train)
    train = knn.score(X_train_sc, y_train)
    cv_scores.append(cv)
    train_scores.append(train)
    print(f" K={k:2d}: train={train:.4f} | CV={cv:.4f}")
# Plot K selection curve
plt.figure(figsize=(9, 4))
plt.plot(k_values, train_scores, "bo-", label="Train accuracy", linewidth=2)
plt.plot(k_values, cv_scores, "ro-", label="CV accuracy (5-fold)", linewidth=2)
plt.xlabel("K (number of neighbors)")
plt.ylabel("Accuracy")
plt.title("KNN: Choosing K -- sweet spot between over and underfitting")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("knn_k_selection.png", dpi=100, bbox_inches="tight")
plt.show()
optimal_k = k_values[np.argmax(cv_scores)]
print(f"\nOptimal K: {optimal_k}")
# CRITICAL: SCALING IMPACT ON KNN
print("\nScaling impact on KNN accuracy (K=5):")
for scaled, name in [(False, "Unscaled"), (True, "StandardScaler")]:
    knn = KNeighborsClassifier(n_neighbors=5)
    if scaled:
        pipe = Pipeline([("scaler", StandardScaler()), ("knn", knn)])
        cv = cross_val_score(pipe, X_train, y_train, cv=5).mean()
    else:
        cv = cross_val_score(knn, X_train, y_train, cv=5).mean()
    print(f" {name:15s}: CV accuracy = {cv:.4f}")
# DISTANCE METRICS
print("\nDistance metric comparison (K=5, scaled):")
for metric in ["euclidean", "manhattan", "minkowski", "cosine"]:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=5, metric=metric)),
    ])
    cv = cross_val_score(pipe, X_train, y_train, cv=5).mean()
    print(f" {metric:12s}: {cv:.4f}")
# REGRESSION WITH KNN
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
knn_reg = Pipeline([("scaler", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=10))])
score = cross_val_score(knn_reg, housing.data, housing.target, cv=5, scoring="r2").mean()
print(f"\nKNN Regression R2 on California Housing: {score:.4f}")Tip
Tip
Practice K-Nearest Neighbors (KNN) in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
80% of ML work is data preparation: garbage in, garbage out.
Practice Task
(1) Write a working example of K-Nearest Neighbors (KNN) from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with K-Nearest Neighbors (KNN) is skipping edge-case testing: empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ML code.
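A minimal sketch of that kind of boundary-condition check, using a hypothetical validate_features helper and assuming a NumPy feature matrix:

import numpy as np

def validate_features(X):
    """Reject inputs that would silently break distance computations."""
    X = np.asarray(X, dtype=float)  # raises ValueError on non-numeric types
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError(f"expected a non-empty 2-D feature matrix, got shape {X.shape}")
    if np.isnan(X).any():
        raise ValueError("feature matrix contains NaN; impute or drop before KNN")
    return X

# Usage: call before fit/predict so bad inputs fail loudly, not silently
validate_features([[1.0, 2.0], [3.0, 4.0]])  # OK
# validate_features([])                      # raises: empty input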