Data Preprocessing
Prepare and clean data for machine learning models. Preprocessing is a foundational part of everyday machine learning work. The explanations below are written to be beginner-friendly; take your time with each section and practice the examples.
What is Data Preprocessing?
Data preprocessing is a crucial step in machine learning that involves cleaning, transforming, and preparing raw data for analysis and modeling. Many models reject missing values or are sensitive to the scale of their input features, so getting this step right can mean the difference between code that works reliably and code that breaks in production. The following sections cover two common tasks, handling missing data and feature scaling, with practical examples you can try immediately.
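To make the three stages concrete, here is a minimal sketch on a tiny, made-up table. The column names and values are hypothetical and chosen only for illustration; real datasets usually need steps tailored to their own quirks.

```python
import pandas as pd

# Hypothetical raw data: a duplicated row, numbers stored as text, a missing value
raw = pd.DataFrame({
    "age": ["25", "31", "31", None, "47"],
    "city": ["Paris", "Lima", "Lima", "Oslo", "Paris"],
})

# Cleaning: drop exact duplicate rows
clean = raw.drop_duplicates().reset_index(drop=True)

# Transforming: convert the text column to numbers (the missing entry becomes NaN)
clean["age"] = pd.to_numeric(clean["age"])

# Preparing: fill the remaining gap with the column median
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```

The median is used here only as one reasonable default; the mean or a domain-specific value may suit other data better.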
Handling Missing Data
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# Create sample data with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [1.1, np.nan, 3.3, 4.4, 5.5],
'C': ['a', 'b', 'c', np.nan, 'e']
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull().sum())
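# Fill missing values
# Numerical columns: fill with the column mean (numeric_only skips the text column 'C')
df_filled = df.fillna(df.mean(numeric_only=True))
# Categorical columns: fill what is still missing with the most frequent value
df_filled = df_filled.fillna(df.mode().iloc[0])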
# Using SimpleImputer (the 'mean' strategy only works on numeric columns)
numeric_cols = ['A', 'B']
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
# Sample data
X = np.random.randn(100, 3)
y = np.random.randint(0, 2, 100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardization (Z-score normalization)
# Fit on the training set only, then reuse the fitted scaler on the test set to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Min-Max scaling (rescales each feature to the [0, 1] range by default)
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
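To see what the two scalers compute, the sketch below reproduces both transformations by hand with NumPy and compares the results with scikit-learn. The synthetic data and the variable names are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Synthetic training data (illustrative): 80 rows, 3 features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))

# Standardization: subtract the per-column mean, divide by the per-column std
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_standard = (X_train - mean) / std

# Min-max scaling: map each column's [min, max] range onto [0, 1]
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual_minmax = (X_train - col_min) / (col_max - col_min)

# Compare against scikit-learn's implementations
sk_standard = StandardScaler().fit_transform(X_train)
sk_minmax = MinMaxScaler().fit_transform(X_train)
standard_matches = np.allclose(manual_standard, sk_standard)
minmax_matches = np.allclose(manual_minmax, sk_minmax)
print(standard_matches, minmax_matches)
```

In practice the scaler objects are still preferable to the manual version, because they store the training statistics and can apply exactly the same transformation to test data later.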