Data Preprocessing
Prepare and clean data for machine learning models
55 min • By Priygop Team • Last updated: Feb 2026
What is Data Preprocessing?
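Data preprocessing is the step in a machine learning workflow that cleans, transforms, and prepares raw data for analysis and modeling. Many models cannot handle missing values directly, and many are sensitive to the scale of their input features, so this step usually comes before training.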
Handling Missing Data
Example
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# Create sample data with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [1.1, np.nan, 3.3, 4.4, 5.5],
'C': ['a', 'b', 'c', np.nan, 'e']
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull().sum())
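# Fill missing values in numerical columns with the column mean
# (numeric_only=True skips the string column 'C')
df_filled = df.fillna(df.mean(numeric_only=True))
# Then fill the remaining gaps (categorical column) with the most frequent value
df_filled = df_filled.fillna(df.mode().iloc[0])
# Using SimpleImputer (the 'mean' strategy only works on numerical columns)
num_cols = ['A', 'B']
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed[num_cols] = imputer.fit_transform(df[num_cols])
Feature Scaling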
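Feature scaling puts features on comparable ranges. Standardization subtracts each column's mean and divides by its standard deviation, while min-max scaling maps each column onto the range 0 to 1. The following is a minimal sketch of both formulas in plain NumPy, checked against scikit-learn's scalers; the toy matrix X_demo and the variable names are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical toy feature matrix: 4 samples, 2 features on different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [4.0, 400.0]])

# Standardization: z = (x - mean) / std, computed per column
# (np.std defaults to the population standard deviation, as StandardScaler does)
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max scaling: (x - min) / (max - min), computed per column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
mm_manual = (X_demo - col_min) / (col_max - col_min)

# Compare the manual results with scikit-learn's scalers
z_sklearn = StandardScaler().fit_transform(X_demo)
mm_sklearn = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(z_manual, z_sklearn))
print(np.allclose(mm_manual, mm_sklearn))
```

Standardization is a common choice when features have no natural bounds, while min-max scaling suits cases where a fixed range is wanted; which one works better depends on the model and the data.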
Example
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
# Sample data
X = np.random.randn(100, 3)
y = np.random.randint(0, 2, 100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardization (Z-score normalization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Min-Max scaling
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
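Imputation and scaling can also be chained so that both are fitted on the training split only, following the same fit_transform/transform pattern used in the example above. The following is a minimal sketch, assuming scikit-learn's Pipeline class; the generated data and the names X_demo, preprocess, X_tr_ready, and X_te_ready are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical data with a few values replaced by NaN
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 3))
missing_rows = rng.integers(0, 100, size=10)
missing_cols = rng.integers(0, 3, size=10)
X_demo[missing_rows, missing_cols] = np.nan
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 1: fill missing values with the column mean
# Step 2: standardize each column
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# fit_transform learns the means and standard deviations from the training split
X_tr_ready = preprocess.fit_transform(X_tr)
# transform reuses those training statistics on the test split
X_te_ready = preprocess.transform(X_te)

print(np.isnan(X_tr_ready).sum(), np.isnan(X_te_ready).sum())
print(X_tr_ready.shape, X_te_ready.shape)
```

Fitting only on the training split keeps information from the test set out of the learned statistics, which is the reason the scalers above call fit_transform on X_train and transform on X_test.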