
Data Preprocessing

Prepare and clean data for machine learning models. This foundational skill is part of everyday AI/ML work. The explanations below are beginner-friendly; take your time with each section and practice the examples.

55 min•By Priygop Team•Last updated: Feb 2026

What is Data Preprocessing?

Data preprocessing is a crucial step in machine learning that involves cleaning, transforming, and preparing raw data for analysis and modeling. Getting it right can mean the difference between a model that works reliably and one that breaks in production. The following sections cover two common tasks, handling missing data and feature scaling, with practical examples you can try immediately.

Handling Missing Data

Example

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create sample data with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [1.1, np.nan, 3.3, 4.4, 5.5],
    'C': ['a', 'b', 'c', np.nan, 'e']
}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull().sum())

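# Fill missing values
# Numeric columns: fill with the column mean (numeric_only skips the text column 'C')
df_filled = df.fillna(df.mean(numeric_only=True))
# Remaining (categorical) columns: fill with the most frequent value
df_filled = df_filled.fillna(df.mode().iloc[0])

# Using SimpleImputer (the 'mean' strategy only works on numeric columns)
num_cols = df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed[num_cols] = imputer.fit_transform(df[num_cols])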
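In a train/test workflow, the imputation statistics are normally learned from the training rows only and then reused on the test rows, the same fit/transform pattern used for scaling. The following is a minimal sketch under that assumption; the toy arrays `X_train` and `X_test` are hypothetical and not part of the lesson's data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays: rows are samples, columns are numeric features
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [4.0, 40.0]])
X_test = np.array([[np.nan, 20.0], [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)  # learns column means from the training rows
X_test_imputed = imputer.transform(X_test)        # reuses those training means on the test rows

print("Learned column means:", imputer.statistics_)
print(X_test_imputed)
```

The missing test values are filled with the training means, so no information from the test set leaks into the preprocessing step.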

Feature Scaling

Example

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Sample data
X = np.random.randn(100, 3)
y = np.random.randint(0, 2, 100)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardization (Z-score normalization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Min-Max scaling
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
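To make the two scalers less of a black box, the sketch below recomputes both transformations by hand and compares them with scikit-learn's output. Standardization is (x - mean) / std per column, and Min-Max scaling is (x - min) / (max - min) per column. The random sample data and variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative random data: 80 samples, 3 features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))

# Standardization by hand: subtract the column mean, divide by the column std
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses
z_manual = (X_train - mean) / std
z_sklearn = StandardScaler().fit_transform(X_train)

# Min-Max scaling by hand: map each column onto the range [0, 1]
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
mm_manual = (X_train - col_min) / (col_max - col_min)
mm_sklearn = MinMaxScaler().fit_transform(X_train)

print("Standardization matches:", np.allclose(z_manual, z_sklearn))
print("Min-Max scaling matches:", np.allclose(mm_manual, mm_sklearn))
```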

Try It Yourself — Data Preprocessing


# Data Preprocessing Demo
# Run this in your Python environment with numpy installed

import numpy as np

# Simulating a dataset with missing values
data = np.array([
    [25, 50000, 1],
    [30, 60000, 0],
    [np.nan, 55000, 1],
    [35, np.nan, 0],
    [28, 48000, 1]
])

print("Original data:")
print(data)
print()

# Handle missing values with mean imputation
col_means = np.nanmean(data, axis=0)
missing = np.isnan(data)
data[missing] = np.take(col_means, np.where(missing)[1])

print("After mean imputation:")
print(data)
print()

# Feature scaling (Min-Max normalization)
for i in range(data.shape[1]):
    col = data[:, i]
    data[:, i] = (col - col.min()) / (col.max() - col.min())

print("After Min-Max scaling (0 to 1):")
print(np.round(data, 3))
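One caveat with the manual Min-Max loop: if a column holds the same value in every row, max - min is zero and the division produces NaN. Below is a minimal sketch of one way to guard against that. The helper name `min_max_scale` and the choice to map constant columns to 0 are assumptions made for illustration, not a fixed convention.

```python
import numpy as np

def min_max_scale(data):
    """Scale each column to [0, 1]; constant columns map to 0 (an assumed convention)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Replace zero ranges with 1 so constant columns do not trigger division by zero
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / safe_range

# Hypothetical sample: the second column is constant
sample = np.array([[25.0, 1.0], [30.0, 1.0], [35.0, 1.0]])
scaled = min_max_scale(sample)
print(scaled)
```

The function returns a new array rather than modifying its input, so the original data stays available for comparison.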

