Pandas GroupBy, Merge & Reshape

GroupBy and merge are the SQL of pandas — essential for feature engineering from multiple tables. GroupBy aggregates data by category (average purchase per customer, fraud rate per merchant). Merge joins DataFrames on common keys (like SQL JOIN). Pivot tables and melt reshape data between wide and long formats.

20 min•By Priygop Team•Updated 2026

GroupBy, Merge, and Reshape

import pandas as pd
import numpy as np

np.random.seed(42)

# SIMULATE: transactions and customer lookup tables
transactions = pd.DataFrame({
    "transaction_id": range(1, 21),
    "customer_id":    np.random.choice([101, 102, 103, 104, 105], 20),
    "amount":         np.random.exponential(100, 20).round(2),
    "category":       np.random.choice(["grocery", "online", "restaurant", "travel"], 20),
    "is_fraud":       np.random.choice([0, 1], 20, p=[0.9, 0.1]),
})

customers = pd.DataFrame({
    "customer_id":  [101, 102, 103, 104, 105],
    "age":          [28, 45, 31, 52, 38],
    "city":         ["New York", "Chicago", "LA", "Houston", "Phoenix"],
    "account_type": ["premium", "standard", "standard", "premium", "standard"],
})

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# GROUPBY -- aggregate statistics
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
print("Per-customer transaction statistics:")
customer_stats = transactions.groupby("customer_id").agg(
    total_spent   = ("amount", "sum"),
    avg_txn       = ("amount", "mean"),
    num_txns      = ("amount", "count"),
    fraud_count   = ("is_fraud", "sum"),
    fraud_rate    = ("is_fraud", "mean"),
).round(2)
print(customer_stats)

print("\nFraud rate by category:")
print(transactions.groupby("category")["is_fraud"].agg(["mean", "count"]).round(3))

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# MERGE -- join tables
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Inner join: only matching customer_ids
full_df = transactions.merge(customers, on="customer_id", how="inner")
print(f"\nTransactions + customer info: {full_df.shape}")
print(full_df[["customer_id", "amount", "category", "city", "account_type"]].head())

# Merge aggregated stats back to customers
customer_with_stats = customers.merge(customer_stats.reset_index(), on="customer_id", how="left")
print("\nCustomers with transaction features:")
print(customer_with_stats)

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# PIVOT TABLE
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
pivot = transactions.pivot_table(
    values="amount",
    index="customer_id",
    columns="category",
    aggfunc="sum",
    fill_value=0,
)
print("\nSpend per customer per category (pivot):")
print(pivot.round(2))

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# APPLY -- custom functions per row/group
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
def risk_score(row: pd.Series) -> str:
    if row["fraud_rate"] > 0.15:
        return "high"
    elif row["fraud_rate"] > 0.05:
        return "medium"
    return "low"

customer_stats["risk_tier"] = customer_stats.apply(risk_score, axis=1)
print("\nCustomer risk tiers:")
print(customer_stats[["fraud_rate", "risk_tier"]])

Tip

Practice Pandas GroupBy Merge Reshape in small, isolated examples before integrating into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.

Diagram

Loading diagram…

θ = θ - α × ∇L(θ). Too high α = diverge. Too low = slow.

Practice Task

Note

Practice Task — (1) Write a working example of Pandas GroupBy Merge Reshape from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.

Quick Quiz

Common Mistake

Warning

A common mistake with Pandas GroupBy Merge Reshape is skipping edge case testing — empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready ml code.

Topics in This Module