Learn essential Python libraries for data science and machine learning. Master NumPy, Pandas, Matplotlib, and data preprocessing techniques.
Master the fundamental package for scientific computing in Python
Content by: Nirav Khanpara
AI/ML Engineer
NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.
import numpy as np
# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 4)) # 3x4 array of zeros
arr3 = np.ones((2, 3)) # 2x3 array of ones
arr4 = np.arange(0, 10, 2) # Values from 0 up to (but not including) 10, step 2 -> [0 2 4 6 8]
arr5 = np.linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1
# Random arrays
random_arr = np.random.rand(3, 3) # 3x3 array of uniform values in [0, 1)
normal_arr = np.random.normal(0, 1, 100) # 100 normally distributed values (mean 0, std 1)
# Basic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # [5 7 9]
print(a * b) # [4 10 18]
print(a ** 2) # [1 4 9]
# Statistical operations
data = np.array([1, 2, 3, 4, 5])
print(np.mean(data)) # 3.0
print(np.std(data)) # 1.414...
print(np.median(data)) # 3.0
print(np.max(data)) # 5
print(np.min(data)) # 1
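The examples above stay one-dimensional, but the lesson description highlights multi-dimensional arrays and matrices. A minimal sketch of reshaping, indexing, and broadcasting (the shapes and values are illustrative):
# Reshaping and multi-dimensional arrays
m = np.arange(12).reshape(3, 4) # 3x4 matrix of values 0..11
print(m.shape) # (3, 4)
print(m[1, 2]) # row 1, column 2 -> 6
print(m[:, 0]) # first column -> [0 4 8]
# Broadcasting: the 1D row is added to every row of m
row = np.array([10, 20, 30, 40])
print(m + row)
# Axis-aware statistics
print(m.sum(axis=0)) # column sums -> [12 15 18 21]
print(m.sum(axis=1)) # row sums -> [ 6 22 38]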
Learn powerful data manipulation and analysis with Pandas
Pandas is a powerful data manipulation and analysis library for Python. It provides data structures for efficiently storing and manipulating large datasets, with tools for reading and writing data in various formats.
import pandas as pd
import numpy as np
# Create DataFrame from dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['NYC', 'LA', 'Chicago', 'Boston'],
'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
# Create DataFrame from list of lists
data_list = [
['Alice', 25, 'NYC', 50000],
['Bob', 30, 'LA', 60000],
['Charlie', 35, 'Chicago', 70000]
]
df2 = pd.DataFrame(data_list, columns=['Name', 'Age', 'City', 'Salary'])
print(df.head())
# Select columns
print(df['Name'])
print(df[['Name', 'Age']])
# Filter data
young_people = df[df['Age'] < 30]
high_salary = df[df['Salary'] > 60000]
# Multiple conditions
filtered = df[(df['Age'] > 25) & (df['Salary'] > 55000)]
# Sort data
sorted_df = df.sort_values('Age', ascending=False)
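The lesson description also mentions reading and writing data in various formats, and grouping is another everyday Pandas operation; neither appears above. A minimal sketch reusing the df defined earlier (the file name employees.csv is illustrative):
# Group and aggregate
print(df.groupby('City')['Salary'].mean()) # average salary per city
print(df['Age'].describe()) # summary statistics
# Read and write CSV (file name is illustrative)
df.to_csv('employees.csv', index=False)
df_loaded = pd.read_csv('employees.csv')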
Create beautiful and informative data visualizations
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It provides a MATLAB-like plotting interface.
import matplotlib.pyplot as plt
import numpy as np
# Create data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', linewidth=2, label='sin(x)')
plt.plot(x, np.cos(x), 'r--', linewidth=2, label='cos(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Scatter plot
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(x, y, alpha=0.6)
plt.title('Scatter Plot')
# Bar plot
plt.subplot(1, 3, 2)
categories = ['A', 'B', 'C', 'D']
values = [4, 3, 2, 1]
plt.bar(categories, values)
plt.title('Bar Plot')
# Histogram
plt.subplot(1, 3, 3)
data = np.random.normal(0, 1, 1000)
plt.hist(data, bins=30, alpha=0.7)
plt.title('Histogram')
plt.tight_layout()
plt.show()
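The plots above use the pyplot state-machine interface. Matplotlib also offers an object-oriented API, which many find clearer for multi-panel figures; a minimal sketch of the same kind of line plot (the output file name is illustrative):
# Object-oriented interface: explicit Figure and Axes objects
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, np.sin(x), 'b-', label='sin(x)')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Object-Oriented API')
ax.legend()
fig.savefig('sine.png', dpi=150) # save to file; name is illustrative
plt.show()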
Prepare and clean data for machine learning models
Data preprocessing is a crucial step in machine learning that involves cleaning, transforming, and preparing raw data for analysis and modeling.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# Create sample data with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [1.1, np.nan, 3.3, 4.4, 5.5],
'C': ['a', 'b', 'c', np.nan, 'e']
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull().sum())
# Fill missing values
df_filled = df.fillna(df.mean(numeric_only=True)) # numerical columns -> column mean
df_filled = df_filled.fillna(df.mode().iloc[0]) # categorical columns -> most frequent value
# Using SimpleImputer (the 'mean' strategy applies only to numerical columns)
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed[['A', 'B']] = imputer.fit_transform(df[['A', 'B']])
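When only a few rows are affected, dropping missing values can be simpler than imputing them. A short sketch on the same df:
# Drop rows or columns containing missing values
print(df.dropna()) # keeps only complete rows (here, rows 0 and 4)
print(df.dropna(axis=1)) # drops any column with a NaN (here, all three)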
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
# Sample data
X = np.random.randn(100, 3)
y = np.random.randint(0, 2, 100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardization (Z-score normalization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on training data only
X_test_scaled = scaler.transform(X_test) # reuse training statistics; never fit on the test set
# Min-Max scaling
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
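Preprocessing also usually involves encoding categorical features as numbers, which the examples above don't cover. A minimal sketch with made-up data:
from sklearn.preprocessing import LabelEncoder
# Sample categorical data (illustrative)
colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
# One-hot encoding: one binary column per category
print(pd.get_dummies(colors, columns=['color']))
# Label encoding: one integer per category
le = LabelEncoder()
print(le.fit_transform(colors['color'])) # [2 1 0 1] (classes sorted alphabetically)
print(le.classes_) # ['blue' 'green' 'red']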