Intermediate

Pandas

Master the Pandas library for data manipulation — from creating DataFrames to grouping, merging, pivot tables, and method chaining.

Series and DataFrame

Python

import pandas as pd

# Series - 1D labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# DataFrame - 2D labeled table
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [70000, 80000, 90000]
})

Loading Data

Python

# Read from files
df = pd.read_csv("data.csv")
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
df = pd.read_json("data.json")

# Quick inspection
print(df.head())        # First 5 rows
print(df.info())        # Column types, non-null counts
print(df.describe())    # Statistical summary
print(df.shape)         # (rows, columns)
print(df.columns)       # Column names

Indexing: loc and iloc

Python

# loc - label-based indexing
df.loc[0, "name"]                # Single value
df.loc[0:2, ["name", "age"]]    # Rows 0-2, specific columns
df.loc[df["age"] > 25]          # Boolean filter

# iloc - integer position-based indexing
df.iloc[0, 1]                   # Row 0, column 1
df.iloc[0:2, 0:2]              # First 2 rows, first 2 cols
df.iloc[-1]                     # Last row

Filtering and Querying

Python

# Boolean filtering
high_salary = df[df["salary"] > 75000]
adults = df[(df["age"] >= 25) & (df["age"] <= 35)]

# query() method (more readable)
result = df.query("age > 25 and salary > 75000")

# isin() for membership
selected = df[df["name"].isin(["Alice", "Bob"])]

Sorting

Python

df.sort_values("salary", ascending=False)
df.sort_values(["age", "salary"], ascending=[True, False])
df.sort_index()  # Sort by index

GroupBy and Aggregation

Python

# Group and aggregate
df.groupby("department")["salary"].mean()
df.groupby("department").agg({
    "salary": ["mean", "max", "min"],
    "age": "mean"
})

# Value counts
df["department"].value_counts()

Merge, Join, Concat

Python

# Merge (SQL-style join)
merged = pd.merge(df1, df2, on="id", how="left")
merged = pd.merge(df1, df2, left_on="emp_id", right_on="id")

# Concatenate (stack)
combined = pd.concat([df1, df2], axis=0)   # Stack rows
combined = pd.concat([df1, df2], axis=1)   # Stack columns

Pivot Tables

Python

pivot = df.pivot_table(
    values="salary",
    index="department",
    columns="role",
    aggfunc="mean",
    fill_value=0
)

Method Chaining

Python

result = (
    df
    .query("age > 25")
    .assign(bonus=lambda x: x["salary"] * 0.1)
    .sort_values("salary", ascending=False)
    .reset_index(drop=True)
)

✅

Method chaining makes your data pipeline readable and declarative. Each step transforms the data and passes it to the next step, similar to piping in Unix or dplyr in R.

← PreviousNumPy Next →Data Cleaning