Beginner

Pandas for Data Interviews

Pandas is the single most tested library in data science and ML engineering interviews. This lesson covers what interviewers expect, how pandas maps to SQL, and the performance patterns that separate strong candidates from the rest.

Why Pandas Dominates Data Interviews

Every data science and ML engineering role requires manipulating tabular data. While SQL handles database queries, pandas is the standard for in-memory data transformation, feature engineering, and exploratory analysis. Interviewers test pandas because it reveals how you think about data.

📊

Feature Engineering

Building ML features from raw data requires groupby, window functions, pivoting, and merging — all core pandas operations tested in interviews.

⚡

Data Cleaning

Real datasets have missing values, duplicates, and inconsistent formats. Interviewers test your ability to handle messy data fluently with pandas.

💡

Exploratory Analysis

Quick data profiling, distribution analysis, and correlation checks are daily tasks. Doing them efficiently in pandas shows practical experience.

What Companies Actually Test

Pandas interview questions vary by company type. Here is what to expect:

Company Type	Difficulty	Focus Areas	Format
Google / Meta DS	Medium-Hard	GroupBy, window functions, merge, time series	Colab notebook
Amazon ML	Medium	Selection, groupby, pivot, data cleaning	Whiteboard or Colab
Fintech / Quant	Hard	Time series, rolling windows, performance	Live coding
Startups	Easy-Medium	Selection, filtering, basic groupby	Take-home or live
Consulting	Easy-Medium	Pivot tables, merge, basic analysis	Case study + code

Pandas vs SQL: The Mental Map

Many interview questions ask you to solve a problem in both SQL and pandas. Here is the translation table you need to internalize:

# Pandas vs SQL - Quick Reference
# ================================

# SELECT columns        → df[['col1', 'col2']] or df.loc[:, ['col1', 'col2']]
# WHERE condition       → df[df['col'] > 5] or df.query('col > 5')
# GROUP BY + AGG        → df.groupby('col').agg({'val': 'sum'})
# ORDER BY              → df.sort_values('col', ascending=False)
# JOIN                  → pd.merge(df1, df2, on='key', how='inner')
# DISTINCT              → df['col'].unique() or df.drop_duplicates()
# HAVING                → grouped.filter(lambda x: len(x) > 5)
# CASE WHEN             → np.where(condition, val_if_true, val_if_false)
# WINDOW FUNCTIONS      → df.groupby('col')['val'].transform('sum')
# UNION ALL             → pd.concat([df1, df2], ignore_index=True)
# LIMIT                 → df.head(n)
# COALESCE              → df['col'].fillna(df['backup_col'])

Performance Patterns That Matter

Interviewers at senior levels test whether you know the performance implications of your pandas code. Here are the key patterns:

1. Vectorized Operations vs Loops

import pandas as pd
import numpy as np

df = pd.DataFrame({'value': range(100_000)})

# BAD: Iterating row by row - O(n) with huge constant
result = []
for idx, row in df.iterrows():
    result.append(row['value'] * 2)
df['doubled'] = result

# GOOD: Vectorized operation - uses NumPy under the hood
df['doubled'] = df['value'] * 2

# GOOD: Use .apply() only when vectorization is impossible
# (e.g., complex string parsing, external API calls)
df['label'] = df['value'].apply(lambda x: 'high' if x > 50000 else 'low')

2. Memory-Efficient Data Types

# Check memory usage
print(df.memory_usage(deep=True))

# Downcast numeric types to save memory
df['id'] = df['id'].astype('int32')           # 8 bytes → 4 bytes per value
df['price'] = df['price'].astype('float32')    # 8 bytes → 4 bytes per value

# Use categorical for low-cardinality string columns
df['status'] = df['status'].astype('category') # Huge savings for repeated strings

# Read only needed columns from CSV
df = pd.read_csv('large_file.csv', usecols=['id', 'name', 'value'])

3. Avoiding SettingWithCopyWarning

# BAD: Chained indexing creates ambiguous copies
df[df['status'] == 'active']['value'] = 100  # SettingWithCopyWarning!

# GOOD: Use .loc for assignment
df.loc[df['status'] == 'active', 'value'] = 100

# GOOD: Use .copy() when you intentionally want a copy
active_df = df[df['status'] == 'active'].copy()
active_df['value'] = 100

💡

Interview tip: When an interviewer asks you to solve a data problem, always clarify: (1) the size of the dataset, (2) whether memory matters, and (3) whether the solution needs to be SQL-translatable. These questions show you think like a production engineer, not just a notebook coder.

The 4-Step Approach for Pandas Challenges

Use this framework for every pandas problem in this course and in real interviews:

Step	Time	What to Do
1. Understand the Data	2 min	Look at df.head(), df.dtypes, df.shape. Identify the grain (one row = one what?). Ask about nulls.
2. Plan the Approach	2 min	State which pandas operations you will use. Compare alternatives (groupby vs pivot_table, merge vs join).
3. Write the Code	10 min	Write clean, chained pandas code. Use meaningful variable names. Prefer vectorized operations.
4. Verify	3 min	Check output shape, spot-check values, handle edge cases (empty groups, NaN in join keys).

Course Overview

Each lesson in this course follows a consistent structure:

Dataset setup — A realistic DataFrame you can paste into any notebook and run immediately
Problem statement — Clear description of the expected output, modeled after real interview questions
Solution with explanation — Complete pandas code with line-by-line commentary
SQL equivalent — The corresponding SQL query for comparison (where applicable)
Performance notes — When the approach matters for large datasets

📝

How to use this course: Type out every solution yourself in a Jupyter notebook or Python REPL. Modify the datasets and re-solve. After finishing a lesson, try solving each challenge again without looking at the answer. Repetition builds the fluency that makes interviews feel natural.

Quick Self-Assessment

Before starting, try this problem without looking at the solution:

📝

Problem: Given a DataFrame of employee records with columns name, department, and salary, find the top earner in each department. Return their name, department, and salary.

import pandas as pd

# Setup
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'department': ['Engineering', 'Engineering', 'Sales', 'Sales', 'Marketing', 'Marketing'],
    'salary': [120000, 115000, 95000, 98000, 88000, 92000]
})

# Approach 1: idxmax (clean and fast)
idx = employees.groupby('department')['salary'].idxmax()
result = employees.loc[idx, ['name', 'department', 'salary']]

# Approach 2: rank + filter
employees['rank'] = employees.groupby('department')['salary'].rank(
    method='dense', ascending=False
)
result = employees[employees['rank'] == 1][['name', 'department', 'salary']]

# Approach 3: merge with max
max_salaries = employees.groupby('department')['salary'].max().reset_index()
result = employees.merge(max_salaries, on=['department', 'salary'])

print(result)
#       name   department  salary
# 0    Alice  Engineering  120000
# 3    Diana        Sales   98000
# 5    Frank    Marketing   92000

Why this matters for ML: Finding top-K items per group is a fundamental pattern in recommendation systems, anomaly detection, and model evaluation. The groupby + idxmax pattern is used daily for tasks like finding the best model per hyperparameter configuration or the most important feature per category.

Next → Data Selection & Filtering