Pandas & Python Coding
10 hands-on coding challenges covering data manipulation, groupby, merge, pivot tables, time series, and visualization. Each challenge includes a complete solution with explanation.
Challenge 1: Clean and deduplicate a user dataset
Task: Given a DataFrame with duplicate user records, missing emails, and inconsistent name formatting, clean the data.
import pandas as pd
# Remove exact duplicates
df = df.drop_duplicates()
# Remove duplicates by user_id, keeping the most recent record
df = df.sort_values('created_at', ascending=False)
df = df.drop_duplicates(subset='user_id', keep='first')
# Standardize name formatting
df['name'] = df['name'].str.strip().str.title()
# Handle missing emails
df['email'] = df['email'].fillna('unknown')
# Or drop rows with missing emails:
# df = df.dropna(subset=['email'])
# Validate email format
df['valid_email'] = df['email'].str.contains(
r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
regex=True, na=False
)
print(f"Cleaned: {len(df)} records, {df['valid_email'].sum()} valid emails")
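The keep argument encodes the business rule about which record survives deduplication. A toy sketch (hypothetical data) contrasting the options:

```python
import pandas as pd

# Hypothetical toy data: user 1 appears twice with different timestamps.
toy = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'created_at': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-02-01']),
})

# keep='first' after sorting newest-first retains the most recent record
newest = (toy.sort_values('created_at', ascending=False)
             .drop_duplicates(subset='user_id', keep='first'))

# keep=False drops every user_id that appears more than once
no_dupes_at_all = toy.drop_duplicates(subset='user_id', keep=False)
```

With keep='first' user 1 survives with the 2024-03-01 record; with keep=False user 1 disappears entirely, which is the right rule when duplicates indicate bad data rather than updates.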
Explanation: Data cleaning is usually the first task in any data science workflow. Key pandas methods: drop_duplicates() with subset for targeted deduplication, str.strip().str.title() for string normalization, fillna() and dropna() for missing data, and str.contains() with regex for validation. In interviews, always ask about the business rules for deduplication (which record to keep?) and missing data (impute, drop, or flag?).
Challenge 2: GroupBy aggregation with multiple statistics
Task: Calculate total revenue, average order value, and order count per customer segment per month.
import pandas as pd
# Create month column
df['month'] = df['order_date'].dt.to_period('M')
# Multi-aggregation with named output
summary = df.groupby(['segment', 'month']).agg(
total_revenue=('revenue', 'sum'),
avg_order_value=('revenue', 'mean'),
order_count=('order_id', 'nunique'),
unique_customers=('customer_id', 'nunique')
).reset_index()
# Add revenue per customer metric
summary['revenue_per_customer'] = (
summary['total_revenue'] / summary['unique_customers']
)
# Sort for readability
summary = summary.sort_values(['segment', 'month'])
print(summary)
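The count() vs nunique() distinction is easy to see on a toy frame (hypothetical data):

```python
import pandas as pd

# Hypothetical orders: customer 'a' placed order 101 twice (two line items),
# and one of their rows has a null order_id.
orders = pd.DataFrame({
    'customer': ['a', 'a', 'a', 'b'],
    'order_id': [101, 101, None, 102],
})

per_customer = orders.groupby('customer')['order_id'].agg(['count', 'nunique'])
# count   counts non-null values   (customer 'a': 2)
# nunique counts distinct values   (customer 'a': 1)
```

Using count() here would double-count multi-line orders; nunique() gives the true order count.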
Explanation: The agg() method with named aggregation (keyword = (column, function) syntax) is the cleanest way to compute multiple statistics. nunique counts distinct values. Adding derived columns after aggregation (revenue per customer) is a common pattern. In interviews, demonstrate you know the difference between count() (non-null values) and nunique() (distinct values).
Challenge 3: Merge two DataFrames with different join types
Task: Combine user and order data to find users with no orders and orders with no matching user.
import pandas as pd
# Inner join: only users who have orders
active_users = pd.merge(users, orders, on='user_id', how='inner')
# Left join: all users, with order info where available
all_users = pd.merge(users, orders, on='user_id', how='left')
# Users with no orders (anti-join pattern)
no_orders = all_users[all_users['order_id'].isna()]
print(f"Users with no orders: {len(no_orders)}")
# Outer join: find orphaned orders (no matching user)
full = pd.merge(users, orders, on='user_id', how='outer', indicator=True)
orphaned_orders = full[full['_merge'] == 'right_only']
print(f"Orphaned orders: {len(orphaned_orders)}")
# Multiple key merge
merged = pd.merge(
df1, df2,
left_on=['user_id', 'date'],
right_on=['uid', 'event_date'],
how='left'
)
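To catch accidental row explosion at merge time rather than after the fact, pandas merge accepts a validate argument. A small sketch (hypothetical frames):

```python
import pandas as pd

# Hypothetical frames: users should be unique per user_id, orders are many per user.
users = pd.DataFrame({'user_id': [1, 2], 'name': ['Ann', 'Bo']})
orders = pd.DataFrame({'user_id': [1, 1, 3], 'order_id': [10, 11, 12]})

# validate='one_to_many' raises MergeError if user_id is duplicated in `users`,
# turning a silent many-to-many explosion into a loud failure.
merged = pd.merge(users, orders, on='user_id', how='left', validate='one_to_many')
```

Here the merge passes validation: user 1 fans out to two order rows, user 2 keeps one row with a NaN order_id, and orphaned order 12 is dropped by the left join.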
Explanation: The indicator=True parameter adds a _merge column showing which table each row came from ('left_only', 'right_only', 'both'). The how parameter is the pandas equivalent of SQL's LEFT/RIGHT/FULL OUTER JOIN. The anti-join pattern (left join + filter for NaN) is extremely common for finding "users who did NOT do X." Always check for unexpected row duplication after merges: many-to-many joins can silently explode your row count.
Challenge 4: Create a pivot table for cohort analysis
Task: Build a user retention cohort table showing the percentage of users retained each month after signup.
import pandas as pd
# Assign each user to their signup cohort (month)
df['signup_cohort'] = df['signup_date'].dt.to_period('M')
df['activity_month'] = df['activity_date'].dt.to_period('M')
# Calculate months since signup
df['months_since_signup'] = (
(df['activity_month'] - df['signup_cohort']).apply(lambda x: x.n)
)
# Count unique users per cohort and month offset
cohort_data = df.groupby(['signup_cohort', 'months_since_signup']).agg(
active_users=('user_id', 'nunique')
).reset_index()
# Get cohort sizes (month 0)
cohort_sizes = cohort_data[cohort_data['months_since_signup'] == 0][
['signup_cohort', 'active_users']
].rename(columns={'active_users': 'cohort_size'})
# Merge and calculate retention percentage
cohort_data = cohort_data.merge(cohort_sizes, on='signup_cohort')
cohort_data['retention_pct'] = (
100 * cohort_data['active_users'] / cohort_data['cohort_size']
).round(1)
# Pivot to retention table
retention_table = cohort_data.pivot_table(
index='signup_cohort',
columns='months_since_signup',
values='retention_pct'
)
print(retention_table)
Explanation: Cohort analysis is one of the most important analyses a data scientist performs. The key steps: (1) assign cohorts based on signup month, (2) calculate the time offset from signup, (3) count active users per cohort per offset, (4) divide by cohort size to get retention percentages, (5) pivot into a readable table. The resulting triangle-shaped table shows how each cohort retains over time. In interviews, discuss what the retention curve shape tells you (steep early drop = onboarding problem, gradual decay = normal, flattening = found product-market fit).
Challenge 5: Time series resampling and rolling statistics
Task: Resample daily event data to weekly, compute 4-week rolling average, and identify anomalies.
import pandas as pd
import numpy as np
# Ensure datetime index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()
# Resample to weekly totals
weekly = df['events'].resample('W').sum()
# 4-week rolling average and standard deviation
weekly_stats = pd.DataFrame({
'events': weekly,
'rolling_avg': weekly.rolling(window=4).mean(),
'rolling_std': weekly.rolling(window=4).std()
})
# Flag anomalies: more than 2 standard deviations from rolling average
weekly_stats['is_anomaly'] = (
abs(weekly_stats['events'] - weekly_stats['rolling_avg'])
> 2 * weekly_stats['rolling_std']
)
# Percentage change week-over-week
weekly_stats['wow_change'] = weekly_stats['events'].pct_change() * 100
anomalies = weekly_stats[weekly_stats['is_anomaly']]
print(f"Found {len(anomalies)} anomalous weeks")
print(anomalies)
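resample() only regroups rows into coarser bins, so aggregated totals must be preserved. A quick sanity check on a hypothetical daily series:

```python
import pandas as pd
import numpy as np

# Hypothetical daily series: 28 days of event counts starting on a Monday.
idx = pd.date_range('2024-01-01', periods=28, freq='D')
daily = pd.Series(np.arange(28), index=idx, name='events')

weekly = daily.resample('W').sum()

# Resampling regroups rows without inventing or losing data,
# so the grand total is unchanged.
assert weekly.sum() == daily.sum()
```

This kind of conservation check is a cheap guard against off-by-one bin boundaries (note that 'W' anchors weeks on Sunday; use aliases like 'W-MON' to change the anchor).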
Explanation: Time series operations are critical in data science. resample() handles temporal aggregation (like SQL's DATE_TRUNC + GROUP BY). rolling() computes moving window statistics. The anomaly detection uses a simple z-score approach (events more than 2 standard deviations from the rolling mean). In a real setting, you might use more sophisticated methods (STL decomposition, Prophet) but this simple approach works well in interviews and demonstrates the core concepts.
Challenge 6: Apply custom functions with transform and apply
Task: Normalize revenue within each customer segment and flag the top 10% of customers by spending.
import pandas as pd
# Z-score normalization within each segment using transform
df['revenue_zscore'] = df.groupby('segment')['revenue'].transform(
lambda x: (x - x.mean()) / x.std()
)
# Percentile rank within segment
df['revenue_percentile'] = df.groupby('segment')['revenue'].transform(
lambda x: x.rank(pct=True)
)
# Flag top 10% spenders per segment
df['is_top_spender'] = df['revenue_percentile'] >= 0.90
# Custom aggregation with apply (returns DataFrame)
def segment_summary(group):
return pd.Series({
'total_revenue': group['revenue'].sum(),
'median_revenue': group['revenue'].median(),
'top_spender_revenue': group.loc[
group['revenue_percentile'] >= 0.90, 'revenue'
].sum(),
'top_spender_pct': (
group.loc[group['revenue_percentile'] >= 0.90, 'revenue'].sum()
/ group['revenue'].sum() * 100
)
})
segment_report = df.groupby('segment').apply(segment_summary).reset_index()
print(segment_report)
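The shape difference between transform and agg is the core of the transform-vs-apply question; a minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data: two segments with different revenue scales.
toy = pd.DataFrame({
    'segment': ['a', 'a', 'b', 'b'],
    'revenue': [10.0, 30.0, 100.0, 300.0],
})

# transform: output is aligned to the original rows (same length, same index),
# so it can be assigned straight back as a new column.
seg_mean = toy.groupby('segment')['revenue'].transform('mean')

# agg: output is one row per group.
per_group = toy.groupby('segment')['revenue'].agg('mean')

assert len(seg_mean) == len(toy)                    # row-aligned
assert len(per_group) == toy['segment'].nunique()   # group-aligned
```

transform broadcasts each group statistic back to every member row, which is exactly what z-score normalization within a group needs.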
Explanation: transform() returns a Series with the same index as the input, making it perfect for adding normalized columns. apply() is more flexible and can return any shape. The key difference: transform operates element-wise and must return the same size; apply can return aggregated results. rank(pct=True) gives percentile ranks directly. In interviews, knowing when to use transform vs apply vs agg demonstrates pandas fluency.
Challenge 7: Handle missing data strategically
Task: Impute missing values using different strategies based on column type and missingness pattern.
import pandas as pd
import numpy as np
# Analyze missingness pattern
missing_report = pd.DataFrame({
'missing_count': df.isnull().sum(),
'missing_pct': (df.isnull().sum() / len(df) * 100).round(2),
'dtype': df.dtypes
})
print(missing_report[missing_report['missing_count'] > 0])
# Strategy 1: Forward-fill for time series data
df['stock_price'] = df['stock_price'].ffill()  # fillna(method='ffill') is deprecated
# Strategy 2: Group-specific median for numerical columns
df['income'] = df.groupby('region')['income'].transform(
lambda x: x.fillna(x.median())
)
# Strategy 3: Mode for categorical columns
df['category'] = df['category'].fillna(df['category'].mode()[0])
# Strategy 4: Flag and impute (preserve missingness information)
df['age_was_missing'] = df['age'].isnull().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
# Strategy 5: Drop rows only if key columns are missing
df = df.dropna(subset=['user_id', 'event_date'])
# Verify no remaining nulls in critical columns
assert df[['user_id', 'event_date']].isnull().sum().sum() == 0
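The variance-shrinking effect of naive mean imputation can be demonstrated directly (hypothetical column with ~30% missing):

```python
import pandas as pd
import numpy as np

# Hypothetical numeric column with three missing values.
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0, 7.0, np.nan, 9.0, 10.0])

filled = s.fillna(s.mean())

# Mean imputation leaves the mean unchanged but shrinks the variance,
# because every imputed point sits exactly at the center of the distribution.
assert abs(filled.mean() - s.mean()) < 1e-9
assert filled.std() < s.std()
```

This is why a missing-indicator column plus median/mean fill is often safer than a bare fill: the model can still learn from the missingness itself.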
Explanation: There is no one-size-fits-all approach to missing data. The strategy depends on: (1) the type of missingness (MCAR, MAR, MNAR), (2) the percentage missing (if >50%, the column may not be useful), (3) the column type (numerical vs categorical), and (4) downstream usage. Creating a missing indicator column preserves information about missingness, which can be predictive. In interviews, discuss why you chose each strategy and what could go wrong with naive imputation (e.g., filling with mean can distort distributions and underestimate variance).
Challenge 8: Efficient string operations on large DataFrames
Task: Parse structured text data, extract features, and categorize using string operations.
import pandas as pd
import numpy as np  # needed for np.where below
# Extract domain from email addresses
df['email_domain'] = df['email'].str.split('@').str[1].str.lower()
# Extract components from a structured string (e.g., "US-CA-94105")
df[['country', 'state', 'zipcode']] = df['location'].str.split('-', expand=True)
# Categorize using str.contains with regex
df['device_type'] = np.where(
df['user_agent'].str.contains('Mobile|Android|iPhone', case=False, na=False),
'mobile',
np.where(
df['user_agent'].str.contains('Tablet|iPad', case=False, na=False),
'tablet', 'desktop'
)
)
# Extract numbers from text
df['price'] = df['price_text'].str.extract(r'(\d+\.?\d*)').astype(float)
# Vectorized string operations are much faster than apply
# BAD: df['name_len'] = df['name'].apply(lambda x: len(x))
# GOOD:
df['name_len'] = df['name'].str.len()
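The two approaches are interchangeable in output, which is easy to verify on a small hypothetical Series; the difference is that the str accessor avoids a Python-level function call per element:

```python
import pandas as pd

# Hypothetical names column; both approaches produce identical results.
names = pd.Series(['Ann', 'Bo', 'Carmen'])

via_apply = names.apply(len)   # Python-level loop, one call per element
via_str = names.str.len()      # vectorized accessor

assert via_apply.equals(via_str)
```

On a Series of a few rows the difference is invisible; on millions of rows the vectorized version is typically an order of magnitude faster.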
Explanation: The pandas str accessor provides vectorized string operations that are much faster than apply() with lambdas. Key methods: str.split() with expand=True for splitting into columns, str.contains() for pattern matching, str.extract() for regex capture groups. Always use vectorized operations over apply when possible: for a million rows, vectorized operations can be 10-100x faster. In interviews, mentioning performance considerations shows production-level awareness.
Challenge 9: Build a funnel analysis
Task: Calculate conversion rates through a multi-step user funnel (visit, signup, activation, purchase).
import pandas as pd
# Events DataFrame: user_id, event_name, event_date
funnel_steps = ['page_visit', 'signup', 'activation', 'purchase']
# Count unique users at each step
funnel = pd.DataFrame({
'step': funnel_steps,
'users': [
df[df['event_name'] == step]['user_id'].nunique()
for step in funnel_steps
]
})
# Calculate conversion rates
funnel['conversion_from_prev'] = (
funnel['users'] / funnel['users'].shift(1) * 100
).round(1)
funnel['conversion_from_top'] = (
funnel['users'] / funnel['users'].iloc[0] * 100
).round(1)
funnel['dropoff_pct'] = (100 - funnel['conversion_from_prev']).round(1)
print(funnel)
# Segmented funnel (e.g., by traffic source)
def compute_funnel(group):
steps = {}
for step in funnel_steps:
steps[step] = group[group['event_name'] == step]['user_id'].nunique()
return pd.Series(steps)
segmented = df.groupby('source').apply(compute_funnel).reset_index()
# Convert counts to percentages relative to first step
for step in funnel_steps[1:]:
segmented[f'{step}_pct'] = (
segmented[step] / segmented[funnel_steps[0]] * 100
).round(1)
print(segmented)
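A strict-order funnel can be sketched by comparing each user's first timestamp per step; this is a minimal illustration on hypothetical events (the funnel above counts users regardless of order):

```python
import pandas as pd

# Hypothetical events: user 2 signed up BEFORE their first page visit,
# so a strict-order funnel should not count their signup.
events = pd.DataFrame({
    'user_id':    [1, 1, 2, 2],
    'event_name': ['page_visit', 'signup', 'signup', 'page_visit'],
    'event_date': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-03']),
})

# First time each user hit each step, one column per step.
firsts = (events.groupby(['user_id', 'event_name'])['event_date']
                .min().unstack())

# Strict order: signup counts only if it happened on/after the first visit.
strict_signups = firsts[
    firsts['page_visit'].notna()
    & firsts['signup'].notna()
    & (firsts['signup'] >= firsts['page_visit'])
]
```

A flexible funnel would count both users at the signup step; the strict version counts only user 1. Extending the comparison chain step by step generalizes this to longer funnels.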
Explanation: Funnel analysis identifies where users drop off in a multi-step process. Two conversion rates matter: step-over-step (what % moved from the previous step) and top-of-funnel (what % of all visitors reached this step). Segmenting by source, device, or cohort reveals where specific user groups struggle. In interviews, discuss: (1) whether the funnel should be strict-order (user must do steps in sequence) or flexible, (2) time windows (should the purchase happen within 7 days of signup?), and (3) how to identify the biggest improvement opportunity (the step with the largest absolute drop-off, not necessarily the largest percentage drop-off).
Challenge 10: Create a visualization-ready summary
Task: Prepare data for a dashboard showing KPIs, trends, and distributions using matplotlib.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Revenue trend with rolling average
daily_rev = df.groupby('date')['revenue'].sum()
axes[0, 0].plot(daily_rev.index, daily_rev.values, alpha=0.3, label='Daily')
axes[0, 0].plot(daily_rev.rolling(7).mean(), color='red', label='7-day avg')
axes[0, 0].set_title('Daily Revenue Trend')
axes[0, 0].legend()
# 2. Revenue distribution (log scale for skewed data)
axes[0, 1].hist(df['revenue'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Revenue Distribution')
axes[0, 1].set_xlabel('Revenue ($)')
axes[0, 1].set_yscale('log')
# 3. Conversion by segment (bar chart)
conv_by_segment = df.groupby('segment').agg(
conversion_rate=('converted', 'mean')
).sort_values('conversion_rate', ascending=True)
axes[1, 0].barh(conv_by_segment.index, conv_by_segment['conversion_rate'] * 100)
axes[1, 0].set_title('Conversion Rate by Segment')
axes[1, 0].set_xlabel('Conversion Rate (%)')
# 4. Correlation heatmap for key metrics
metrics = ['revenue', 'sessions', 'page_views', 'time_on_site']
corr = df[metrics].corr()
im = axes[1, 1].imshow(corr, cmap='RdBu_r', vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(metrics)))
axes[1, 1].set_yticks(range(len(metrics)))
axes[1, 1].set_xticklabels(metrics, rotation=45, ha='right')
axes[1, 1].set_yticklabels(metrics)
plt.colorbar(im, ax=axes[1, 1])
axes[1, 1].set_title('Metric Correlations')
plt.tight_layout()
plt.savefig('dashboard.png', dpi=150, bbox_inches='tight')
plt.show()
Explanation: Data visualization in interviews tests whether you can choose the right chart for the right data: line charts for trends over time, histograms for distributions (use log scale for skewed data), bar charts for categorical comparisons, and heatmaps for correlations. Always label axes, add titles, and consider the audience. Using tight_layout() prevents overlapping elements. In take-home assignments, polished visualizations with clear takeaways significantly improve your evaluation.
Lilly Tech Systems