Intermediate
Data Selection & Filtering
Selection and filtering are the most fundamental pandas operations. Every data manipulation task starts with selecting the right rows and columns. These 6 challenges cover the patterns you will encounter in real interviews.
Why this matters for ML: Feature selection, data splitting, outlier removal, and train/test filtering all rely on efficient data selection. In production ML pipelines, poorly written filters on large DataFrames cause memory blowups and slow training runs.
Shared Dataset for This Lesson
import pandas as pd
import numpy as np
# E-commerce orders dataset
orders = pd.DataFrame({
'order_id': range(1001, 1013),
'customer': ['Alice', 'Bob', 'Charlie', 'Alice', 'Diana', 'Bob',
'Eve', 'Charlie', 'Alice', 'Frank', 'Diana', 'Eve'],
'product': ['Laptop', 'Phone', 'Tablet', 'Phone', 'Laptop', 'Tablet',
'Phone', 'Laptop', 'Tablet', 'Phone', 'Tablet', 'Laptop'],
'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics',
'Electronics', 'Electronics', 'Electronics', 'Electronics',
'Electronics', 'Electronics', 'Electronics', 'Electronics'],
'price': [999.99, 699.99, 449.99, 699.99, 1299.99, 449.99,
599.99, 1099.99, 349.99, 799.99, 549.99, 1199.99],
'quantity': [1, 2, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1],
'status': ['completed', 'completed', 'pending', 'completed', 'shipped',
'completed', 'cancelled', 'shipped', 'completed', 'pending',
'completed', 'shipped'],
'region': ['West', 'East', 'West', 'West', 'South', 'East',
'North', 'West', 'West', 'South', 'East', 'North']
})
Challenge 1: loc vs iloc Selection
Problem: (a) Select the first 5 orders showing only
customer, product, and price. (b) Select the order at index position 3 using both label-based and position-based indexing. (c) Select every other row starting from the second row, showing only order_id and status.Solution
# (a) First 5 orders, specific columns
# Using loc (label-based) - includes both endpoints
result_a = orders.loc[0:4, ['customer', 'product', 'price']]
# Using iloc (position-based) - excludes end
result_a_alt = orders.iloc[:5, [1, 2, 4]] # columns by position
print(result_a)
# customer product price
# 0 Alice Laptop 999.99
# 1 Bob Phone 699.99
# 2 Charlie Tablet 449.99
# 3 Alice Phone 699.99
# 4 Diana Laptop 1299.99
# (b) Single row at index position 3
row_loc = orders.loc[3] # Returns Series (label-based)
row_iloc = orders.iloc[3] # Returns Series (position-based)
# When default RangeIndex, loc[3] and iloc[3] give the same result.
# They differ when the index is non-integer (e.g., after set_index).
# (c) Every other row starting from row index 1
result_c = orders.iloc[1::2][['order_id', 'status']]
print(result_c)
# order_id status
# 1 1002 completed
# 3 1004 completed
# 5 1006 completed
# 7 1008 shipped
# 9 1010 pending
# 11 1012 shipped
Key difference:
loc uses labels (inclusive on both ends). iloc uses integer positions (exclusive on the end, like Python slicing). When the index is a default RangeIndex they appear identical, but they diverge after operations like set_index(), sort_values(), or filtering.Challenge 2: Boolean Indexing
Problem: (a) Find all completed orders with a price above $600. (b) Find all orders where the customer name starts with a vowel. (c) Find orders where quantity is greater than the mean quantity.
Solution
# (a) Completed orders above $600
mask = (orders['status'] == 'completed') & (orders['price'] > 600)
expensive_completed = orders[mask]
print(expensive_completed[['customer', 'product', 'price', 'status']])
# customer product price status
# 0 Alice Laptop 999.99 completed
# 1 Bob Phone 699.99 completed
# 3 Alice Phone 699.99 completed
# NOTE: Use & (bitwise AND) not 'and' for pandas boolean masks
# Each condition must be wrapped in parentheses
# (b) Customer name starts with a vowel (A, E, I, O, U)
vowel_mask = orders['customer'].str[0].isin(['A', 'E', 'I', 'O', 'U'])
vowel_customers = orders[vowel_mask]
print(vowel_customers[['customer', 'product']])
# customer product
# 0 Alice Laptop
# 3 Alice Phone
# 6 Eve Phone
# 8 Alice Tablet
# 11 Eve Laptop
# (c) Orders with quantity above the mean
mean_qty = orders['quantity'].mean()
above_avg = orders[orders['quantity'] > mean_qty]
print(f"Mean quantity: {mean_qty:.2f}")
print(above_avg[['customer', 'product', 'quantity']])
# Mean quantity: 1.42
# customer product quantity
# 1 Bob Phone 2
# 5 Bob Tablet 3
# 6 Eve Phone 2
# 8 Alice Tablet 2
Challenge 3: The query() Method
Problem: Rewrite the following filters using the
query() method: (a) Orders from the West region with price > $500. (b) Orders where status is not 'cancelled'. (c) Orders where the customer is Alice or Bob and quantity ≥ 2. Use a variable for the customer list.Solution
# (a) West region, price > 500
# Boolean indexing version:
result_bool = orders[(orders['region'] == 'West') & (orders['price'] > 500)]
# query() version - cleaner syntax, reads like SQL
result_query = orders.query("region == 'West' and price > 500")
print(result_query[['customer', 'product', 'price', 'region']])
# customer product price region
# 0 Alice Laptop 999.99 West
# 3 Alice Phone 699.99 West
# 7 Charlie Laptop 1099.99 West
# (b) Status is not cancelled
result_b = orders.query("status != 'cancelled'")
print(f"Rows excluding cancelled: {len(result_b)}") # 11
# (c) Using a variable inside query with @
target_customers = ['Alice', 'Bob']
result_c = orders.query("customer in @target_customers and quantity >= 2")
print(result_c[['customer', 'product', 'quantity']])
# customer product quantity
# 1 Bob Phone 2
# 5 Bob Tablet 3
# 8 Alice Tablet 2
When to use query(): Use
query() for readability when you have multiple conditions. It avoids the parentheses and ampersand clutter of boolean indexing. Use @variable to reference Python variables inside the query string. For performance on large DataFrames, query() can be faster because it uses numexpr under the hood.Challenge 4: Multi-Condition Filtering
Problem: Write a function that filters orders based on dynamic criteria: any combination of status, minimum price, maximum price, and region. All parameters are optional. Return matching orders sorted by price descending.
Solution
def filter_orders(df, status=None, min_price=None, max_price=None, region=None):
"""Filter orders with dynamic, optional criteria.
This pattern is common in production: building query conditions
dynamically based on user input or API parameters.
"""
mask = pd.Series(True, index=df.index) # Start with all True
if status is not None:
mask &= df['status'] == status
if min_price is not None:
mask &= df['price'] >= min_price
if max_price is not None:
mask &= df['price'] <= max_price
if region is not None:
mask &= df['region'] == region
return df[mask].sort_values('price', ascending=False)
# Test case 1: Completed orders between $400 and $800
result1 = filter_orders(orders, status='completed', min_price=400, max_price=800)
print(result1[['customer', 'product', 'price', 'status']])
# customer product price status
# 1 Bob Phone 699.99 completed
# 3 Alice Phone 699.99 completed
# 5 Bob Tablet 449.99 completed
# Test case 2: All orders from West region
result2 = filter_orders(orders, region='West')
print(result2[['customer', 'product', 'price', 'region']])
# customer product price region
# 7 Charlie Laptop 1099.99 West
# 0 Alice Laptop 999.99 West
# 3 Alice Phone 699.99 West
# 2 Charlie Tablet 449.99 West
# 8 Alice Tablet 349.99 West
# Test case 3: No filters - return all sorted by price
result3 = filter_orders(orders)
print(f"All orders count: {len(result3)}") # 12
Challenge 5: isin() for Set Membership
Problem: (a) Find all orders for products in ['Laptop', 'Phone']. (b) Find all orders whose status is NOT in ['cancelled', 'pending']. (c) Given a list of VIP customer IDs, filter to only their orders and compute total spend per VIP.
Solution
# (a) Products in a specific set
target_products = ['Laptop', 'Phone']
laptops_phones = orders[orders['product'].isin(target_products)]
print(laptops_phones[['customer', 'product', 'price']])
# customer product price
# 0 Alice Laptop 999.99
# 1 Bob Phone 699.99
# 3 Alice Phone 699.99
# 4 Diana Laptop 1299.99
# 6 Eve Phone 599.99
# 7 Charlie Laptop 1099.99
# 9 Frank Phone 799.99
# 11 Eve Laptop 1199.99
# (b) Exclude specific statuses using ~isin()
active_orders = orders[~orders['status'].isin(['cancelled', 'pending'])]
print(f"Active orders: {len(active_orders)}") # 9
print(active_orders['status'].value_counts())
# completed 6
# shipped 3
# (c) VIP customer analysis
vip_customers = ['Alice', 'Bob', 'Diana']
vip_orders = orders[orders['customer'].isin(vip_customers)]
# Total spend = price * quantity
vip_orders = vip_orders.copy()
vip_orders['total'] = vip_orders['price'] * vip_orders['quantity']
vip_spend = vip_orders.groupby('customer')['total'].sum().sort_values(ascending=False)
print(vip_spend)
# customer
# Bob 2749.94
# Alice 2749.96
# Diana 1849.98
# Name: total, dtype: float64
Challenge 6: between() for Range Filtering
Problem: (a) Find orders with prices between $500 and $1000 (inclusive). (b) Create price tiers: 'Budget' (<$500), 'Mid' ($500-$999), 'Premium' (≥$1000). Count orders per tier. (c) Find customers who have placed orders in at least 2 different price tiers.
Solution
# (a) Price between $500 and $1000
mid_range = orders[orders['price'].between(500, 1000)]
print(mid_range[['customer', 'product', 'price']])
# customer product price
# 0 Alice Laptop 999.99
# 1 Bob Phone 699.99
# 3 Alice Phone 699.99
# 6 Eve Phone 599.99
# 9 Frank Phone 799.99
# 10 Diana Tablet 549.99
# (b) Create price tiers using pd.cut
orders['price_tier'] = pd.cut(
orders['price'],
bins=[0, 500, 1000, float('inf')],
labels=['Budget', 'Mid', 'Premium'],
right=False # [0, 500), [500, 1000), [1000, inf)
)
print(orders['price_tier'].value_counts().sort_index())
# Budget 3
# Mid 6
# Premium 3
# Alternative: using np.select for more control
import numpy as np
conditions = [
orders['price'] < 500,
orders['price'].between(500, 999.99),
orders['price'] >= 1000
]
choices = ['Budget', 'Mid', 'Premium']
orders['tier_v2'] = np.select(conditions, choices, default='Unknown')
# (c) Customers in at least 2 price tiers
customer_tiers = orders.groupby('customer')['price_tier'].nunique()
multi_tier = customer_tiers[customer_tiers >= 2]
print(multi_tier)
# customer
# Alice 2
# Bob 2
# Charlie 2
# Diana 2
# Eve 2
# Name: price_tier, dtype: int64
# Get the actual tiers per customer
tier_detail = (orders.groupby('customer')['price_tier']
.apply(lambda x: set(x))
.reset_index())
tier_detail.columns = ['customer', 'tiers']
print(tier_detail)
# SQL equivalent:
# SELECT customer, COUNT(DISTINCT price_tier) AS tier_count
# FROM orders
# GROUP BY customer
# HAVING COUNT(DISTINCT price_tier) >= 2
Performance note:
between() is syntactic sugar for (df['col'] >= low) & (df['col'] <= high). For binning continuous values, pd.cut() is vectorized and much faster than applying a custom function row-by-row. For large DataFrames (millions of rows), always prefer pd.cut() or np.select() over .apply().
Lilly Tech Systems