Intermediate

Data Selection & Filtering

Selection and filtering are the most fundamental pandas operations. Every data manipulation task starts with selecting the right rows and columns. These 6 challenges cover the patterns you will encounter in real interviews.

💡
Why this matters for ML: Feature selection, data splitting, outlier removal, and train/test filtering all rely on efficient data selection. In production ML pipelines, poorly written filters on large DataFrames cause memory blowups and slow training runs.

Shared Dataset for This Lesson

import pandas as pd
import numpy as np

# E-commerce orders dataset
orders = pd.DataFrame({
    'order_id': range(1001, 1013),
    'customer': ['Alice', 'Bob', 'Charlie', 'Alice', 'Diana', 'Bob',
                 'Eve', 'Charlie', 'Alice', 'Frank', 'Diana', 'Eve'],
    'product': ['Laptop', 'Phone', 'Tablet', 'Phone', 'Laptop', 'Tablet',
                'Phone', 'Laptop', 'Tablet', 'Phone', 'Tablet', 'Laptop'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics',
                 'Electronics', 'Electronics', 'Electronics', 'Electronics',
                 'Electronics', 'Electronics', 'Electronics', 'Electronics'],
    'price': [999.99, 699.99, 449.99, 699.99, 1299.99, 449.99,
              599.99, 1099.99, 349.99, 799.99, 549.99, 1199.99],
    'quantity': [1, 2, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1],
    'status': ['completed', 'completed', 'pending', 'completed', 'shipped',
               'completed', 'cancelled', 'shipped', 'completed', 'pending',
               'completed', 'shipped'],
    'region': ['West', 'East', 'West', 'West', 'South', 'East',
               'North', 'West', 'West', 'South', 'East', 'North']
})

Challenge 1: loc vs iloc Selection

📝
Problem: (a) Select the first 5 orders showing only customer, product, and price. (b) Select the order at index position 3 using both label-based and position-based indexing. (c) Select every other row starting from the second row, showing only order_id and status.

Solution

# (a) First 5 orders, specific columns
# Using loc (label-based) - includes both endpoints
result_a = orders.loc[0:4, ['customer', 'product', 'price']]

# Using iloc (position-based) - excludes end
result_a_alt = orders.iloc[:5, [1, 2, 4]]  # columns by position

print(result_a)
#   customer product    price
# 0    Alice  Laptop   999.99
# 1      Bob   Phone   699.99
# 2  Charlie  Tablet   449.99
# 3    Alice   Phone   699.99
# 4    Diana  Laptop  1299.99


# (b) Single row at index position 3
row_loc = orders.loc[3]          # Returns Series (label-based)
row_iloc = orders.iloc[3]        # Returns Series (position-based)
# When default RangeIndex, loc[3] and iloc[3] give the same result.
# They differ when the index is non-integer (e.g., after set_index).


# (c) Every other row starting from row index 1
result_c = orders.iloc[1::2][['order_id', 'status']]
print(result_c)
#     order_id      status
# 1       1002   completed
# 3       1004   completed
# 5       1006   completed
# 7       1008     shipped
# 9       1010     pending
# 11      1012     shipped
💡
Key difference: loc uses labels (inclusive on both ends). iloc uses integer positions (exclusive on the end, like Python slicing). When the index is a default RangeIndex they appear identical, but they diverge after operations like set_index(), sort_values(), or filtering.

Challenge 2: Boolean Indexing

📝
Problem: (a) Find all completed orders with a price above $600. (b) Find all orders where the customer name starts with a vowel. (c) Find orders where quantity is greater than the mean quantity.

Solution

# (a) Completed orders above $600
mask = (orders['status'] == 'completed') & (orders['price'] > 600)
expensive_completed = orders[mask]
print(expensive_completed[['customer', 'product', 'price', 'status']])
#   customer product    price     status
# 0    Alice  Laptop   999.99  completed
# 1      Bob   Phone   699.99  completed
# 3    Alice   Phone   699.99  completed

# NOTE: Use & (bitwise AND) not 'and' for pandas boolean masks
# Each condition must be wrapped in parentheses


# (b) Customer name starts with a vowel (A, E, I, O, U)
vowel_mask = orders['customer'].str[0].isin(['A', 'E', 'I', 'O', 'U'])
vowel_customers = orders[vowel_mask]
print(vowel_customers[['customer', 'product']])
#   customer product
# 0    Alice  Laptop
# 3    Alice   Phone
# 6      Eve   Phone
# 8    Alice  Tablet
# 11     Eve  Laptop


# (c) Orders with quantity above the mean
mean_qty = orders['quantity'].mean()
above_avg = orders[orders['quantity'] > mean_qty]
print(f"Mean quantity: {mean_qty:.2f}")
print(above_avg[['customer', 'product', 'quantity']])
# Mean quantity: 1.42
#   customer product  quantity
# 1      Bob   Phone         2
# 5      Bob  Tablet         3
# 6      Eve   Phone         2
# 8    Alice  Tablet         2

Challenge 3: The query() Method

📝
Problem: Rewrite the following filters using the query() method: (a) Orders from the West region with price > $500. (b) Orders where status is not 'cancelled'. (c) Orders where the customer is Alice or Bob and quantity ≥ 2. Use a variable for the customer list.

Solution

# (a) West region, price > 500
# Boolean indexing version:
result_bool = orders[(orders['region'] == 'West') & (orders['price'] > 500)]

# query() version - cleaner syntax, reads like SQL
result_query = orders.query("region == 'West' and price > 500")
print(result_query[['customer', 'product', 'price', 'region']])
#   customer product    price region
# 0    Alice  Laptop   999.99   West
# 3    Alice   Phone   699.99   West
# 7  Charlie  Laptop  1099.99   West


# (b) Status is not cancelled
result_b = orders.query("status != 'cancelled'")
print(f"Rows excluding cancelled: {len(result_b)}")  # 11


# (c) Using a variable inside query with @
target_customers = ['Alice', 'Bob']
result_c = orders.query("customer in @target_customers and quantity >= 2")
print(result_c[['customer', 'product', 'quantity']])
#   customer product  quantity
# 1      Bob   Phone         2
# 5      Bob  Tablet         3
# 8    Alice  Tablet         2
💡
When to use query(): Use query() for readability when you have multiple conditions. It avoids the parentheses and ampersand clutter of boolean indexing. Use @variable to reference Python variables inside the query string. For performance on large DataFrames, query() can be faster because it uses numexpr under the hood.

Challenge 4: Multi-Condition Filtering

📝
Problem: Write a function that filters orders based on dynamic criteria: any combination of status, minimum price, maximum price, and region. All parameters are optional. Return matching orders sorted by price descending.

Solution

def filter_orders(df, status=None, min_price=None, max_price=None, region=None):
    """Filter orders with dynamic, optional criteria.

    This pattern is common in production: building query conditions
    dynamically based on user input or API parameters.
    """
    mask = pd.Series(True, index=df.index)  # Start with all True

    if status is not None:
        mask &= df['status'] == status
    if min_price is not None:
        mask &= df['price'] >= min_price
    if max_price is not None:
        mask &= df['price'] <= max_price
    if region is not None:
        mask &= df['region'] == region

    return df[mask].sort_values('price', ascending=False)


# Test case 1: Completed orders between $400 and $800
result1 = filter_orders(orders, status='completed', min_price=400, max_price=800)
print(result1[['customer', 'product', 'price', 'status']])
#   customer product   price     status
# 1      Bob   Phone  699.99  completed
# 3    Alice   Phone  699.99  completed
# 5      Bob  Tablet  449.99  completed

# Test case 2: All orders from West region
result2 = filter_orders(orders, region='West')
print(result2[['customer', 'product', 'price', 'region']])
#   customer product    price region
# 7  Charlie  Laptop  1099.99   West
# 0    Alice  Laptop   999.99   West
# 3    Alice   Phone   699.99   West
# 2  Charlie  Tablet   449.99   West
# 8    Alice  Tablet   349.99   West

# Test case 3: No filters - return all sorted by price
result3 = filter_orders(orders)
print(f"All orders count: {len(result3)}")  # 12

Challenge 5: isin() for Set Membership

📝
Problem: (a) Find all orders for products in ['Laptop', 'Phone']. (b) Find all orders whose status is NOT in ['cancelled', 'pending']. (c) Given a list of VIP customer IDs, filter to only their orders and compute total spend per VIP.

Solution

# (a) Products in a specific set
target_products = ['Laptop', 'Phone']
laptops_phones = orders[orders['product'].isin(target_products)]
print(laptops_phones[['customer', 'product', 'price']])
#    customer product    price
# 0     Alice  Laptop   999.99
# 1       Bob   Phone   699.99
# 3     Alice   Phone   699.99
# 4     Diana  Laptop  1299.99
# 6       Eve   Phone   599.99
# 7   Charlie  Laptop  1099.99
# 9     Frank   Phone   799.99
# 11      Eve  Laptop  1199.99


# (b) Exclude specific statuses using ~isin()
active_orders = orders[~orders['status'].isin(['cancelled', 'pending'])]
print(f"Active orders: {len(active_orders)}")  # 9
print(active_orders['status'].value_counts())
# completed    6
# shipped      3


# (c) VIP customer analysis
vip_customers = ['Alice', 'Bob', 'Diana']
vip_orders = orders[orders['customer'].isin(vip_customers)]

# Total spend = price * quantity
vip_orders = vip_orders.copy()
vip_orders['total'] = vip_orders['price'] * vip_orders['quantity']
vip_spend = vip_orders.groupby('customer')['total'].sum().sort_values(ascending=False)

print(vip_spend)
# customer
# Bob        2749.94
# Alice      2749.96
# Diana      1849.98
# Name: total, dtype: float64

Challenge 6: between() for Range Filtering

📝
Problem: (a) Find orders with prices between $500 and $1000 (inclusive). (b) Create price tiers: 'Budget' (<$500), 'Mid' ($500-$999), 'Premium' (≥$1000). Count orders per tier. (c) Find customers who have placed orders in at least 2 different price tiers.

Solution

# (a) Price between $500 and $1000
mid_range = orders[orders['price'].between(500, 1000)]
print(mid_range[['customer', 'product', 'price']])
#   customer product   price
# 0    Alice  Laptop  999.99
# 1      Bob   Phone  699.99
# 3    Alice   Phone  699.99
# 6      Eve   Phone  599.99
# 9    Frank   Phone  799.99
# 10   Diana  Tablet  549.99


# (b) Create price tiers using pd.cut
orders['price_tier'] = pd.cut(
    orders['price'],
    bins=[0, 500, 1000, float('inf')],
    labels=['Budget', 'Mid', 'Premium'],
    right=False  # [0, 500), [500, 1000), [1000, inf)
)
print(orders['price_tier'].value_counts().sort_index())
# Budget     3
# Mid        6
# Premium    3

# Alternative: using np.select for more control
import numpy as np
conditions = [
    orders['price'] < 500,
    orders['price'].between(500, 999.99),
    orders['price'] >= 1000
]
choices = ['Budget', 'Mid', 'Premium']
orders['tier_v2'] = np.select(conditions, choices, default='Unknown')


# (c) Customers in at least 2 price tiers
customer_tiers = orders.groupby('customer')['price_tier'].nunique()
multi_tier = customer_tiers[customer_tiers >= 2]
print(multi_tier)
# customer
# Alice      2
# Bob        2
# Charlie    2
# Diana      2
# Eve        2
# Name: price_tier, dtype: int64

# Get the actual tiers per customer
tier_detail = (orders.groupby('customer')['price_tier']
               .apply(lambda x: set(x))
               .reset_index())
tier_detail.columns = ['customer', 'tiers']
print(tier_detail)
# SQL equivalent:
# SELECT customer, COUNT(DISTINCT price_tier) AS tier_count
# FROM orders
# GROUP BY customer
# HAVING COUNT(DISTINCT price_tier) >= 2
💡
Performance note: between() is syntactic sugar for (df['col'] >= low) & (df['col'] <= high). For binning continuous values, pd.cut() is vectorized and much faster than applying a custom function row-by-row. For large DataFrames (millions of rows), always prefer pd.cut() or np.select() over .apply().