Intermediate

Window Functions

Window functions are the single most differentiating SQL skill in data/ML interviews. They let you compute values across related rows without collapsing the result set. These six challenges cover the core window-function patterns you need.

💡
Why this matters for ML: Feature engineering for time-series models relies heavily on window functions: lag features, rolling averages, cumulative sums, and row numbering for deduplication. ML engineers who can write these in SQL skip the pandas bottleneck and compute features at warehouse scale.

Shared Schema for This Lesson

CREATE TABLE daily_sales (
    sale_id INT PRIMARY KEY,
    salesperson VARCHAR(50),
    region VARCHAR(30),
    sale_date DATE,
    amount DECIMAL(10,2),
    product VARCHAR(50)
);

INSERT INTO daily_sales VALUES
(1, 'Alice', 'West', '2024-01-01', 1500, 'Enterprise'),
(2, 'Bob', 'East', '2024-01-01', 1200, 'Pro'),
(3, 'Charlie', 'West', '2024-01-01', 900, 'Starter'),
(4, 'Alice', 'West', '2024-01-02', 2100, 'Enterprise'),
(5, 'Bob', 'East', '2024-01-02', 800, 'Starter'),
(6, 'Charlie', 'West', '2024-01-02', 1800, 'Pro'),
(7, 'Alice', 'West', '2024-01-03', 1700, 'Pro'),
(8, 'Bob', 'East', '2024-01-03', 2200, 'Enterprise'),
(9, 'Charlie', 'West', '2024-01-03', 1100, 'Pro'),
(10, 'Alice', 'West', '2024-01-04', 1900, 'Enterprise'),
(11, 'Bob', 'East', '2024-01-04', 1500, 'Pro'),
(12, 'Charlie', 'West', '2024-01-04', 2000, 'Enterprise'),
(13, 'Alice', 'West', '2024-01-05', 1300, 'Starter'),
(14, 'Bob', 'East', '2024-01-05', 1800, 'Enterprise'),
(15, 'Charlie', 'West', '2024-01-05', 1600, 'Pro');

Challenge 1: ROW_NUMBER vs RANK vs DENSE_RANK

📝
Problem: Rank all salespersons by their total sales amount. Show the difference between ROW_NUMBER, RANK, and DENSE_RANK. Then find the top salesperson overall, accounting for ties.

Solution

-- Step 1: Compute total sales per person, then apply all three ranking functions
WITH totals AS (
    SELECT
        salesperson,
        SUM(amount) AS total_sales
    FROM daily_sales
    GROUP BY salesperson
)
SELECT
    salesperson,
    total_sales,
    ROW_NUMBER() OVER (ORDER BY total_sales DESC) AS row_num,
    RANK() OVER (ORDER BY total_sales DESC) AS rank_val,
    DENSE_RANK() OVER (ORDER BY total_sales DESC) AS dense_rank_val
FROM totals;

-- Result:
-- salesperson | total_sales | row_num | rank_val | dense_rank_val
-- ------------+-------------+---------+----------+---------------
-- Alice       | 8500        | 1       | 1        | 1
-- Bob         | 7500        | 2       | 2        | 2
-- Charlie     | 7400        | 3       | 3        | 3

-- To find top 1 (with ties):
SELECT salesperson, total_sales
FROM (
    SELECT salesperson, SUM(amount) AS total_sales,
           RANK() OVER (ORDER BY SUM(amount) DESC) AS rk
    FROM daily_sales
    GROUP BY salesperson
) ranked
WHERE rk = 1;

-- Result:
-- salesperson | total_sales
-- ------------+------------
-- Alice       | 8500

Key differences: ROW_NUMBER assigns unique numbers even for ties (the tiebreak is arbitrary unless you add more ORDER BY columns). RANK skips numbers after ties (1, 1, 3). DENSE_RANK never skips (1, 1, 2). Our sample data happens to have no ties, which is why all three columns match in the result above. Use ROW_NUMBER when you need exactly N rows. Use RANK or DENSE_RANK when ties matter.
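To actually see the three functions diverge, you need tied values. Here is a quick sketch you can run locally with Python's built-in sqlite3 module (window functions require SQLite 3.25+); the tiny table t and its scores are made up for illustration:

```python
import sqlite3

# Hypothetical mini-table with a tie at the top score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, score INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 100), ("b", 100), ("c", 90)])

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
           RANK()       OVER (ORDER BY score DESC) AS rank_val,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_val
    FROM t
""").fetchall()

# The tied rows 'a' and 'b' get row_num 1 and 2 (in arbitrary order)
# but rank_val 1 and 1; 'c' gets rank_val 3 yet dense_val 2.
for r in rows:
    print(r)
```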

Challenge 2: Top-N Per Group with ROW_NUMBER

📝
Problem: Find the top 2 highest individual sales for each salesperson. Return salesperson, sale_date, amount, and their rank within their own sales.

Solution

SELECT salesperson, sale_date, amount, rn
FROM (
    SELECT
        salesperson,
        sale_date,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY salesperson
            ORDER BY amount DESC
        ) AS rn
    FROM daily_sales
) ranked
WHERE rn <= 2
ORDER BY salesperson, rn;

-- Result:
-- salesperson | sale_date  | amount | rn
-- ------------+------------+--------+----
-- Alice       | 2024-01-02 | 2100   | 1
-- Alice       | 2024-01-04 | 1900   | 2
-- Bob         | 2024-01-03 | 2200   | 1
-- Bob         | 2024-01-05 | 1800   | 2
-- Charlie     | 2024-01-04 | 2000   | 1
-- Charlie     | 2024-01-02 | 1800   | 2

Key insight: PARTITION BY is like GROUP BY for window functions — it defines the "window" within which the function operates. The ORDER BY inside the OVER clause determines the ranking order. This top-N-per-group pattern is among the most commonly asked window function questions.

Challenge 3: LAG and LEAD — Day-over-Day Comparison

📝
Problem: For each salesperson, show their daily sales alongside their previous day's sales and the percentage change. Also show the next day's sales for forecasting context.

Solution

SELECT
    salesperson,
    sale_date,
    amount AS today_sales,
    LAG(amount, 1) OVER (
        PARTITION BY salesperson ORDER BY sale_date
    ) AS prev_day_sales,
    LEAD(amount, 1) OVER (
        PARTITION BY salesperson ORDER BY sale_date
    ) AS next_day_sales,
    ROUND(
        100.0 * (amount - LAG(amount, 1) OVER (PARTITION BY salesperson ORDER BY sale_date))
        / LAG(amount, 1) OVER (PARTITION BY salesperson ORDER BY sale_date), 1
    ) AS pct_change
FROM daily_sales
ORDER BY salesperson, sale_date;

-- Result (Alice's rows):
-- salesperson | sale_date  | today_sales | prev_day | next_day | pct_change
-- ------------+------------+-------------+----------+----------+-----------
-- Alice       | 2024-01-01 | 1500        | NULL     | 2100     | NULL
-- Alice       | 2024-01-02 | 2100        | 1500     | 1700     | 40.0
-- Alice       | 2024-01-03 | 1700        | 2100     | 1900     | -19.0
-- Alice       | 2024-01-04 | 1900        | 1700     | 1300     | 11.8
-- Alice       | 2024-01-05 | 1300        | 1900     | NULL     | -31.6

Key insight: LAG(col, N) looks N rows back; LEAD(col, N) looks N rows ahead. Both return NULL at the partition boundaries. You can supply a default value: LAG(amount, 1, 0) returns 0 instead of NULL for the first row. In ML, lag features are among the most common time-series features.
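The three-argument form is easy to verify with Python's sqlite3 (SQLite 3.25+); the two-row table below is a made-up slice of sales data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (sale_date TEXT, amount INT)")
conn.executemany("INSERT INTO s VALUES (?, ?)",
                 [("2024-01-01", 1500), ("2024-01-02", 2100)])

rows = conn.execute("""
    SELECT sale_date,
           LAG(amount, 1)    OVER (ORDER BY sale_date) AS prev_null,  -- NULL at boundary
           LAG(amount, 1, 0) OVER (ORDER BY sale_date) AS prev_zero   -- 0 at boundary
    FROM s
    ORDER BY sale_date
""").fetchall()

print(rows)  # [('2024-01-01', None, 0), ('2024-01-02', 1500, 1500)]
```

The default-value form saves a COALESCE wrapper when downstream feature code can't handle NULLs.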

Challenge 4: Running Total

📝
Problem: Calculate the running total of sales for each salesperson, ordered by date. Also compute the cumulative percentage of their own total sales.

Solution

SELECT
    salesperson,
    sale_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY salesperson
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    ROUND(
        100.0 * SUM(amount) OVER (
            PARTITION BY salesperson
            ORDER BY sale_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) / SUM(amount) OVER (PARTITION BY salesperson), 1
    ) AS cumulative_pct
FROM daily_sales
ORDER BY salesperson, sale_date;

-- Result (Alice's rows):
-- salesperson | sale_date  | amount | running_total | cumulative_pct
-- ------------+------------+--------+---------------+---------------
-- Alice       | 2024-01-01 | 1500   | 1500          | 17.6
-- Alice       | 2024-01-02 | 2100   | 3600          | 42.4
-- Alice       | 2024-01-03 | 1700   | 5300          | 62.4
-- Alice       | 2024-01-04 | 1900   | 7200          | 84.7
-- Alice       | 2024-01-05 | 1300   | 8500          | 100.0

Key insight: when an OVER clause has an ORDER BY but no frame clause, the default frame is actually RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes all ties ("peers") of the current row. Writing ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW explicitly pins the frame to physical rows and shows interviewers you understand window frames; with our data the two frames agree, because each salesperson has exactly one sale per date. Without ORDER BY, SUM() OVER (PARTITION BY ...) gives the total for the entire partition — we use this to compute the denominator for the cumulative percentage.
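The RANGE-vs-ROWS distinction only shows up with ties in the ORDER BY column. A runnable sketch with Python's sqlite3 (SQLite 3.25+); the table s with a duplicate date is made up to force a tie:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (d TEXT, amt INT)")
conn.executemany("INSERT INTO s VALUES (?, ?)",
                 [("2024-01-01", 10), ("2024-01-01", 20), ("2024-01-02", 30)])

rows = conn.execute("""
    SELECT d, amt,
           SUM(amt) OVER (ORDER BY d) AS range_sum,  -- default RANGE frame: peers included
           SUM(amt) OVER (ORDER BY d
                          ROWS BETWEEN UNBOUNDED PRECEDING
                          AND CURRENT ROW) AS rows_sum
    FROM s
    ORDER BY d, amt
""").fetchall()

# Both 2024-01-01 rows get range_sum = 30 (each row's frame includes its peer),
# while their rows_sum values differ because one row physically precedes the other.
for r in rows:
    print(r)
```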

Challenge 5: Moving Average

📝
Problem: Calculate a 3-day moving average of total daily sales (across all salespersons). Show the daily total, the moving average, and whether each day is above or below the moving average.

Solution

WITH daily_totals AS (
    SELECT
        sale_date,
        SUM(amount) AS daily_total
    FROM daily_sales
    GROUP BY sale_date
)
SELECT
    sale_date,
    daily_total,
    ROUND(AVG(daily_total) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) AS moving_avg_3d,
    CASE
        WHEN daily_total > AVG(daily_total) OVER (
            ORDER BY sale_date
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) THEN 'Above'
        ELSE 'Below'
    END AS vs_moving_avg
FROM daily_totals
ORDER BY sale_date;

-- Result:
-- sale_date   | daily_total | moving_avg_3d | vs_moving_avg
-- ------------+-------------+---------------+--------------
-- 2024-01-01  | 3600        | 3600.00       | Below
-- 2024-01-02  | 4700        | 4150.00       | Above
-- 2024-01-03  | 5000        | 4433.33       | Above
-- 2024-01-04  | 5400        | 5033.33       | Above
-- 2024-01-05  | 4700        | 5033.33       | Below

Key insight: ROWS BETWEEN 2 PRECEDING AND CURRENT ROW creates a window of up to 3 rows. For the first row, only 1 row is in the window; for the second, 2 rows. These are "partial windows" at the boundary — which is also why day 1 equals its own moving average and falls into the ELSE 'Below' branch. If you need exactly 3 days, return NULL for partial windows with CASE WHEN COUNT(*) OVER (...) = 3 THEN avg ELSE NULL END, or filter out the first 2 days in an outer query.
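A sketch of the strict-window variant, runnable with Python's sqlite3 (SQLite 3.25+). The daily_totals values are pre-aggregated by hand from the lesson's data, so the non-NULL averages should match the result table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (sale_date TEXT, daily_total INT)")
conn.executemany("INSERT INTO daily_totals VALUES (?, ?)", [
    ("2024-01-01", 3600), ("2024-01-02", 4700), ("2024-01-03", 5000),
    ("2024-01-04", 5400), ("2024-01-05", 4700)])

rows = conn.execute("""
    SELECT sale_date,
           CASE WHEN COUNT(*) OVER (ORDER BY sale_date
                                    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3
                THEN ROUND(AVG(daily_total) OVER (ORDER BY sale_date
                                    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2)
           END AS strict_avg_3d  -- NULL until a full 3-day window exists
    FROM daily_totals
    ORDER BY sale_date
""").fetchall()

for r in rows:
    print(r)
```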

Challenge 6: NTILE and Percentile Ranking

📝
Problem: Divide all individual sales into quartiles by amount. For each sale, show which quartile it falls into, the percentile rank, and the cumulative distribution value.

Solution

SELECT
    salesperson,
    sale_date,
    amount,
    NTILE(4) OVER (ORDER BY amount) AS quartile,
    ROUND(PERCENT_RANK() OVER (ORDER BY amount), 3) AS pct_rank,
    ROUND(CUME_DIST() OVER (ORDER BY amount), 3) AS cume_dist
FROM daily_sales
ORDER BY amount;

-- Result:
-- salesperson | sale_date  | amount | quartile | pct_rank | cume_dist
-- ------------+------------+--------+----------+----------+----------
-- Bob         | 2024-01-02 | 800    | 1        | 0.000    | 0.067
-- Charlie     | 2024-01-01 | 900    | 1        | 0.071    | 0.133
-- Charlie     | 2024-01-03 | 1100   | 1        | 0.143    | 0.200
-- Bob         | 2024-01-01 | 1200   | 1        | 0.214    | 0.267
-- Alice       | 2024-01-05 | 1300   | 2        | 0.286    | 0.333
-- Alice       | 2024-01-01 | 1500   | 2        | 0.357    | 0.467
-- Bob         | 2024-01-04 | 1500   | 2        | 0.357    | 0.467
-- Charlie     | 2024-01-05 | 1600   | 2        | 0.500    | 0.533
-- Alice       | 2024-01-03 | 1700   | 3        | 0.571    | 0.600
-- Bob         | 2024-01-05 | 1800   | 3        | 0.643    | 0.733
-- Charlie     | 2024-01-02 | 1800   | 3        | 0.643    | 0.733
-- Alice       | 2024-01-04 | 1900   | 3        | 0.786    | 0.800
-- Charlie     | 2024-01-04 | 2000   | 4        | 0.857    | 0.867
-- Alice       | 2024-01-02 | 2100   | 4        | 0.929    | 0.933
-- Bob         | 2024-01-03 | 2200   | 4        | 1.000    | 1.000

Key differences: NTILE(4) divides rows into 4 roughly equal groups. PERCENT_RANK is (rank - 1) / (total_rows - 1), ranging from 0 to 1. CUME_DIST is count of rows <= current / total_rows. These are essential for ML tasks like creating bucketed features and detecting outliers (e.g., flag anything above the 99th percentile).