Window Functions
Window functions are among the most differentiating SQL skills in data/ML interviews. They let you compute values across related rows without collapsing the result set the way GROUP BY does. These six challenges cover the core patterns: ranking, top-N per group, row offsets with LAG/LEAD, running totals, moving averages, and percentile bucketing.
Shared Schema for This Lesson
CREATE TABLE daily_sales (
sale_id INT PRIMARY KEY,
salesperson VARCHAR(50),
region VARCHAR(30),
sale_date DATE,
amount DECIMAL(10,2),
product VARCHAR(50)
);
INSERT INTO daily_sales VALUES
(1, 'Alice', 'West', '2024-01-01', 1500, 'Enterprise'),
(2, 'Bob', 'East', '2024-01-01', 1200, 'Pro'),
(3, 'Charlie', 'West', '2024-01-01', 900, 'Starter'),
(4, 'Alice', 'West', '2024-01-02', 2100, 'Enterprise'),
(5, 'Bob', 'East', '2024-01-02', 800, 'Starter'),
(6, 'Charlie', 'West', '2024-01-02', 1800, 'Pro'),
(7, 'Alice', 'West', '2024-01-03', 1700, 'Pro'),
(8, 'Bob', 'East', '2024-01-03', 2200, 'Enterprise'),
(9, 'Charlie', 'West', '2024-01-03', 1100, 'Pro'),
(10, 'Alice', 'West', '2024-01-04', 1900, 'Enterprise'),
(11, 'Bob', 'East', '2024-01-04', 1500, 'Pro'),
(12, 'Charlie', 'West', '2024-01-04', 2000, 'Enterprise'),
(13, 'Alice', 'West', '2024-01-05', 1300, 'Starter'),
(14, 'Bob', 'East', '2024-01-05', 1800, 'Enterprise'),
(15, 'Charlie', 'West', '2024-01-05', 1600, 'Pro');
Challenge 1: ROW_NUMBER vs RANK vs DENSE_RANK
Rank salespeople by their total sales using ROW_NUMBER, RANK, and DENSE_RANK, then return the top performer in a way that keeps ties.
Solution
-- Step 1: Compute total sales per person, then apply all three ranking functions
WITH totals AS (
SELECT
salesperson,
SUM(amount) AS total_sales
FROM daily_sales
GROUP BY salesperson
)
SELECT
salesperson,
total_sales,
ROW_NUMBER() OVER (ORDER BY total_sales DESC) AS row_num,
RANK() OVER (ORDER BY total_sales DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY total_sales DESC) AS dense_rank_val
FROM totals;
-- Result:
-- salesperson | total_sales | row_num | rank_val | dense_rank_val
-- ------------+-------------+---------+----------+---------------
-- Alice | 8500 | 1 | 1 | 1
-- Bob | 7500 | 2 | 2 | 2
-- Charlie | 7400 | 3 | 3 | 3
-- To find top 1 (with ties):
SELECT salesperson, total_sales
FROM (
SELECT salesperson, SUM(amount) AS total_sales,
RANK() OVER (ORDER BY SUM(amount) DESC) AS rk
FROM daily_sales
GROUP BY salesperson
) ranked
WHERE rk = 1;
-- Result:
-- salesperson | total_sales
-- ------------+------------
-- Alice | 8500
Key differences: ROW_NUMBER assigns unique numbers even for ties (the tiebreak is arbitrary unless you add more ORDER BY columns). RANK skips numbers after ties (1, 1, 3). DENSE_RANK never skips (1, 1, 2). Use ROW_NUMBER when you need exactly N rows; use RANK or DENSE_RANK when ties matter. Note that this dataset has no ties at the total level, which is why all three columns match above.
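To see the three functions diverge, here is a minimal sketch over a hand-built list containing a tie (the numbers are made up; the VALUES-list form is PostgreSQL-style):
SELECT val,
  ROW_NUMBER() OVER (ORDER BY val DESC) AS row_num,    -- 1, 2, 3, 4
  RANK()       OVER (ORDER BY val DESC) AS rank_val,   -- 1, 1, 3, 4
  DENSE_RANK() OVER (ORDER BY val DESC) AS dense_val   -- 1, 1, 2, 3
FROM (VALUES (100), (100), (90), (80)) AS t(val);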
Challenge 2: Top-N Per Group with ROW_NUMBER
For each salesperson, return their two largest sales.
Solution
SELECT salesperson, sale_date, amount, rn
FROM (
SELECT
salesperson,
sale_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY salesperson
ORDER BY amount DESC
) AS rn
FROM daily_sales
) ranked
WHERE rn <= 2
ORDER BY salesperson, rn;
-- Result:
-- salesperson | sale_date | amount | rn
-- ------------+------------+--------+----
-- Alice | 2024-01-02 | 2100 | 1
-- Alice | 2024-01-04 | 1900 | 2
-- Bob | 2024-01-03 | 2200 | 1
-- Bob | 2024-01-05 | 1800 | 2
-- Charlie | 2024-01-04 | 2000 | 1
-- Charlie | 2024-01-02 | 1800 | 2
Key insight: PARTITION BY is like GROUP BY for window functions: it defines the "window" within which the function operates. ORDER BY within the OVER clause determines the ranking order. This top-N-per-group pattern is one of the most commonly asked window function questions; a tie-keeping variant is sketched below.
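If ties should survive the cut (two equal amounts both counting as second place), one option is to swap ROW_NUMBER for DENSE_RANK. A sketch against the same schema:
SELECT salesperson, sale_date, amount, rnk
FROM (
  SELECT
    salesperson,
    sale_date,
    amount,
    DENSE_RANK() OVER (
      PARTITION BY salesperson
      ORDER BY amount DESC
    ) AS rnk
  FROM daily_sales
) ranked
WHERE rnk <= 2
ORDER BY salesperson, rnk;
-- Identical output here (no per-person amount ties), but a tie for second
-- place would now return three or more rows for that salesperson.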
Challenge 3: LAG and LEAD — Day-over-Day Comparison
For each salesperson, show each day's sales next to the previous and next day's sales, plus the day-over-day percentage change.
Solution
SELECT
salesperson,
sale_date,
amount AS today_sales,
LAG(amount, 1) OVER (
PARTITION BY salesperson ORDER BY sale_date
) AS prev_day_sales,
LEAD(amount, 1) OVER (
PARTITION BY salesperson ORDER BY sale_date
) AS next_day_sales,
ROUND(
100.0 * (amount - LAG(amount, 1) OVER (PARTITION BY salesperson ORDER BY sale_date))
/ LAG(amount, 1) OVER (PARTITION BY salesperson ORDER BY sale_date), 1
) AS pct_change
FROM daily_sales
ORDER BY salesperson, sale_date;
-- Result (Alice's rows):
-- salesperson | sale_date | today_sales | prev_day | next_day | pct_change
-- ------------+------------+-------------+----------+----------+-----------
-- Alice | 2024-01-01 | 1500 | NULL | 2100 | NULL
-- Alice | 2024-01-02 | 2100 | 1500 | 1700 | 40.0
-- Alice | 2024-01-03 | 1700 | 2100 | 1900 | -19.0
-- Alice | 2024-01-04 | 1900 | 1700 | 1300 | 11.8
-- Alice | 2024-01-05 | 1300 | 1900 | NULL | -31.6
Key insight: LAG(col, N) looks N rows back; LEAD(col, N) looks N rows ahead. Both return NULL at partition boundaries. You can provide a default value: LAG(amount, 1, 0) returns 0 instead of NULL for the first row. In ML feature engineering, lag features are among the most common time-series features.
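Repeating the same LAG expression three times works but is noisy. One common refactor (a sketch, not the only option) computes the lag once in a CTE; NULLIF additionally guards against division by zero:
WITH with_prev AS (
  SELECT
    salesperson,
    sale_date,
    amount,
    LAG(amount, 1) OVER (
      PARTITION BY salesperson ORDER BY sale_date
    ) AS prev_day_sales
  FROM daily_sales
)
SELECT
  salesperson,
  sale_date,
  amount AS today_sales,
  prev_day_sales,
  ROUND(100.0 * (amount - prev_day_sales) / NULLIF(prev_day_sales, 0), 1) AS pct_change
FROM with_prev
ORDER BY salesperson, sale_date;
-- Same pct_change values as above, with the window function written once.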
Challenge 4: Running Total
For each salesperson, compute a running total of sales over time and each day's cumulative share of their overall total.
Solution
SELECT
salesperson,
sale_date,
amount,
SUM(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
ROUND(
100.0 * SUM(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) / SUM(amount) OVER (PARTITION BY salesperson), 1
) AS cumulative_pct
FROM daily_sales
ORDER BY salesperson, sale_date;
-- Result (Alice's rows):
-- salesperson | sale_date | amount | running_total | cumulative_pct
-- ------------+------------+--------+---------------+---------------
-- Alice | 2024-01-01 | 1500 | 1500 | 17.6
-- Alice | 2024-01-02 | 2100 | 3600 | 42.4
-- Alice | 2024-01-03 | 1700 | 5300 | 62.4
-- Alice | 2024-01-04 | 1900 | 7200 | 84.7
-- Alice | 2024-01-05 | 1300 | 8500 | 100.0
Key insight: when an OVER clause has an ORDER BY but no explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, not ROWS; with tied ORDER BY keys, RANGE includes all peer rows at once. Spelling out ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW avoids that surprise and shows interviewers you understand window frames. Without ORDER BY, SUM() OVER (PARTITION BY ...) returns the total for the entire partition, which is how we compute the denominator for the cumulative percentage.
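The ROWS/RANGE difference only shows up when the ORDER BY key has duplicates. A minimal sketch on made-up values (PostgreSQL-style VALUES list):
SELECT d, amt,
  SUM(amt) OVER (ORDER BY d ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total,
  SUM(amt) OVER (ORDER BY d RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_total
FROM (VALUES (1, 100), (2, 200), (2, 300), (3, 400)) AS t(d, amt);
-- rows_total:  100, 300, 600, 1000  (one row at a time; order within the tie is arbitrary)
-- range_total: 100, 600, 600, 1000  (both d = 2 peers enter the frame together)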
Challenge 5: Moving Average
Compute company-wide daily totals, a 3-day moving average, and whether each day lands above or below that average.
Solution
WITH daily_totals AS (
SELECT
sale_date,
SUM(amount) AS daily_total
FROM daily_sales
GROUP BY sale_date
)
SELECT
sale_date,
daily_total,
ROUND(AVG(daily_total) OVER (
ORDER BY sale_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 2) AS moving_avg_3d,
CASE
WHEN daily_total > AVG(daily_total) OVER (
ORDER BY sale_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) THEN 'Above'
ELSE 'Below'
END AS vs_moving_avg
FROM daily_totals
ORDER BY sale_date;
-- Result:
-- sale_date | daily_total | moving_avg_3d | vs_moving_avg
-- ------------+-------------+---------------+--------------
-- 2024-01-01 | 3600 | 3600.00 | Below
-- 2024-01-02 | 4700 | 4150.00 | Above
-- 2024-01-03 | 5000 | 4433.33 | Above
-- 2024-01-04 | 5400 | 5033.33 | Above
-- 2024-01-05 | 4700 | 5033.33 | Below
Key insight: ROWS BETWEEN 2 PRECEDING AND CURRENT ROW creates a window of up to 3 rows. For the first row only 1 row is in the window; for the second, 2 rows. These are "partial windows" at the boundary, and they also explain why 2024-01-01 reads 'Below': its single-row average equals the daily total, and equality falls into the ELSE branch. If you need exactly 3 days, filter out the first 2 days with a WHERE clause, or null out partial windows with CASE WHEN COUNT(*) OVER (...) = 3 THEN avg ELSE NULL END, as sketched below.
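A sketch of the strict 3-day variant, reusing the daily_totals CTE and a named WINDOW clause (standard SQL, supported by PostgreSQL and MySQL 8+) so the frame is written only once:
WITH daily_totals AS (
  SELECT sale_date, SUM(amount) AS daily_total
  FROM daily_sales
  GROUP BY sale_date
)
SELECT
  sale_date,
  daily_total,
  CASE WHEN COUNT(*) OVER w = 3
       THEN ROUND(AVG(daily_total) OVER w, 2)
  END AS moving_avg_3d_strict   -- NULL until a full 3-day window exists
FROM daily_totals
WINDOW w AS (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY sale_date;
-- The first two rows get NULL; 2024-01-03 onward matches moving_avg_3d above.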
Challenge 6: NTILE and Percentile Ranking
Bucket all 15 sales into quartiles by amount, and compute each sale's percentile rank and cumulative distribution.
Solution
SELECT
salesperson,
sale_date,
amount,
NTILE(4) OVER (ORDER BY amount) AS quartile,
ROUND(PERCENT_RANK() OVER (ORDER BY amount), 3) AS pct_rank,
ROUND(CUME_DIST() OVER (ORDER BY amount), 3) AS cume_dist
FROM daily_sales
ORDER BY amount;
-- Result:
-- salesperson | sale_date | amount | quartile | pct_rank | cume_dist
-- ------------+------------+--------+----------+----------+----------
-- Bob | 2024-01-02 | 800 | 1 | 0.000 | 0.067
-- Charlie | 2024-01-01 | 900 | 1 | 0.071 | 0.133
-- Charlie | 2024-01-03 | 1100 | 1 | 0.143 | 0.200
-- Bob | 2024-01-01 | 1200 | 1 | 0.214 | 0.267
-- Alice | 2024-01-05 | 1300 | 2 | 0.286 | 0.333
-- Alice | 2024-01-01 | 1500 | 2 | 0.357 | 0.467
-- Bob | 2024-01-04 | 1500 | 2 | 0.357 | 0.467
-- Charlie | 2024-01-05 | 1600 | 2 | 0.500 | 0.533
-- Alice | 2024-01-03 | 1700 | 3 | 0.571 | 0.600
-- Bob | 2024-01-05 | 1800 | 3 | 0.643 | 0.733
-- Charlie | 2024-01-02 | 1800 | 3 | 0.643 | 0.733
-- Alice | 2024-01-04 | 1900 | 3 | 0.786 | 0.800
-- Charlie | 2024-01-04 | 2000 | 4 | 0.857 | 0.867
-- Alice | 2024-01-02 | 2100 | 4 | 0.929 | 0.933
-- Bob | 2024-01-03 | 2200 | 4 | 1.000 | 1.000
Key differences: NTILE(4) divides rows into 4 roughly equal groups (here 4, 4, 4, and 3 rows). PERCENT_RANK is (rank - 1) / (total_rows - 1), ranging from 0 to 1. CUME_DIST is the count of rows <= the current row, divided by total rows. These are essential for ML tasks like creating bucketed features and detecting outliers, e.g. flagging anything above a percentile cutoff, as sketched below.
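A closing sketch of that outlier flag; the 0.9 cutoff is illustrative (with only 15 rows, a 99th-percentile filter would keep at most one row):
SELECT salesperson, sale_date, amount
FROM (
  SELECT
    salesperson,
    sale_date,
    amount,
    CUME_DIST() OVER (ORDER BY amount) AS cd
  FROM daily_sales
) scored
WHERE cd >= 0.9   -- keep the top ~10% of sale amounts
ORDER BY amount DESC;
-- Returns Bob's 2200 (cume_dist 1.000) and Alice's 2100 (0.933) on this data.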