SQL for Data/ML Interviews
SQL is the single most tested skill in data engineering and ML engineering interviews. This lesson covers what interviewers expect, how SQL dialects differ, and the systematic approach that top candidates use to solve problems under pressure.
Why SQL Dominates Data Interviews
Every data role — data engineer, data scientist, ML engineer, analytics engineer — requires SQL. It is the universal language of data. Interviewers test SQL because it reveals how you think about data transformations, set operations, and performance trade-offs.
Data Pipelines
ETL/ELT pipelines are built on SQL. dbt, Airflow SQL operators, and Spark SQL all require fluent SQL skills that interviewers test directly.
Feature Engineering
ML feature stores use SQL to compute features at scale. Window functions, CTEs, and aggregations are the building blocks of production feature pipelines.
Ad-Hoc Analysis
Product decisions are driven by SQL queries. Interviewers test whether you can translate a business question into a correct, efficient query quickly.
What Companies Actually Test
SQL interview difficulty varies by company and role. Here is what to expect:
| Company Type | Difficulty | Focus Areas | Format |
|---|---|---|---|
| Google / Meta DE | Medium-Hard | Window functions, CTEs, self-joins, optimization | Online editor or whiteboard |
| Amazon Data | Medium | Joins, aggregation, CASE WHEN, subqueries | HackerRank or live coding |
| Fintech / Quant | Hard | Gaps/islands, sessionization, recursive CTEs | Live coding with real data |
| Startups | Easy-Medium | Basic joins, GROUP BY, filtering | Take-home or live |
| Consulting / BI | Medium | Pivot/unpivot, reporting queries, date math | Case study + SQL |
SQL Dialect Differences That Matter
Most interviews accept "standard SQL," but knowing dialect differences shows depth. Here are the key differences:
```sql
-- String concatenation
-- PostgreSQL / MySQL 8+: CONCAT(first_name, ' ', last_name)
-- SQL Server: first_name + ' ' + last_name
-- Oracle: first_name || ' ' || last_name

-- Limiting results
-- PostgreSQL / MySQL: LIMIT 10
-- SQL Server: TOP 10 (in SELECT) or OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
-- Oracle: FETCH FIRST 10 ROWS ONLY (12c+) or ROWNUM <= 10

-- Date functions
-- PostgreSQL: CURRENT_DATE, DATE_TRUNC('month', created_at)
-- MySQL: CURDATE(), DATE_FORMAT(created_at, '%Y-%m-01')
-- SQL Server: GETDATE(), DATETRUNC(month, created_at) -- DATETRUNC is SQL Server 2022+

-- NULLs in aggregation
-- All dialects: COUNT(*) counts all rows, COUNT(col) excludes NULLs
-- SUM/AVG ignore NULLs silently (this is standard behavior)

-- UPSERT / MERGE
-- PostgreSQL: INSERT ... ON CONFLICT DO UPDATE
-- MySQL: INSERT ... ON DUPLICATE KEY UPDATE
-- SQL Server: MERGE ... WHEN MATCHED THEN UPDATE
```
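The NULL-handling rules above are easy to verify yourself. This is a minimal sketch using Python's built-in sqlite3 module with an invented scores table; the behavior shown (COUNT(*) vs COUNT(col), and SUM/AVG skipping NULLs) is the same across the major dialects:

```python
# Demonstrates NULL behavior in aggregates using SQLite.
# Table and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [(1, 10), (2, None), (3, 20)])

row = conn.execute("""
    SELECT COUNT(*),      -- counts every row, including the NULL score: 3
           COUNT(score),  -- skips the NULL score: 2
           SUM(score),    -- NULLs ignored: 10 + 20 = 30
           AVG(score)     -- divides by the non-NULL count: 30 / 2 = 15.0
    FROM scores
""").fetchone()

print(row)  # (3, 2, 30, 15.0)
```

Note that AVG divides by the non-NULL count, not the row count; if you want NULLs treated as zero, you must say so explicitly with COALESCE(score, 0).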
The 5-Step Approach for SQL Challenges
Use this framework for every SQL problem in this course and in real interviews:
| Step | Time | What to Do |
|---|---|---|
| 1. Understand the Schema | 2 min | Identify tables, primary keys, foreign keys, and the grain of each table (one row = one what?). Ask about NULLs. |
| 2. Clarify the Output | 1 min | What columns should the result have? What is the expected row count? Should duplicates be included? |
| 3. Plan the Query | 2 min | State which operations you need: joins, aggregations, window functions, CTEs. Sketch the query structure verbally. |
| 4. Write the SQL | 10 min | Write clean, readable SQL. Use CTEs for complex logic. Alias tables and columns clearly. |
| 5. Verify & Optimize | 3 min | Walk through with sample data. Check edge cases (NULLs, empty groups, ties). Discuss indexing if asked. |
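The five steps can be rehearsed end to end on a toy question. This sketch uses a hypothetical question ("which customers placed more than one order?") and an invented orders table, with each step marked as a comment:

```python
# Walking the 5-step framework on a small invented example (SQLite).
import sqlite3

conn = sqlite3.connect(":memory:")
# Step 1 - Understand the schema: one row in orders = one order.
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100), (2, 100), (3, 200), (4, 300), (5, 300), (6, 300)])

# Step 2 - Clarify the output: customer_id plus order count,
#          one row per customer with more than one order.
# Step 3 - Plan the query: aggregate by customer, filter groups with HAVING.
# Step 4 - Write the SQL.
rows = conn.execute("""
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) > 1
    ORDER BY customer_id
""").fetchall()

# Step 5 - Verify against the sample data: 100 has 2 orders, 300 has 3,
#          and 200 is correctly filtered out.
print(rows)  # [(100, 2), (300, 3)]
```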
Common SQL Patterns in Interviews
These patterns appear repeatedly across companies. Master them and you will be able to handle roughly 80% of SQL interview questions:
```sql
-- Pattern 1: Top-N per group (very common)
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn <= 3;

-- Pattern 2: Year-over-year comparison
SELECT
    curr.year,
    curr.revenue,
    prev.revenue AS prev_revenue,
    ROUND(100.0 * (curr.revenue - prev.revenue) / prev.revenue, 2) AS yoy_growth_pct
FROM yearly_revenue curr
LEFT JOIN yearly_revenue prev ON curr.year = prev.year + 1;

-- Pattern 3: Running total
SELECT
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

-- Pattern 4: Find duplicates
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Pattern 5: Anti-join (find rows with no match)
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
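You can check any of these patterns locally before an interview. As one example, this sketch runs the anti-join pattern against invented customers and orders tables in SQLite; the key point is that the LEFT JOIN leaves o.customer_id NULL exactly for customers with no orders:

```python
# Quick check of the anti-join pattern on toy data (SQLite).
# Table contents are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, 'Ann'), (2, 'Ben'), (3, 'Cy')])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 3)])

# Ben (customer 2) has no orders, so only his row survives the NULL filter.
rows = conn.execute("""
    SELECT c.customer_id, c.name
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NULL
""").fetchall()

print(rows)  # [(2, 'Ben')]
```

The same anti-join can be written with NOT EXISTS, which many interviewers prefer because it is NULL-safe when the join key itself can be NULL.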
Course Overview
Each lesson in this course follows a consistent structure:
- Schema setup — CREATE TABLE and INSERT statements you can paste into any SQL editor and run immediately
- Problem statement — Clear description of the expected output, modeled after real interview questions
- Solution with explanation — Complete SQL with line-by-line commentary on why each clause exists
- Expected output — The exact result set so you can verify your own solution
- Performance notes — Indexing and optimization considerations for production use
Quick Self-Assessment
Before starting, try this problem without looking at the solution:
Given an employees table with columns id, name, department, and salary, find the highest-paid employee in each department. Return their name, department, and salary.

```sql
-- Schema
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    department VARCHAR(50),
    salary DECIMAL(10, 2)
);

INSERT INTO employees VALUES
    (1, 'Alice', 'Engineering', 120000),
    (2, 'Bob', 'Engineering', 115000),
    (3, 'Charlie', 'Sales', 95000),
    (4, 'Diana', 'Sales', 98000),
    (5, 'Eve', 'Marketing', 88000),
    (6, 'Frank', 'Marketing', 92000);

-- Approach 1: Window function (cleanest; RANK keeps ties)
SELECT name, department, salary
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk
    FROM employees
) ranked
WHERE rk = 1;

-- Approach 2: Correlated subquery
SELECT name, department, salary
FROM employees e
WHERE salary = (
    SELECT MAX(salary) FROM employees WHERE department = e.department
);

-- Approach 3: JOIN with aggregation
SELECT e.name, e.department, e.salary
FROM employees e
INNER JOIN (
    SELECT department, MAX(salary) AS max_salary
    FROM employees
    GROUP BY department
) m ON e.department = m.department AND e.salary = m.max_salary;

-- Result:
-- name  | department  | salary
-- ------+-------------+---------
-- Alice | Engineering | 120000
-- Diana | Sales       | 98000
-- Frank | Marketing   | 92000
```
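If you want to confirm your own attempt, all three approaches can be run side by side in SQLite (the window-function approach needs SQLite 3.25+, which ships with current Python). This sketch loads the sample data above and checks that the three queries agree:

```python
# Verifies that the three solution approaches return the same rows (SQLite).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees "
             "(id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    (1, 'Alice', 'Engineering', 120000), (2, 'Bob', 'Engineering', 115000),
    (3, 'Charlie', 'Sales', 95000), (4, 'Diana', 'Sales', 98000),
    (5, 'Eve', 'Marketing', 88000), (6, 'Frank', 'Marketing', 92000),
])

queries = {
    "window": """SELECT name, department, salary FROM (
                     SELECT *, RANK() OVER
                         (PARTITION BY department ORDER BY salary DESC) AS rk
                     FROM employees) ranked
                 WHERE rk = 1""",
    "correlated": """SELECT name, department, salary FROM employees e
                     WHERE salary = (SELECT MAX(salary) FROM employees
                                     WHERE department = e.department)""",
    "join": """SELECT e.name, e.department, e.salary FROM employees e
               JOIN (SELECT department, MAX(salary) AS max_salary
                     FROM employees GROUP BY department) m
                 ON e.department = m.department AND e.salary = m.max_salary""",
}

# Sort each result set so row order does not affect the comparison.
results = {k: sorted(conn.execute(q).fetchall()) for k, q in queries.items()}
assert results["window"] == results["correlated"] == results["join"]
print(results["window"])
# [('Alice', 'Engineering', 120000.0), ('Diana', 'Sales', 98000.0),
#  ('Frank', 'Marketing', 92000.0)]
```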
Why this matters for ML: Finding top-K items per group is fundamental in recommendation systems, model evaluation, and A/B testing. The window function approach is used daily for tasks like finding the best model per experiment or the top features per category.