Beginner

SQL for Data/ML Interviews

SQL is the single most tested skill in data engineering and ML engineering interviews. This lesson covers what interviewers expect, how SQL dialects differ, and the systematic approach that top candidates use to solve problems under pressure.

Why SQL Dominates Data Interviews

Every data role — data engineer, data scientist, ML engineer, analytics engineer — requires SQL. It is the universal language of data. Interviewers test SQL because it reveals how you think about data transformations, set operations, and performance trade-offs.

Data Pipelines

ETL/ELT pipelines are built on SQL. dbt, Airflow SQL operators, and Spark SQL all require fluent SQL skills that interviewers test directly.

Feature Engineering

ML feature stores use SQL to compute features at scale. Window functions, CTEs, and aggregations are the building blocks of production feature pipelines.
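
For example, a trailing 7-day spend feature is one window function away. A minimal sketch, assuming a hypothetical purchases table with user_id, purchase_date, and amount columns (PostgreSQL 11+ syntax):

-- Trailing 7-day spend per user, one row per purchase
-- (hypothetical purchases(user_id, purchase_date, amount) table)
SELECT
    user_id,
    purchase_date,
    SUM(amount) OVER (
        PARTITION BY user_id
        ORDER BY purchase_date
        RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
    ) AS spend_7d
FROM purchases;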

Ad-Hoc Analysis

Product decisions are driven by SQL queries. Interviewers test whether you can translate a business question into a correct, efficient query quickly.
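
For instance, "What share of last month's signups placed at least one order?" becomes a few lines. A sketch, assuming hypothetical users and orders tables (PostgreSQL date math):

-- Share of last month's signups with at least one order
-- (hypothetical users(id, signup_date) and orders(user_id) tables)
SELECT
    ROUND(100.0 * COUNT(DISTINCT o.user_id) / COUNT(DISTINCT u.id), 1) AS conversion_pct
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.signup_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
  AND u.signup_date <  DATE_TRUNC('month', CURRENT_DATE);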

What Companies Actually Test

SQL interview difficulty varies by company and role. Here is what to expect:

Company Type     | Difficulty  | Focus Areas                                      | Format
Google / Meta DE | Medium-Hard | Window functions, CTEs, self-joins, optimization | Online editor or whiteboard
Amazon Data      | Medium      | Joins, aggregation, CASE WHEN, subqueries        | HackerRank or live coding
Fintech / Quant  | Hard        | Gaps/islands, sessionization, recursive CTEs     | Live coding with real data
Startups         | Easy-Medium | Basic joins, GROUP BY, filtering                 | Take-home or live
Consulting / BI  | Medium      | Pivot/unpivot, reporting queries, date math      | Case study + SQL
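
The fintech/quant row is where most candidates stumble. Sessionization, for instance, typically comes down to LAG plus a running sum. A minimal sketch, assuming a hypothetical events table with user_id and event_time columns and a 30-minute inactivity cutoff (PostgreSQL syntax):

-- Sessionization: a new session starts after 30 minutes of inactivity
-- (hypothetical events(user_id, event_time) table)
WITH flagged AS (
    SELECT
        user_id,
        event_time,
        CASE
            WHEN LAG(event_time) OVER w IS NULL
              OR event_time - LAG(event_time) OVER w > INTERVAL '30 minutes'
            THEN 1 ELSE 0
        END AS new_session
    FROM events
    WINDOW w AS (PARTITION BY user_id ORDER BY event_time)
)
SELECT
    user_id,
    event_time,
    SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM flagged;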

SQL Dialect Differences That Matter

Most interviews accept "standard SQL," but knowing where dialects diverge shows depth. Here are the key differences:

-- String concatenation
-- PostgreSQL / MySQL 8+:  CONCAT(first_name, ' ', last_name)
-- SQL Server:             first_name + ' ' + last_name
-- Oracle:                 first_name || ' ' || last_name

-- Limiting results
-- PostgreSQL / MySQL:     LIMIT 10
-- SQL Server:             TOP 10 (in SELECT) or ORDER BY ... OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
-- Oracle:                 FETCH FIRST 10 ROWS ONLY (12c+) or ROWNUM <= 10

-- Date functions
-- PostgreSQL:             CURRENT_DATE, DATE_TRUNC('month', created_at)
-- MySQL:                  CURDATE(), DATE_FORMAT(created_at, '%Y-%m-01')
-- SQL Server:             GETDATE(), DATETRUNC(month, created_at) -- SQL Server 2022+

-- NULLs in aggregation
-- All dialects:           COUNT(*) counts all rows, COUNT(col) excludes NULLs
--                         SUM/AVG ignore NULLs silently (this is standard behavior)

-- UPSERT / MERGE
-- PostgreSQL:             INSERT ... ON CONFLICT DO UPDATE
-- MySQL:                  INSERT ... ON DUPLICATE KEY UPDATE
-- SQL Server:             MERGE ... WHEN MATCHED THEN UPDATE
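
To make one of these concrete, the PostgreSQL form runs as written below. A sketch, assuming a hypothetical users table with a unique email column:

-- PostgreSQL upsert: insert, or update the name if the email already exists
-- (hypothetical users(email UNIQUE, name) table)
INSERT INTO users (email, name)
VALUES ('a@example.com', 'Alice')
ON CONFLICT (email)
DO UPDATE SET name = EXCLUDED.name;
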
💡 Interview tip: When an interviewer gives you a SQL problem, always ask: (1) Which SQL dialect? (2) Can I use window functions? (3) What is the approximate data size? These clarifying questions show you think about practical constraints, not just correctness.

The 5-Step Approach for SQL Challenges

Use this framework for every SQL problem in this course and in real interviews:

Step                     | Time   | What to Do
1. Understand the Schema | 2 min  | Identify tables, primary keys, foreign keys, and the grain of each table (one row = one what?). Ask about NULLs.
2. Clarify the Output    | 1 min  | What columns should the result have? What is the expected row count? Should duplicates be included?
3. Plan the Query        | 2 min  | State which operations you need: joins, aggregations, window functions, CTEs. Sketch the query structure verbally.
4. Write the SQL         | 10 min | Write clean, readable SQL. Use CTEs for complex logic. Alias tables and columns clearly.
5. Verify & Optimize     | 3 min  | Walk through with sample data. Check edge cases (NULLs, empty groups, ties). Discuss indexing if asked.
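
Steps 3 and 4 usually take the same shape: one CTE per logical stage, assembled by a final SELECT. A hypothetical sketch (monthly revenue per region, assuming an orders table with region, order_date, amount, and status columns):

-- One CTE per logical stage keeps each step verifiable on its own
WITH completed_orders AS (
    SELECT region, DATE_TRUNC('month', order_date) AS order_month, amount
    FROM orders
    WHERE status = 'completed'          -- stage 1: filter to relevant rows
),
monthly_revenue AS (
    SELECT region, order_month, SUM(amount) AS revenue
    FROM completed_orders
    GROUP BY region, order_month        -- stage 2: aggregate to the output grain
)
SELECT region, order_month, revenue
FROM monthly_revenue
ORDER BY region, order_month;           -- stage 3: final projection and ordering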

Common SQL Patterns in Interviews

These patterns appear repeatedly across companies. Master them and you will handle 80% of SQL interview questions:

-- Pattern 1: Top-N per group (very common)
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn <= 3;

-- Pattern 2: Year-over-year comparison
SELECT
    curr.year,
    curr.revenue,
    prev.revenue AS prev_revenue,
    ROUND(100.0 * (curr.revenue - prev.revenue) / prev.revenue, 2) AS yoy_growth_pct
FROM yearly_revenue curr
LEFT JOIN yearly_revenue prev ON curr.year = prev.year + 1;
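-- Note: the earliest year has no previous row, so prev_revenue and yoy_growth_pct are NULL for it.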

-- Pattern 3: Running total
SELECT
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;
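-- Note: with duplicate order_date values, the default RANGE frame adds all peer rows at once;
-- use ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for strict row-by-row totals.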

-- Pattern 4: Find duplicates
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Pattern 5: Anti-join (find rows with no match)
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
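
One subtlety in Pattern 1: the ranking function decides how ties behave, and interviewers probe this often. A quick self-contained demo (PostgreSQL VALUES syntax):

-- Tie behavior of the three ranking functions
SELECT
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,    -- 1, 2, 3, 4 (ties broken arbitrarily)
    RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- 1, 2, 2, 4 (ties share a rank, gaps follow)
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk   -- 1, 2, 2, 3 (ties share a rank, no gaps)
FROM (VALUES (300), (200), (200), (100)) AS t(salary);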

Course Overview

Each lesson in this course follows a consistent structure:

  • Schema setup — CREATE TABLE and INSERT statements you can paste into any SQL editor and run immediately
  • Problem statement — Clear description of the expected output, modeled after real interview questions
  • Solution with explanation — Complete SQL with line-by-line commentary on why each clause exists
  • Expected output — The exact result set so you can verify your own solution
  • Performance notes — Indexing and optimization considerations for production use
📝 How to use this course: Type out every query yourself in a SQL editor (PostgreSQL recommended, or use SQLite/MySQL). Modify the data and re-solve. After finishing a lesson, try solving each challenge again from memory. Repetition builds the fluency that makes interviews feel natural.

Quick Self-Assessment

Before starting, try this problem without looking at the solution:

📝 Problem: Given an employees table with columns id, name, department, and salary, find the highest-paid employee in each department. Return their name, department, and salary.
-- Schema
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    department VARCHAR(50),
    salary DECIMAL(10, 2)
);

INSERT INTO employees VALUES
(1, 'Alice', 'Engineering', 120000),
(2, 'Bob', 'Engineering', 115000),
(3, 'Charlie', 'Sales', 95000),
(4, 'Diana', 'Sales', 98000),
(5, 'Eve', 'Marketing', 88000),
(6, 'Frank', 'Marketing', 92000);

-- Approach 1: Window function (cleanest)
SELECT name, department, salary
FROM (
    SELECT *, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk
    FROM employees
) ranked
WHERE rk = 1;

-- Approach 2: Correlated subquery
SELECT name, department, salary
FROM employees e
WHERE salary = (
    SELECT MAX(salary) FROM employees WHERE department = e.department
);

-- Approach 3: JOIN with aggregation
SELECT e.name, e.department, e.salary
FROM employees e
INNER JOIN (
    SELECT department, MAX(salary) AS max_salary
    FROM employees
    GROUP BY department
) m ON e.department = m.department AND e.salary = m.max_salary;
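
-- Note: all three approaches return ties (every employee at the department's top salary).
-- In Approach 1, swap RANK() for ROW_NUMBER() to force exactly one row per department.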

-- Result:
-- name    | department  | salary
-- --------+-------------+---------
-- Alice   | Engineering | 120000
-- Diana   | Sales       |  98000
-- Frank   | Marketing   |  92000

Why this matters for ML: Finding top-K items per group is fundamental in recommendation systems, model evaluation, and A/B testing. The window function approach is used daily for tasks like finding the best model per experiment or the top features per category.