Joins & Subqueries
Joins are the most tested SQL topic in data engineering interviews. These 6 challenges cover every join pattern you need to know, including the self-join and anti-join patterns that trip up most candidates.
Shared Schema for This Lesson
CREATE TABLE employees (
emp_id INT PRIMARY KEY,
name VARCHAR(100),
department VARCHAR(50),
salary DECIMAL(10,2),
manager_id INT,
hire_date DATE
);
INSERT INTO employees VALUES
(1, 'Alice', 'Engineering', 130000, NULL, '2020-03-15'),
(2, 'Bob', 'Engineering', 115000, 1, '2021-06-01'),
(3, 'Charlie', 'Engineering', 105000, 1, '2022-01-10'),
(4, 'Diana', 'Sales', 95000, NULL, '2019-11-20'),
(5, 'Eve', 'Sales', 88000, 4, '2021-04-15'),
(6, 'Frank', 'Sales', 82000, 4, '2023-02-01'),
(7, 'Grace', 'Marketing', 98000, NULL, '2020-08-10'),
(8, 'Hank', 'Marketing', 75000, 7, '2022-09-15');
CREATE TABLE projects (
project_id INT PRIMARY KEY,
project_name VARCHAR(100),
department VARCHAR(50),
budget DECIMAL(12,2),
start_date DATE,
end_date DATE
);
INSERT INTO projects VALUES
(1, 'ML Pipeline', 'Engineering', 500000, '2024-01-01', '2024-06-30'),
(2, 'Data Lake', 'Engineering', 300000, '2024-03-01', '2024-09-30'),
(3, 'CRM Upgrade', 'Sales', 150000, '2024-02-01', '2024-05-31'),
(4, 'Brand Refresh', 'Marketing', 200000, '2024-01-15', '2024-04-30'),
(5, 'API Gateway', 'Engineering', 250000, '2024-06-01', '2024-12-31');
CREATE TABLE project_assignments (
emp_id INT,
project_id INT,
role VARCHAR(50),
hours_per_week INT,
PRIMARY KEY (emp_id, project_id)
);
INSERT INTO project_assignments VALUES
(1, 1, 'Lead', 20), (2, 1, 'Developer', 30), (3, 1, 'Developer', 35),
(1, 2, 'Architect', 15), (3, 2, 'Developer', 10),
(4, 3, 'Lead', 25), (5, 3, 'Analyst', 30),
(7, 4, 'Lead', 20), (8, 4, 'Designer', 35);
Challenge 1: INNER JOIN — Employees and Their Projects
Solution
SELECT
e.name,
e.department,
p.project_name,
pa.role,
pa.hours_per_week,
p.budget
FROM employees e
INNER JOIN project_assignments pa ON e.emp_id = pa.emp_id
INNER JOIN projects p ON pa.project_id = p.project_id
ORDER BY e.name, p.project_name;
-- Result:
-- name | department | project_name | role | hours_per_week | budget
-- --------+-------------+--------------+-----------+----------------+--------
-- Alice | Engineering | Data Lake | Architect | 15 | 300000
-- Alice | Engineering | ML Pipeline | Lead | 20 | 500000
-- Bob | Engineering | ML Pipeline | Developer | 30 | 500000
-- Charlie | Engineering | Data Lake | Developer | 10 | 300000
-- Charlie | Engineering | ML Pipeline | Developer | 35 | 500000
-- Diana | Sales | CRM Upgrade | Lead | 25 | 150000
-- Eve | Sales | CRM Upgrade | Analyst | 30 | 150000
-- Grace | Marketing | Brand Refresh| Lead | 20 | 200000
-- Hank | Marketing | Brand Refresh| Designer | 35 | 200000
Key insight: INNER JOIN only returns rows that match in both tables. Frank (emp_id=6) is missing because he has no project assignment. The API Gateway project (project_id=5) is missing because no one is assigned to it.
Challenge 2: LEFT JOIN — Find Unassigned Employees
Solution
SELECT
e.name,
e.department,
COALESCE(SUM(pa.hours_per_week), 0) AS total_hours,
COUNT(pa.project_id) AS num_projects
FROM employees e
LEFT JOIN project_assignments pa ON e.emp_id = pa.emp_id
GROUP BY e.emp_id, e.name, e.department
ORDER BY total_hours DESC;
-- Result:
-- name | department | total_hours | num_projects
-- --------+-------------+-------------+-------------
-- Charlie | Engineering | 45 | 2
-- Hank | Marketing | 35 | 1
-- Alice | Engineering | 35 | 2
-- Bob | Engineering | 30 | 1
-- Eve | Sales | 30 | 1
-- Diana | Sales | 25 | 1
-- Grace | Marketing | 20 | 1
-- Frank | Sales | 0 | 0
Key insight: Use COUNT(pa.project_id) instead of COUNT(*). COUNT(*) would return 1 for Frank (counting the NULL row from the LEFT JOIN), while COUNT(column) correctly returns 0 because it skips NULLs.
Challenge 3: Self-Join — Employees and Their Managers
Solution
SELECT
e.name AS employee,
e.department,
e.salary AS emp_salary,
COALESCE(m.name, 'CEO/Director') AS manager,
COALESCE(m.salary, 0) AS mgr_salary,
CASE
WHEN m.salary IS NOT NULL THEN e.salary - m.salary
ELSE NULL
END AS salary_diff
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id
ORDER BY e.department, e.salary DESC;
-- Result:
-- employee | department | emp_salary | manager | mgr_salary | salary_diff
-- ---------+-------------+------------+--------------+------------+------------
-- Alice | Engineering | 130000 | CEO/Director | 0 | NULL
-- Bob | Engineering | 115000 | Alice | 130000 | -15000
-- Charlie | Engineering | 105000 | Alice | 130000 | -25000
-- Grace | Marketing | 98000 | CEO/Director | 0 | NULL
-- Hank | Marketing | 75000 | Grace | 98000 | -23000
-- Diana | Sales | 95000 | CEO/Director | 0 | NULL
-- Eve | Sales | 88000 | Diana | 95000 | -7000
-- Frank | Sales | 82000 | Diana | 95000 | -13000
Key insight: A self-join joins a table to itself using different aliases (e and m). The LEFT JOIN ensures top-level managers (with NULL manager_id) are included. Self-joins appear in almost every data engineering interview because they test whether you can reason about table relationships within a single table.
Challenge 4: Correlated Subquery — Above-Average Earners per Department
Solution
-- Approach 1: Correlated subquery
SELECT
e.name,
e.department,
e.salary,
(SELECT ROUND(AVG(salary), 2) FROM employees WHERE department = e.department) AS dept_avg,
ROUND(e.salary - (SELECT AVG(salary) FROM employees WHERE department = e.department), 2) AS above_avg
FROM employees e
WHERE e.salary > (
SELECT AVG(salary)
FROM employees
WHERE department = e.department
)
ORDER BY above_avg DESC;
-- Approach 2: Window function (preferred for performance)
SELECT name, department, salary, dept_avg, ROUND(salary - dept_avg, 2) AS above_avg
FROM (
SELECT *, ROUND(AVG(salary) OVER (PARTITION BY department), 2) AS dept_avg
FROM employees
) with_avg
WHERE salary > dept_avg
ORDER BY above_avg DESC;
-- Result:
-- name | department | salary | dept_avg | above_avg
-- ------+-------------+--------+-----------+----------
-- Alice | Engineering | 130000 | 116666.67 | 13333.33
-- Diana | Sales | 95000 | 88333.33 | 6666.67
-- Grace | Marketing | 98000 | 86500.00 | 11500.00
Performance note: The correlated subquery executes the inner query once per row in the outer query. For large tables, the window function approach is significantly faster because it computes the average in a single pass. Always mention this trade-off in interviews.
Challenge 5: EXISTS — Departments with Active Projects
Solution
SELECT e.name, e.department, e.salary
FROM employees e
WHERE EXISTS (
SELECT 1
FROM projects p
WHERE p.department = e.department
AND p.budget > 200000
)
ORDER BY e.department, e.salary DESC;
-- Result:
-- name | department | salary
-- --------+-------------+--------
-- Alice | Engineering | 130000
-- Bob | Engineering | 115000
-- Charlie | Engineering | 105000
-- Alternative using IN (equivalent behavior):
SELECT name, department, salary
FROM employees
WHERE department IN (
SELECT department FROM projects WHERE budget > 200000
)
ORDER BY department, salary DESC;
Key insight: EXISTS returns TRUE as soon as it finds a matching row — it does not scan the entire subquery. This makes it faster than IN for large subquery result sets. Use EXISTS when you only need to check for existence, not retrieve values from the subquery. In most modern databases, the optimizer treats IN and EXISTS identically, but stating your intent with EXISTS shows interview maturity.
Challenge 6: Anti-Join — Projects with No Assignments
Solution
-- Approach 1: LEFT JOIN + IS NULL (most common in interviews)
SELECT p.project_id, p.project_name, p.department, p.budget
FROM projects p
LEFT JOIN project_assignments pa ON p.project_id = pa.project_id
WHERE pa.emp_id IS NULL;
-- Approach 2: NOT EXISTS (safest with NULLs)
SELECT p.project_id, p.project_name, p.department, p.budget
FROM projects p
WHERE NOT EXISTS (
SELECT 1
FROM project_assignments pa
WHERE pa.project_id = p.project_id
);
-- Approach 3: NOT IN (dangerous with NULLs!)
SELECT project_id, project_name, department, budget
FROM projects
WHERE project_id NOT IN (
SELECT project_id FROM project_assignments
);
-- Result (all three):
-- project_id | project_name | department | budget
-- -----------+--------------+-------------+--------
-- 5 | API Gateway | Engineering | 250000
SELECT project_id FROM project_assignments returns any NULL values, NOT IN returns zero rows for every outer row. This is because x NOT IN (1, 2, NULL) evaluates to x != 1 AND x != 2 AND x != NULL, and x != NULL is always NULL (unknown), making the entire expression NULL. Use NOT EXISTS or LEFT JOIN to avoid this trap. This is a classic interview gotcha.
Lilly Tech Systems