Intermediate

Joins & Subqueries

Joins are the most tested SQL topic in data engineering interviews. These 6 challenges cover every join pattern you need to know, including the self-join and anti-join patterns that trip up most candidates.

💡
Why this matters for ML: Feature engineering across multiple tables is a daily task. Joining user events with user profiles, products with categories, and predictions with ground truth requires fluent join skills. Wrong join types cause silent data leakage in ML training sets.

Shared Schema for This Lesson

CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    name VARCHAR(100),
    department VARCHAR(50),
    salary DECIMAL(10,2),
    manager_id INT,
    hire_date DATE
);

INSERT INTO employees VALUES
(1, 'Alice', 'Engineering', 130000, NULL, '2020-03-15'),
(2, 'Bob', 'Engineering', 115000, 1, '2021-06-01'),
(3, 'Charlie', 'Engineering', 105000, 1, '2022-01-10'),
(4, 'Diana', 'Sales', 95000, NULL, '2019-11-20'),
(5, 'Eve', 'Sales', 88000, 4, '2021-04-15'),
(6, 'Frank', 'Sales', 82000, 4, '2023-02-01'),
(7, 'Grace', 'Marketing', 98000, NULL, '2020-08-10'),
(8, 'Hank', 'Marketing', 75000, 7, '2022-09-15');

CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    project_name VARCHAR(100),
    department VARCHAR(50),
    budget DECIMAL(12,2),
    start_date DATE,
    end_date DATE
);

INSERT INTO projects VALUES
(1, 'ML Pipeline', 'Engineering', 500000, '2024-01-01', '2024-06-30'),
(2, 'Data Lake', 'Engineering', 300000, '2024-03-01', '2024-09-30'),
(3, 'CRM Upgrade', 'Sales', 150000, '2024-02-01', '2024-05-31'),
(4, 'Brand Refresh', 'Marketing', 200000, '2024-01-15', '2024-04-30'),
(5, 'API Gateway', 'Engineering', 250000, '2024-06-01', '2024-12-31');

CREATE TABLE project_assignments (
    emp_id INT,
    project_id INT,
    role VARCHAR(50),
    hours_per_week INT,
    PRIMARY KEY (emp_id, project_id)
);

INSERT INTO project_assignments VALUES
(1, 1, 'Lead', 20), (2, 1, 'Developer', 30), (3, 1, 'Developer', 35),
(1, 2, 'Architect', 15), (3, 2, 'Developer', 10),
(4, 3, 'Lead', 25), (5, 3, 'Analyst', 30),
(7, 4, 'Lead', 20), (8, 4, 'Designer', 35);

Challenge 1: INNER JOIN — Employees and Their Projects

📝
Problem: List all employees with their assigned projects, roles, and weekly hours. Include the project name and budget. Sort by employee name, then project name.

Solution

SELECT
    e.name,
    e.department,
    p.project_name,
    pa.role,
    pa.hours_per_week,
    p.budget
FROM employees e
INNER JOIN project_assignments pa ON e.emp_id = pa.emp_id
INNER JOIN projects p ON pa.project_id = p.project_id
ORDER BY e.name, p.project_name;

-- Result:
-- name    | department  | project_name | role      | hours_per_week | budget
-- --------+-------------+--------------+-----------+----------------+--------
-- Alice   | Engineering | Data Lake    | Architect | 15             | 300000
-- Alice   | Engineering | ML Pipeline  | Lead      | 20             | 500000
-- Bob     | Engineering | ML Pipeline  | Developer | 30             | 500000
-- Charlie | Engineering | Data Lake    | Developer | 10             | 300000
-- Charlie | Engineering | ML Pipeline  | Developer | 35             | 500000
-- Diana   | Sales       | CRM Upgrade  | Lead      | 25             | 150000
-- Eve     | Sales       | CRM Upgrade  | Analyst   | 30             | 150000
-- Grace   | Marketing   | Brand Refresh| Lead      | 20             | 200000
-- Hank    | Marketing   | Brand Refresh| Designer  | 35             | 200000

Key insight: INNER JOIN only returns rows that match in both tables. Frank (emp_id=6) is missing because he has no project assignment. The API Gateway project (project_id=5) is missing because no one is assigned to it.

Challenge 2: LEFT JOIN — Find Unassigned Employees

📝
Problem: Find all employees and their total weekly project hours. Employees with no project assignments should appear with 0 hours. Also show how many projects each employee is on.

Solution

SELECT
    e.name,
    e.department,
    COALESCE(SUM(pa.hours_per_week), 0) AS total_hours,
    COUNT(pa.project_id) AS num_projects
FROM employees e
LEFT JOIN project_assignments pa ON e.emp_id = pa.emp_id
GROUP BY e.emp_id, e.name, e.department
ORDER BY total_hours DESC;

-- Result:
-- name    | department  | total_hours | num_projects
-- --------+-------------+-------------+-------------
-- Charlie | Engineering | 45          | 2
-- Hank    | Marketing   | 35          | 1
-- Alice   | Engineering | 35          | 2
-- Bob     | Engineering | 30          | 1
-- Eve     | Sales       | 30          | 1
-- Diana   | Sales       | 25          | 1
-- Grace   | Marketing   | 20          | 1
-- Frank   | Sales       | 0           | 0

Key insight: Use COUNT(pa.project_id) instead of COUNT(*). COUNT(*) would return 1 for Frank (counting the NULL row from the LEFT JOIN), while COUNT(column) correctly returns 0 because it skips NULLs.

Challenge 3: Self-Join — Employees and Their Managers

📝
Problem: List each employee with their manager's name. For employees who are managers themselves (manager_id is NULL), show 'CEO/Director' as the manager name. Also show the salary difference between each employee and their manager.

Solution

SELECT
    e.name AS employee,
    e.department,
    e.salary AS emp_salary,
    COALESCE(m.name, 'CEO/Director') AS manager,
    COALESCE(m.salary, 0) AS mgr_salary,
    CASE
        WHEN m.salary IS NOT NULL THEN e.salary - m.salary
        ELSE NULL
    END AS salary_diff
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id
ORDER BY e.department, e.salary DESC;

-- Result:
-- employee | department  | emp_salary | manager      | mgr_salary | salary_diff
-- ---------+-------------+------------+--------------+------------+------------
-- Alice    | Engineering | 130000     | CEO/Director | 0          | NULL
-- Bob      | Engineering | 115000     | Alice        | 130000     | -15000
-- Charlie  | Engineering | 105000     | Alice        | 130000     | -25000
-- Grace    | Marketing   | 98000      | CEO/Director | 0          | NULL
-- Hank     | Marketing   | 75000      | Grace        | 98000      | -23000
-- Diana    | Sales       | 95000      | CEO/Director | 0          | NULL
-- Eve      | Sales       | 88000      | Diana        | 95000      | -7000
-- Frank    | Sales       | 82000      | Diana        | 95000      | -13000

Key insight: A self-join joins a table to itself using different aliases (e and m). The LEFT JOIN ensures top-level managers (with NULL manager_id) are included. Self-joins appear in almost every data engineering interview because they test whether you can reason about table relationships within a single table.

Challenge 4: Correlated Subquery — Above-Average Earners per Department

📝
Problem: Find employees who earn more than the average salary in their department. Show the employee name, department, salary, department average, and how much they earn above the average.

Solution

-- Approach 1: Correlated subquery
SELECT
    e.name,
    e.department,
    e.salary,
    (SELECT ROUND(AVG(salary), 2) FROM employees WHERE department = e.department) AS dept_avg,
    ROUND(e.salary - (SELECT AVG(salary) FROM employees WHERE department = e.department), 2) AS above_avg
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = e.department
)
ORDER BY above_avg DESC;

-- Approach 2: Window function (preferred for performance)
SELECT name, department, salary, dept_avg, ROUND(salary - dept_avg, 2) AS above_avg
FROM (
    SELECT *, ROUND(AVG(salary) OVER (PARTITION BY department), 2) AS dept_avg
    FROM employees
) with_avg
WHERE salary > dept_avg
ORDER BY above_avg DESC;

-- Result:
-- name  | department  | salary | dept_avg  | above_avg
-- ------+-------------+--------+-----------+----------
-- Alice | Engineering | 130000 | 116666.67 | 13333.33
-- Diana | Sales       | 95000  | 88333.33  | 6666.67
-- Grace | Marketing   | 98000  | 86500.00  | 11500.00

Performance note: The correlated subquery executes the inner query once per row in the outer query. For large tables, the window function approach is significantly faster because it computes the average in a single pass. Always mention this trade-off in interviews.

Challenge 5: EXISTS — Departments with Active Projects

📝
Problem: Find all employees who belong to departments that have at least one project with a budget over $200,000. Use EXISTS instead of a JOIN.

Solution

SELECT e.name, e.department, e.salary
FROM employees e
WHERE EXISTS (
    SELECT 1
    FROM projects p
    WHERE p.department = e.department
      AND p.budget > 200000
)
ORDER BY e.department, e.salary DESC;

-- Result:
-- name    | department  | salary
-- --------+-------------+--------
-- Alice   | Engineering | 130000
-- Bob     | Engineering | 115000
-- Charlie | Engineering | 105000

-- Alternative using IN (equivalent behavior):
SELECT name, department, salary
FROM employees
WHERE department IN (
    SELECT department FROM projects WHERE budget > 200000
)
ORDER BY department, salary DESC;

Key insight: EXISTS returns TRUE as soon as it finds a matching row — it does not scan the entire subquery. This makes it faster than IN for large subquery result sets. Use EXISTS when you only need to check for existence, not retrieve values from the subquery. In most modern databases, the optimizer treats IN and EXISTS identically, but stating your intent with EXISTS shows interview maturity.

Challenge 6: Anti-Join — Projects with No Assignments

📝
Problem: Find all projects that have no employees assigned to them. Show three different approaches: LEFT JOIN, NOT EXISTS, and NOT IN. Explain when each approach is safest.

Solution

-- Approach 1: LEFT JOIN + IS NULL (most common in interviews)
SELECT p.project_id, p.project_name, p.department, p.budget
FROM projects p
LEFT JOIN project_assignments pa ON p.project_id = pa.project_id
WHERE pa.emp_id IS NULL;

-- Approach 2: NOT EXISTS (safest with NULLs)
SELECT p.project_id, p.project_name, p.department, p.budget
FROM projects p
WHERE NOT EXISTS (
    SELECT 1
    FROM project_assignments pa
    WHERE pa.project_id = p.project_id
);

-- Approach 3: NOT IN (dangerous with NULLs!)
SELECT project_id, project_name, department, budget
FROM projects
WHERE project_id NOT IN (
    SELECT project_id FROM project_assignments
);

-- Result (all three):
-- project_id | project_name | department  | budget
-- -----------+--------------+-------------+--------
-- 5          | API Gateway  | Engineering | 250000
Critical NULL trap with NOT IN: If the subquery SELECT project_id FROM project_assignments returns any NULL values, NOT IN returns zero rows for every outer row. This is because x NOT IN (1, 2, NULL) evaluates to x != 1 AND x != 2 AND x != NULL, and x != NULL is always NULL (unknown), making the entire expression NULL. Use NOT EXISTS or LEFT JOIN to avoid this trap. This is a classic interview gotcha.