Intermediate

dbt Fundamentals

Master the building blocks of every dbt project — models, sources, refs, materializations, seeds, and project structure. These fundamentals make up the largest portion of the certification exam.

What Is dbt?

dbt (data build tool) is a transformation framework that enables analytics engineers to transform raw data in their warehouse using SQL SELECT statements. dbt handles the DDL/DML (CREATE TABLE, INSERT, etc.) automatically based on your configuration, letting you focus on the business logic.

💡
Exam focus: The exam tests your understanding of dbt's role in the modern data stack: dbt transforms data that is already loaded in the warehouse. It does NOT extract or load data (that is the job of tools like Fivetran, Airbyte, or Stitch).

Models

A dbt model is a SQL SELECT statement saved as a .sql file in the models/ directory. When you run dbt run, dbt compiles each model and executes it against your data warehouse, creating a table or view.

-- models/staging/stg_customers.sql
SELECT
    id AS customer_id,
    first_name,
    last_name,
    email,
    created_at
FROM {{ source('raw', 'customers') }}

Model Layers

A well-structured dbt project organizes models into layers:

  • Staging (stg_): Light transformations on raw sources — renaming, casting, basic filtering. One staging model per source table.
  • Intermediate (int_): Business logic that combines staging models. Not exposed to end users.
  • Marts (fct_, dim_): Final business-facing models organized by department (finance, marketing, etc.).
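
As a sketch of how the layers fit together (model and column names here are hypothetical), an intermediate model might combine staging models before anything reaches a mart:

-- models/intermediate/int_customer_orders.sql (illustrative)
SELECT
    c.customer_id,
    COUNT(o.order_id) AS order_count
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id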

Sources

Sources declare the raw tables that your EL tool loads into the warehouse. They are defined in YAML files and referenced with the {{ source() }} function.

# models/staging/_sources.yml
version: 2
sources:
  - name: raw
    database: analytics
    schema: raw_data
    tables:
      - name: customers
        description: Raw customer data from the production database
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: orders
        description: Raw order transactions

⚠️
Common exam trap: Sources are NOT models. They represent raw data loaded by external tools. You declare them in YAML and reference them with {{ source('source_name', 'table_name') }}. Never hardcode a table name when a source definition exists.
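
To make this concrete, here is the same reference written both ways (database, schema, and table names are illustrative):

-- Avoid: hardcoded reference — invisible to dbt's lineage and freshness checks
SELECT * FROM analytics.raw_data.customers

-- Prefer: source() reference — tracked in the DAG, docs, and freshness reports
SELECT * FROM {{ source('raw', 'customers') }}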

The ref() Function

The {{ ref() }} function is the most important function in dbt. It creates dependencies between models, enabling dbt to build models in the correct order and generate the DAG (Directed Acyclic Graph).

-- models/marts/fct_orders.sql
SELECT
    o.order_id,
    o.customer_id,
    c.first_name,
    c.last_name,
    o.order_date,
    o.amount
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c
    ON o.customer_id = c.customer_id
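
At compile time, ref() resolves each model to a fully qualified, environment-specific relation. A sketch of what the FROM clause above might compile to in a development schema (database and schema names are hypothetical):

-- compiled SQL (illustrative)
FROM analytics.dbt_dev.stg_orders o
JOIN analytics.dbt_dev.stg_customers c
    ON o.customer_id = c.customer_id

Because the schema is resolved at compile time, the same model code runs unchanged in dev and prod targets.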

Materializations

Materializations determine HOW dbt builds a model in your warehouse. This is one of the most heavily tested topics on the exam.

view (default)

Creates a SQL view. Fast to build, always reflects latest data. Use for staging models and lightweight transforms. No storage cost.

table

Creates a physical table using CREATE TABLE AS SELECT. Faster queries but rebuilds entirely on each run. Use for mart models queried frequently.

incremental

Appends or merges only new/changed rows after the initial build. Fastest for large tables with append-only or slowly changing data. Typically uses an is_incremental() filter so only new rows are processed; the first run (and any run with --full-refresh) builds the whole table.

ephemeral

Not materialized in the warehouse at all. Injected as a CTE into downstream models. Use for simple reusable logic. Cannot be queried directly.

-- Set materialization in the model file
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT *
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
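
Materializations can also be set per folder in dbt_project.yml rather than per model file; a minimal sketch (assuming the project is named my_dbt_project):

# dbt_project.yml
models:
  my_dbt_project:
    staging:
      +materialized: view
    marts:
      +materialized: table

A config() block in a model file overrides the folder-level setting, so the incremental example above would keep its materialization either way.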

Seeds

Seeds are CSV files in your seeds/ directory that dbt loads into your warehouse as tables. Use seeds for static reference data like country codes, status mappings, or category lookups.

# Run: dbt seed
# File: seeds/country_codes.csv
# Result: Creates a table named country_codes in your warehouse

📚
Exam note: Seeds are for small, static datasets (typically under 1,000 rows). They are version-controlled in Git. Do NOT use seeds for large datasets or frequently changing data — use sources and EL tools for that.
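
By default dbt infers seed column types from the CSV contents; they can be pinned in dbt_project.yml if inference guesses wrong (project and column names here are illustrative):

# dbt_project.yml
seeds:
  my_dbt_project:
    country_codes:
      +column_types:
        country_code: varchar(2)
        country_name: varchar(100)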

Project Structure

my_dbt_project/
  dbt_project.yml        # Project configuration
  profiles.yml           # Connection profiles (dbt Core only; by default kept in ~/.dbt/, not the project)
  models/
    staging/             # Staging models (1:1 with sources)
      _sources.yml       # Source definitions
      _stg_models.yml    # Schema tests for staging
      stg_customers.sql
      stg_orders.sql
    intermediate/        # Business logic
      int_order_items.sql
    marts/               # Final business models
      finance/
        fct_revenue.sql
      marketing/
        dim_customers.sql
  tests/                 # Custom data tests
  macros/                # Reusable Jinja macros
  seeds/                 # CSV reference data
  snapshots/             # SCD Type 2 snapshots
  analyses/              # Ad-hoc SQL analyses
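
For reference, a minimal dbt_project.yml tying the structure above together (values are illustrative):

# dbt_project.yml
name: my_dbt_project
version: "1.0.0"
config-version: 2
profile: my_dbt_project
model-paths: ["models"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]
analysis-paths: ["analyses"]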

Practice Questions

Question 1

Q1
A dbt project has a model stg_orders.sql that references a raw table using a hardcoded schema and table name instead of the {{ source() }} function. What problem does this create?

A) The model will fail to compile
B) dbt cannot track source freshness or lineage for that table
C) The model will be materialized as ephemeral by default
D) The model cannot be tested

Answer: B — Without a source definition, dbt cannot track lineage (the table will not appear in the DAG), cannot monitor source freshness, and cannot generate proper documentation. The model will still compile and run, but you lose important governance features.

Question 2

Q2
An analytics engineer needs to transform a 500-million-row fact table that receives 100,000 new rows daily. Which materialization is most appropriate?

A) view
B) table
C) incremental
D) ephemeral

Answer: C — Incremental materialization processes only new or changed rows on each run, making it ideal for large tables with regular inserts. A table materialization would rebuild all 500M rows every run. A view would be too slow for queries. Ephemeral cannot be queried directly.

Question 3

Q3
What is the purpose of the {{ ref() }} function in dbt?

A) It references external APIs for data loading
B) It creates dependencies between models, enabling dbt to build the DAG and resolve execution order
C) It references Python functions for custom transformations
D) It creates foreign key constraints in the warehouse

Answer: B — The ref() function is the foundation of dbt's dependency management. It tells dbt which models depend on which, allowing dbt to build the DAG and execute models in the correct order. It also handles environment-specific schema resolution.

Question 4

Q4
Which materialization should be used for a simple calculation that is referenced by multiple downstream models but should NOT exist as a table or view in the warehouse?

A) view
B) table
C) incremental
D) ephemeral

Answer: D — Ephemeral models are not materialized in the warehouse. They are injected as CTEs into any downstream model that references them. This is ideal for simple reusable logic that does not need to be queried directly.

Question 5

Q5
Which of the following is a valid use case for dbt seeds?

A) Loading 10 million customer records from a production database
B) Loading a 50-row CSV of country codes for reference lookups
C) Streaming real-time event data into the warehouse
D) Extracting data from a REST API

Answer: B — Seeds are designed for small, static reference data like country codes, status mappings, or category lookups. They are CSV files version-controlled in Git. Large datasets (A) should use EL tools. Real-time streaming (C) and API extraction (D) are outside dbt's scope.