dbt Fundamentals
Master the building blocks of every dbt project — models, sources, refs, materializations, seeds, and project structure. These fundamentals make up the largest portion of the certification exam.
What Is dbt?
dbt (data build tool) is a transformation framework that enables analytics engineers to transform raw data in their warehouse using SQL SELECT statements. dbt handles the DDL/DML (CREATE TABLE, INSERT, etc.) automatically based on your configuration, letting you focus on the business logic.
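For example, a model that contains only a SELECT is wrapped by dbt in the appropriate DDL at run time. A simplified sketch of what dbt executes for a view-materialized model (the exact statement varies by warehouse adapter, and the schema names here are illustrative):

```sql
-- Roughly what dbt runs for a view-materialized model named stg_customers
-- (simplified; actual DDL differs by adapter)
CREATE VIEW analytics.stg_customers AS (
    SELECT id AS customer_id, first_name, last_name
    FROM raw_data.customers
);
```

You never write this DDL yourself; changing the model's materialization changes the wrapper, not your SELECT.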
Models
A dbt model is a SQL SELECT statement saved as a .sql file in the models/ directory. When you run dbt run, dbt compiles each model and executes it against your data warehouse, creating a table or view.
-- models/staging/stg_customers.sql
SELECT
    id AS customer_id,
    first_name,
    last_name,
    email,
    created_at
FROM {{ source('raw', 'customers') }}
Model Layers
A well-structured dbt project organizes models into layers:
- Staging (stg_): Light transformations on raw sources — renaming, casting, basic filtering. One staging model per source table.
- Intermediate (int_): Business logic that combines staging models. Not exposed to end users.
- Marts (fct_, dim_): Final business-facing models organized by department (finance, marketing, etc.).
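As a sketch of how the layers connect (the model and column names here are hypothetical):

```sql
-- models/intermediate/int_customer_orders.sql (hypothetical example)
-- Combines two staging models; not exposed to end users
SELECT
    c.customer_id,
    COUNT(o.order_id) AS order_count
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id
```

A mart model would then select from this intermediate model rather than re-implementing the join.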
Sources
Sources define your raw data tables loaded by your EL tool. They are declared in YAML files and referenced using the {{ source() }} function.
# models/staging/_sources.yml
version: 2

sources:
  - name: raw
    database: analytics
    schema: raw_data
    tables:
      - name: customers
        description: Raw customer data from the production database
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: orders
        description: Raw order transactions
Reference a source in a model with {{ source('source_name', 'table_name') }}. Never use hardcoded table names when a source definition exists.
The ref() Function
The {{ ref() }} function is the most important function in dbt. It creates dependencies between models, enabling dbt to build models in the correct order and generate the DAG (Directed Acyclic Graph).
-- models/marts/fct_orders.sql
SELECT
    o.order_id,
    o.customer_id,
    c.first_name,
    c.last_name,
    o.order_date,
    o.amount
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_customers') }} c
    ON o.customer_id = c.customer_id
Materializations
Materializations determine HOW dbt builds a model in your warehouse. This is one of the most heavily tested topics on the exam.
view (default)
Creates a SQL view. Fast to build, always reflects latest data. Use for staging models and lightweight transforms. No storage cost.
table
Creates a physical table using CREATE TABLE AS SELECT. Faster queries but rebuilds entirely on each run. Use for mart models queried frequently.
incremental
Appends or merges only new/changed rows on each run. Fastest option for large tables with append-only or slowly changing data. Typically guarded by an is_incremental() filter so only recent rows are scanned.
ephemeral
Not materialized in the warehouse at all. Injected as a CTE into downstream models. Use for simple reusable logic. Cannot be queried directly.
-- Set materialization in the model file
{{ config(materialized='incremental', unique_key='order_id') }}
SELECT *
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
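The other materializations are configured the same way. A minimal sketch of an ephemeral model (the model name and filter are hypothetical):

```sql
-- models/intermediate/int_active_customers.sql (hypothetical example)
{{ config(materialized='ephemeral') }}

SELECT customer_id
FROM {{ ref('stg_customers') }}
WHERE email IS NOT NULL
```

Any downstream model that refs this one gets its SQL inlined as a CTE; no object is created in the warehouse.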
Seeds
Seeds are CSV files in your seeds/ directory that dbt loads into your warehouse as tables. Use seeds for static reference data like country codes, status mappings, or category lookups.
# Run: dbt seed
# File: seeds/country_codes.csv
# Result: Creates a table named country_codes in your warehouse
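A seed file and its usage might look like this (the file contents and join columns are illustrative):

```
# seeds/country_codes.csv
country_code,country_name
US,United States
GB,United Kingdom
DE,Germany
```

```sql
-- Seeds are referenced in models just like other models, via ref()
SELECT
    o.order_id,
    cc.country_name
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('country_codes') }} cc
    ON o.country_code = cc.country_code
```

Because seeds live in Git, changes to reference data are reviewed and versioned like any other code change.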
Project Structure
my_dbt_project/
  dbt_project.yml            # Project configuration
  profiles.yml               # Connection profiles (dbt Core only; often kept in ~/.dbt/ instead)
  models/
    staging/                 # Staging models (1:1 with sources)
      _sources.yml           # Source definitions
      _stg_models.yml        # Schema tests for staging
      stg_customers.sql
      stg_orders.sql
    intermediate/            # Business logic
      int_order_items.sql
    marts/                   # Final business models
      finance/
        fct_revenue.sql
      marketing/
        dim_customers.sql
  tests/                     # Custom data tests
  macros/                    # Reusable Jinja macros
  seeds/                     # CSV reference data
  snapshots/                 # SCD Type 2 snapshots
  analyses/                  # Ad-hoc SQL analyses
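Folder-level defaults, including materializations, are set in dbt_project.yml. A minimal sketch for the structure above (the project and profile names are placeholders):

```yaml
# dbt_project.yml (minimal sketch)
name: my_dbt_project
version: '1.0.0'
profile: my_dbt_project

model-paths: ["models"]
seed-paths: ["seeds"]

models:
  my_dbt_project:
    staging:
      +materialized: view    # staging defaults to views
    marts:
      +materialized: table   # marts default to tables
```

Individual models can still override these defaults with an in-file {{ config() }} block, which takes precedence over project-level settings.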
Practice Questions
Question 1
A teammate writes stg_orders.sql so that it references a raw table using a hardcoded schema and table name instead of the {{ source() }} function. What problem does this create?
A) The model will fail to compile
B) dbt cannot track source freshness or lineage for that table
C) The model will be materialized as ephemeral by default
D) The model cannot be tested
Answer: B — Without a source definition, dbt cannot track lineage (the table will not appear in the DAG), cannot monitor source freshness, and cannot generate proper documentation. The model will still compile and run, but you lose important governance features.
Question 2
A 500-million-row orders table receives new rows throughout the day and must support fast downstream queries. Which materialization is most appropriate?
A) view
B) table
C) incremental
D) ephemeral
Answer: C — Incremental materialization processes only new or changed rows on each run, making it ideal for large tables with regular inserts. A table materialization would rebuild all 500M rows every run. A view would be too slow for queries. Ephemeral cannot be queried directly.
Question 3
What is the purpose of the {{ ref() }} function in dbt?
A) It references external APIs for data loading
B) It creates dependencies between models, enabling dbt to build the DAG and resolve execution order
C) It references Python functions for custom transformations
D) It creates foreign key constraints in the warehouse
Answer: B — The ref() function is the foundation of dbt's dependency management. It tells dbt which models depend on which, allowing dbt to build the DAG and execute models in the correct order. It also handles environment-specific schema resolution.
Question 4
Which materialization is never built as an object in the warehouse and is instead injected as a CTE into downstream models?
A) view
B) table
C) incremental
D) ephemeral
Answer: D — Ephemeral models are not materialized in the warehouse. They are injected as CTEs into any downstream model that references them. This is ideal for simple reusable logic that does not need to be queried directly.
Question 5
Which of the following is an appropriate use case for a dbt seed?
A) Loading 10 million customer records from a production database
B) Loading a 50-row CSV of country codes for reference lookups
C) Streaming real-time event data into the warehouse
D) Extracting data from a REST API
Answer: B — Seeds are designed for small, static reference data like country codes, status mappings, or category lookups. They are CSV files version-controlled in Git. Large datasets (A) should use EL tools. Real-time streaming (C) and API extraction (D) are outside dbt's scope.
Lilly Tech Systems