One thing I really like about dbt is Jinja. The fact that there is a built-in templating language, complete with modern control structures like if-then-else and for loops, makes writing SQL more meta than ever before.
Today, I want to talk about bridging the divide between dbt variables defined in your project YAML file, and the (SQL) model files where you want to use them.
(By the way, this blog post was inspired by a question on the dbt
Slack org. If you’re a dbt nerd and want to help out the
community, spend a few hours in #dbt-advice-help
).
To define a variable, we place it in a stanza in the dbt_project.yml file at the root of the project code base. For example:
# dbt_project.yml
name: demo
profile: demo
version: 1.0.0
config-version: 2
vars:
cutoff_date: 2020-07-01
Here we define a variables, cutoff_date
, which is actually a string. We can use that to interrogate our customer list to find anyone who joined after June of 2020:
-- new_customers.sql
with
customers as (select * from {{ ref('customers') }}),
events as (select * from {{ ref('events') }}
where kind = 'JOIN')
select
customers.id as customer_id,
customers.name as customer_name,
customers.status as customer_status
events.date_of as customer_join_date
from
customers inner join events
on events.customer_id = customers.id
where
events.date_of >= {{ var('cutoff_date') }}
If we later decide to change the cutoff date for “new” customers, we can change the YAML file and re-run the model.
Fast-forward a few months, and our stakeholder has asked that we build a similar customer filter based on a handful of statuses. If a customer is a GOLD, PREMIERE, ELITE, or ELITE2 customer, we need to consider them an upper echelon customer, and pull them into a new export or reporting table.
We can easily embed this data in the model query:
-- upper_echelon_customers.sql
with
customers as (select * from {{ ref('new_customers') }})
select *
from
customers
where
customer_status IN (
'GOLD',
'PREMIERE',
'ELITE',
'ELITE2'
)
This works great if this is the one and only place we need to specify the list of status codes that qualify. As soon as we need to re-use that list, we’re better served by abstracting that list away. We can either use a table in the database (which is stock standard, and I won’t be covering it here) or we can more dbt variables.
Here’s a first attempt:
# dbt_project.yml
vars:
cutoff_date: 2020-07-01
is_upper_echelon: GOLD
is_also_upper_echelon: PREMIERE
also_upper_echelon: ELITE
its_upper_echelon_too: ELITE2
And the associated model query:
-- upper_echelon_customers.sql
with
customers as (select * from {{ ref('new_customers') }})
select *
from
customers
where
customer_status IN (
{{ var('is_upper_echelon') }},
{{ var('is_also_upper_echelon') }},
{{ var('also_upper_echelon') }},
{{ var('is_upper_echelon_too') }}
)
This works, for very narrowly-constrained values of “works”. For starters, it doesn’t handle having additional qualifying status codes. It also doesn’t take too kindly to removal of one of the four currently qualifying statuses.
Remembering that Jinja is just templating the SQL output before it gets handed off to the warehouse for execution, we could cheat and embed a fragment of SQL into the dbt variable definition, like this:
# dbt_project.yml
vars:
cutoff_date: 2020-07-01
upper_echelon: "('GOLD', 'PREMIERE', 'ELITE', 'ELITE2')"
And the associated model query:
-- upper_echelon_customers.sql
with
customers as (select * from {{ ref('new_customers') }})
select *
from
customers
where
customer_status IN {{ var('upper_echelon') }}
This also works, but it does limit you to using the list of states as part of an IN
or NOT IN
predicate.
What we really want to do is specify a list in the YAML project file and consume a list in our model query. We can do the first part easily enough using YAML list syntax:
# dbt_project.yml
vars:
cutoff_date: 2020-07-01
upper_echelon:
- GOLD
- PREMIERE
- ELITE
- ELITE2
This will translate from YAML into Python (in Jinja) as a list of four strings. To generate SQL using those, we turn to the foreach
construct in Jinja:
-- upper_echelon_customers.sql
with
customers as (select * from {{ ref('new_customers') }})
select *
from
customers
where
customer_status IN (
{% foreach code in var('upper_echelon') %}
'{{ code }}' {% if not loop.last %},{% end %}
{% end %}
)
This is a bit more involved than our previous efforts. Essentially what we’re doing is iterating over the list (in the context of Jinja, run by dbt) and printing out each status code literal and commas as we need them. The {% foreach v in l %}…{% end %}
is the iterating engine. The {% if not loop last %},{% end %}
bit gets us a comma on each iteration of the loop, except for the last one.
This is a far sight more flexible. We can use it to build out custom columns:
with
revenue as (select * from {{ ref('aggr_rev_by_year_by_cust') }})
select
year,
{% foreach status in var('upper_echelon') %}}
is_{{ status }}_status{% if not loop.last %},{% end %}
{% end %}
from revenue
The most important thing to keep in mind when dealing with variable usage beyond substitution is that context matters. If you can keep track of when you are in Jinja-land and when you are in SQL-land, things start to make a whole lot more sense, and you unlock a lot more power.
Happy Hacking!