So far, we’ve spun up a lab environment, landed and de-duplicated our data, used Postgres’ JSONB capabilities to extract fields, and staged all of it into an emerging model with dbt. Our understanding of the current Scryfall data schema is taking shape, and we’re fairly confident in it. Today, we’re going to talk about the future. In particular, we’re going to discuss how that schema might change, and what we might do about it with dbt.
In this series on exploring data, we use dbt to plumb the depths of a semi-structured JSON data set with the help of PostgreSQL’s JSONB datatype and related functions and operators. Throughout the series we tackle everything from simple field cardinality, to future-proofing ingestion pipelines, to multi-value columnar re-assembly.
If you found us via search, and want a little background and context, you may want to read the whole series from the beginning:
- Loading & Landing where we get the lab spun up, load the data, and do some clean up.
- Extracting JSON Fields in which we look at what’s in those JSON data blobs.
- Anticipating the Future by writing tests to detect new codes and schema changes.
- Re-Normalizing Data to pull a new model — card sets — out of card data.
Change is Inevitable, not Impossible
Things change. In computing, there are two types of changes: breaking changes and horribly breaking changes.
A breaking change is one that causes the computer system itself to stop. In data, this often means data type mismatches. We expected a number and we got a bit of text. We were told we’d always get 19 columns in the import file, but this time we got 20. Or 18. Or 1.
Breaking changes are great because they are noisy. They jam up the machinery and make a terrible racket. People notice them. And they get fixed.
Horribly breaking changes are far more insidious. Dollar values get changed into pennies, but remain numeric. First and last name columns are flipped. A numeric field is zeroed out because it is now deprecated in the upstream system.
These types of changes are so much worse precisely because they don’t make any real noise. Our data ingestion pipelines, transformation routines, and reporting systems happily ingest, modify, update, and aggregate these values, oblivious to the fact that nothing is correct any more.
With what we’ve done so far staging this Scryfall data, we’ve now got two opportunities for horribly breaking changes:
- The color_identity array could have an unrecognized value
- Wizards of the Coast could introduce a new rarity_code
In the first case, our model becomes less accurate. Let’s say a new color gets introduced: purple. If we are building aggregate fields like is_colorless based on the current five-color wheel, purple breaks this without much complaint. All mono-purple cards would come up in the search as colorless!
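For illustration, here is a minimal sketch of what such a fragile is_colorless definition might look like against the raw card JSON we have been staging. The card column and our_stg.raw_cards come from earlier in the series; the is_colorless column itself is hypothetical, not something Scryfall provides:

select
    card->>'name' as card_name,
    -- fragile: "colorless" here means "none of the five colors we know about";
    -- a hypothetical sixth color would silently be reported as colorless
    not (card->'color_identity' ?| array['W', 'U', 'B', 'R', 'G']) as is_colorless
from our_stg.raw_cards

Nothing in that expression errors out when a new color code arrives; the result is simply, quietly, wrong.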
In the second case, our data model is intact, but the contract with our customers, the players, is broken. No doubt when we handed this data set over to them for querying, we furnished them with documentation. Here’s what this field means, here are the possible values for rarity, etc. The silent introduction of a new rarity breaks our documentation. Everything else works. This is bad.
What can we do? We can catch these issues before we ingest the problematic data. (Note: just because it’s correct doesn’t mean it’s not problematic)
dbt lets us do that by providing tests. The simplest of these are called column tests, and they apply to a single column in a single table. They get defined in a YAML file in the model directory. We’ll put ours at models/staging/schema.yml. Here’s an assertion that our staged rarity codes will only ever be from a fixed list of possible values:
version: 2

models:
  - name: staged_cards
    columns:
      - name: rarity_code
        tests:
          - accepted_values:
              values:
                - common
                - uncommon
                - rare
                - mythic
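Under the hood, dbt compiles this YAML into ordinary SQL. The exact query lands under target/compiled/ after a run, but it amounts to something roughly like this sketch: gather the distinct rarity codes and keep only those outside the accepted list.

with all_values as (
    select rarity_code as value_field, count(*) as n_records
    from our_stg.staged_cards
    group by rarity_code
)
select *
from all_values
where value_field not in ('common', 'uncommon', 'rare', 'mythic')

Any row that survives that WHERE clause counts as a test failure.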
If we add that test in now, and issue a dbt test, we’ll get a failure:
→ dbt test
17:40:06 Running with dbt=1.1.0
17:40:06 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging
17:40:06 Found 3 models, 1 test, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
17:40:06
17:40:06 Concurrency: 4 threads (target='dev')
17:40:06
17:40:06 1 of 1 START test accepted_values_staged_cards_rarity_code__common__uncommon__rare__mythic [RUN]
17:40:06 1 of 1 FAIL 1 accepted_values_staged_cards_rarity_code__common__uncommon__rare__mythic [FAIL 1 in 0.05s]
17:40:06
17:40:06 Finished running 1 test in 0.20s.
17:40:06
17:40:06 Completed with 1 error and 0 warnings:
17:40:06
17:40:06 Failure in test accepted_values_staged_cards_rarity_code__common__uncommon__rare__mythic (models/staging/schema.yml)
17:40:06 Got 1 result, configured to fail if != 0
17:40:06
17:40:06 compiled SQL at target/compiled/scryfall/models/staging/schema.yml/accepted_values_staged_cards_2e6844329b8e66cc535e125b1e35f685.sql
17:40:06
17:40:06 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
I assure you that these are the only officially-sanctioned rarities for printed Magic: the Gathering playing cards. That said, Scryfall has apparently tagged some promotional cards with a rarity code of special. If we want to see which cards, we can run the tests and have dbt store the failures in a table:
→ dbt test --store-failures
17:40:52 Running with dbt=1.1.0
17:40:53 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging
17:40:53 Found 3 models, 1 test, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
17:40:53
17:40:53 Concurrency: 4 threads (target='dev')
17:40:53
17:40:53 1 of 1 START test accepted_values_staged_cards_rarity_code__common__uncommon__rare__mythic [RUN]
17:40:53 1 of 1 FAIL 1 accepted_values_staged_cards_rarity_code__common__uncommon__rare__mythic [FAIL 1 in 0.10s]
17:40:53
17:40:53 Finished running 1 test in 0.28s.
17:40:53
17:40:53 Completed with 1 error and 0 warnings:
17:40:53
17:40:53 Failure in test accepted_values_staged_cards_rarity_code__common__uncommon__rare__mythic (models/staging/schema.yml)
17:40:53 Got 1 result, configured to fail if != 0
17:40:53
17:40:53 compiled SQL at target/compiled/scryfall/models/staging/schema.yml/accepted_values_staged_cards_2e6844329b8e66cc535e125b1e35f685.sql
17:40:53
17:40:53 See test failures:
-------------------------------------------------------------------------------------------------------------
select * from "explore"."our_dbt_test__audit"."accepted_values_staged_cards_2e6844329b8e66cc535e125b1e35f685"
-------------------------------------------------------------------------------------------------------------
17:40:53
17:40:53 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
When we run the query that dbt prints out, we can see the offending value:
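We get one row back: the rarity code special (dbt’s log already told us to expect exactly one failing result). If we also want to see which cards carry that rarity, a quick ad-hoc query against the staged model does the trick. This is just a sketch; it assumes our staged_cards model exposes a card name column alongside rarity_code, so adjust the column names to whatever your staging model actually produces:

-- ad-hoc follow-up query (not part of the dbt project)
-- assumes staged_cards has a name column next to rarity_code
select name, rarity_code
from our_stg.staged_cards
where rarity_code not in ('common', 'uncommon', 'rare', 'mythic')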

To remedy this, we first check the Scryfall documentation to validate that “special” is a real value, and then we can add this to the list of accepted_values and re-run dbt test until it is green.
It’s important to note that dbt created a special schema for storing errors. This is nice because it keeps dbt’s tables out of our way; we don’t have to worry so much about object naming collisions. As such, we can globally enable the storage of test failures by adding this to the end of our dbt_project.yml:
# rest of dbt_project.yml here ...

tests:
  +store_failures: true
The second kind of test is the schema test. These exist in the tests/ directory, as SQL files, and can assert pretty much anything you can dream up. If the query you write returns ANY results at all, the test is considered a failure. That means you can literally select for the erroneous circumstances and let dbt figure out what to do next.
This is how we can verify that the raw color_identity attribute is only ever a list of the five known color codes:
with found_colors as (
    select distinct
        jsonb_array_elements_text(card->'color_identity') as color
    from {{ ref('raw_cards') }}
)
select color
from found_colors
where color not in ('W', 'U', 'B', 'R', 'G')
As soon as that query pops out a record, we’ve found a color we didn’t know about. dbt will halt the transformation pipeline and let us know (via our CI/CD tooling of choice). This turns a horribly breaking change into a (handleable) breaking change. Put that SQL into tests/staging/valid_colors.sql for safe-keeping.
As we progress through the rest of our Scryfall data, we’ll be on the lookout for tests we can write. Wherever possible, if you find yourself saying “that will never happen,” go ahead and write the test. If you are correct and it doesn’t ever happen, then the tests will never fail. The minute they do, however, you’ve caught a problem earlier and have a good chance to fix it.
Next up, we’re going to look at expanding enumerated, single-value fields into lots of Boolean columns. This pattern pops up often, both in the Scryfall data and in the wild, and we’ve got a few tricks up our sleeves that will prove extremely handy.
Expanding multi-value fields into booleans
In the game of twenty questions, one player thinks of a thing and the remaining players are permitted to ask up to 20 questions about that thing. Each question must be Boolean–that is to say the answers come from the set {yes, no}. The question space itself is vast. One could ask “is it smaller than a loaf of bread?” and thereby partition the space of all things in twain. One could also ask “is its name ‘Frances McGillicutty’?”, although that would admittedly be a waste of a question.
Surprisingly, starting with broad partitioning questions (like “is it an animal?”) we very quickly narrow down the answer space. With perfectly balanced partitioning questions, this begins to resemble a binary search algorithm in a numeric space. Given a number between 1 and 100 (let’s assume I pick 72), I need only ask the following:
- Is it greater than 50? Yes
- Is it greater than 75? No
- Is it greater than 62? Yes
- Is it greater than 68? Yes
- Is it greater than 72? No
- Is it greater than 70? Yes
- Is it greater than 71? Yes
The answer, of course, is 72. Binary decisions are quite powerful and compact. They are also easy for human minds to reason about. We are seemingly hard-wired to classify the world.
What does this have to do with Scryfall data? Consider this query:
select distinct card->'legalities'
from our_stg.raw_cards
The legalities column is an expandable sub-object that tracks what formats each card is legal in. While the core rules of Magic: the Gathering are universal, there are a number of different formats that constrain the overall pool of cards that can be used when building your deck. For example, Standard format rotates older cards out to keep the tournament scene from stagnating around historically powerful deck types. At any given point in time, Standard only includes cards that were printed in the last two years. Modern, on the other hand, allows all cards printed from 8th Edition onward (circa 2003).
It is perfectly normal for a card to be legal in multiple formats simultaneously. Each format also has its own unique list of cards that have been explicitly banned, often due to power imbalances that ruin the fun of playing semi-competitively. To that end, Scryfall tracks the legality on a card-by-card basis.
From an ergonomic perspective, our player-customers would like to be able to query cards based on legality. Using our database to find cards based on color identity, mechanics, and keywords is pointless if we cannot restrict the results to cards legal to our format of choice. To that end, we need to convert this legality object into a number of predicate columns like legal_in_standard and legal_in_modern.
Here’s how we do that:
select
    -- other fields from our staged_cards query...

    -- legalities
    card->'legalities'->>'standard' = 'legal' as legal_in_standard
from
    {{ ref('raw_cards') }}
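One small wrinkle worth knowing: in PostgreSQL, if the 'standard' key were ever missing, the ->> operator returns NULL and the comparison yields NULL rather than false. If you want a strictly true/false column, a defensive variant might wrap the comparison in coalesce(). This is only a sketch, not something the Scryfall data currently requires:

select
    -- defensive variant: a missing key becomes false instead of NULL
    coalesce(card->'legalities'->>'standard' = 'legal', false) as legal_in_standard
from
    {{ ref('raw_cards') }}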
We can easily figure out which formats Scryfall has data for with a jsonb_object_keys() query in a distinct select:
select distinct
    jsonb_object_keys(card->'legalities') as format
from
    our_stg.raw_cards

We have 19 known formats. For completeness’ sake, let’s model them all. For sanity’s sake, let’s not write the SQL by hand.
Instead, we’ll start relying on more of the Python side of dbt, by way of Jinja templating. SQL lacks an imperative for loop, but Python has one, and with dbt we can use it:
select
    -- other fields from our staged_cards query...

    -- legalities
    {% for format in ['standard'] %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }}
    {% endfor %}
from
    {{ ref('raw_cards') }}
This is functionally equivalent to our former model query, but is far more extensible. Let’s break it down bit by bit.
The {% and %} symbols start Jinja’s statement mode where we can write Python-esque code that alters flow control–if/then/else conditionals, looping constructs, etc. We’ve already seen the {{ and }} symbols; those delineate expressions which are evaluated and then inserted into the final output.
Here we’re using both. First we open a for loop, assigning each successive value of our list ['standard'] to the template variable named format. Everything from the {% for … in … %} statement until the {% endfor %} statement is considered the body of the loop. This will be evaluated once for each iteration of the loop, and the value of the format variable will change as we progress through the list.
It’s worth pointing out that in the body of the loop, we use {{ format }} twice: once to pull the legality value out of our card JSON object, and once to dynamically build the column name. This is a powerful abstraction indeed. We should definitely note here that Jinja is a textual templating language–it knows absolutely nothing about SQL or its syntax. That often works to our advantage, letting us paste together fixed strings like legal_in and variables to get new columns. That’s also why we still have to use single-quotes when we dereference the legalities JSON sub-object; Jinja will only emit the string “standard”, but PostgreSQL needs the surrounding quotes.
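For the single-item list, the text Jinja emits (which you can inspect under target/compiled/ after a dbt compile or dbt run) comes out roughly like the sketch below, with the ref() call resolved to whatever relation name your target schema produces:

select
    -- other fields from our staged_cards query...

    -- legalities
    card->'legalities'->>'standard' = 'legal'
        as legal_in_standard
from
    "explore"."our_stg"."raw_cards"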
As we said, this model and its previous iteration are functionally equivalent. They both produce a new table with a single legality column, legal_in_standard. Let’s add a second format to our list:
select
    -- other fields from our staged_cards query...

    -- legalities
    {% for format in ['standard', 'modern'] %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }}
    {% endfor %}
from
    {{ ref('raw_cards') }}
We could add more, but in the interest of getting to the scene of the accident quickly and efficiently, let’s run the model.
$ dbt run
18:55:43 Running with dbt=1.1.0
18:55:43 Unable to do partial parsing because a project config has changed
18:55:43 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging
18:55:43 Found 3 models, 1 test, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
18:55:43
18:55:44 Concurrency: 4 threads (target='dev')
18:55:44
18:55:44 1 of 3 START table model our_stg.scryfall_data ................................. [RUN]
18:56:03 1 of 3 OK created table model our_stg.scryfall_data ............................ [SELECT 378 in 19.76s]
18:56:03 2 of 3 START view model our_stg.raw_cards ...................................... [RUN]
18:56:04 2 of 3 OK created view model our_stg.raw_cards ................................. [CREATE VIEW in 0.18s]
18:56:04 3 of 3 START table model our_stg.staged_cards .................................. [RUN]
18:56:04 3 of 3 ERROR creating table model our_stg.staged_cards ......................... [ERROR in 0.04s]
18:56:04
18:56:04 Finished running 2 table models, 1 view model in 20.20s.
18:56:04
18:56:04 Completed with 1 error and 0 warnings:
18:56:04
18:56:04 Database Error in model staged_cards (models/staging/staged_cards.sql)
18:56:04 syntax error at or near "card"
18:56:04 LINE 27: card->'legalities'->>'modern' = 'legal'
18:56:04 ^
18:56:04 compiled SQL at target/run/scryfall/models/staging/staged_cards.sql
18:56:04
18:56:04 Done. PASS=2 WARN=0 ERROR=1 SKIP=0 TOTAL=3
Remember: Jinja knows nothing of SQL. When we were looping over a list of one item, at the end of our column selection sub-clause, we didn’t need any commas. Now that we are looping over two items, we need those field delimiters. Adding that comma back in seems like the correct course of action:
select
    -- other fields from our staged_cards query...

    -- legalities
    {% for format in ['standard', 'modern'] %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }}
    ,
    {% endfor %}
from
    {{ ref('raw_cards') }}
Two steps forward, two steps back. Instead of having too few commas, we now have too many!
$ dbt run
18:56:51 Running with dbt=1.1.0
18:56:51 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging
18:56:51 Found 3 models, 1 test, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
18:56:51
18:56:51 Concurrency: 4 threads (target='dev')
18:56:51
18:56:51 1 of 3 START table model our_stg.scryfall_data ................................. [RUN]
18:57:11 1 of 3 OK created table model our_stg.scryfall_data ............................ [SELECT 378 in 19.57s]
18:57:11 2 of 3 START view model our_stg.raw_cards ...................................... [RUN]
18:57:11 2 of 3 OK created view model our_stg.raw_cards ................................. [CREATE VIEW in 0.18s]
18:57:11 3 of 3 START table model our_stg.staged_cards .................................. [RUN]
18:57:11 3 of 3 ERROR creating table model our_stg.staged_cards ......................... [ERROR in 0.04s]
18:57:11
18:57:11 Finished running 2 table models, 1 view model in 20.09s.
18:57:11
18:57:11 Completed with 1 error and 0 warnings:
18:57:11
18:57:11 Database Error in model staged_cards (models/staging/staged_cards.sql)
18:57:11 syntax error at or near "from"
18:57:11 LINE 32: from
18:57:11 ^
18:57:11 compiled SQL at target/run/scryfall/models/staging/staged_cards.sql
18:57:11
18:57:11 Done. PASS=2 WARN=0 ERROR=1 SKIP=0 TOTAL=3
The quick and dirty solution is to rearrange some columns so that we always need the trailing comma. If you want to explore that on your own, be my guest.
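If you’d rather not, here is roughly what that rearrangement could look like: generate the legality columns first, let every iteration end in a comma, and close the select with a fixed column (the card’s name serves as a stand-in here) so the final comma always has something to delimit.

select
    -- generated columns first; every iteration emits a trailing comma...
    {% for format in ['standard', 'modern'] %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }},
    {% endfor %}

    -- ...which works because a fixed column always follows the loop
    card->>'name' as card_name
from
    {{ ref('raw_cards') }}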
A more robust solution, however, would take into account precisely where in the list we are, and selectively emit the comma as needed. That’s what the loop.last construct is for:
select
    -- other fields from our staged_cards query...

    -- legalities
    {% for format in ['standard', 'modern'] %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }}
    {% if not loop.last %},{% endif %}
    {% endfor %}
from
    {{ ref('raw_cards') }}
When we are on “standard” (the first item of two), loop.last is false, not loop.last is true, and the {% if … %} conditional fires. We get a comma.
When we get to “modern” (the last item of two), loop.last is true, not loop.last is false, and we skip the if body. No comma, no problem.
(Note: I find it advisable to include this comma logic in every loop body that assembles a variable number of fields; if you have multiple, you can separate each for loop from the other fields with a comma after the {% endfor %} statement. This seemingly convoluted dance avoids more dumb comma-usage errors than you’d think!)
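Here is a rough sketch of that multi-loop shape. The second loop and its finishes field are placeholders meant only to show where the commas land: one inside each loop via loop.last, and one after each {% endfor %} to separate the generated groups.

select
    {% for format in ['standard', 'modern'] %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }}
    {% if not loop.last %},{% endif %}
    {% endfor %},

    {% for finish in ['foil', 'nonfoil'] %}
    -- 'finishes' here is a placeholder example, not one of our staged fields
    card->'finishes' ? '{{ finish }}' as has_{{ finish }}
    {% if not loop.last %},{% endif %}
    {% endfor %},

    card->>'name' as card_name
from
    {{ ref('raw_cards') }}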
We are now in a position to fill out the rest of that list. Jinja isn’t picky about white space within the {% and %} statement delimiters, so we can format our list with each item on its own line. Remember: you’ll be reading this code many more times than you will write it.
select
    -- other fields from our staged_cards query...

    -- legalities
    {% for format in [
        'standard',
        'modern',
        -- add more formats here as needed...
    ] %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }}
    {% if not loop.last %},{% endif %}
    {% endfor %}
from
    {{ ref('raw_cards') }}
New formats come and go, albeit at a reasonably slow pace. 2019 saw the introduction of the Pioneer format, a non-rotating card pool containing every card from the Return to Ravnica set in 2012 onward. When the next new format launches, our model will happily ingest and transform only a subset of the available format legality data.
As we did in our last installment, we’re going to set up a test. Since we’re not dealing with a single column, the accepted_values test we used for the rarity codes won’t work. We’re going to need a schema test.
A first draft at this test SQL looks like this:
with found_formats as (
    select distinct
        jsonb_object_keys(card->'legalities') as format
    from
        {{ ref('raw_cards') }}
)
select format
from found_formats
where format not in (
    'standard',
    'modern',
    -- etc.
    'pioneer'
)
Note: we use {{ ref('raw_cards') }} in our schema tests, just as we do in our model definitions, to let dbt know when a given test should be run. There’s little point in continuing with other transformations based on the raw_cards model (i.e. all of the rest of them) if there’s a known problem. Absent the relationship-identifying call to ref(), dbt would not run our test until the very end of the transformation pipeline.
It’s a sound test, it really is. As soon as a new format is launched, and a new key in the legalities JSON object arrives from Scryfall, dbt will halt the transformation pipeline and let us know.
The one thing I don’t like about it is this: the list of known formats is repeated. If we add the new format code to the model SQL but not the test, the test will still fail. If we add it to the test’s WHERE clause but forget to update the model, we have turned a breaking change into a horribly breaking change.
In software engineering, there is a widely adopted principle called DRY – Don’t Repeat Yourself. It’s a more general application of the ideas behind third normal form in database design. Repetition of truths in software engineering–magic numbers, validation routines, etc.–invariably leads to inconsistency as some copies get updated and others do not. The acronym is handy in that it also serves as a verb. How can we DRY up this format list?
The first tool we might reach for would be a database table. After all, what is a list of format codes if not a table of a single domain and very very few rows? dbt even has a built-in mechanism for populating such tables: seeds.
$ mkdir -p seeds/staging
$ vim seeds/staging/known_format_codes.csv
A seed takes a CSV file and turns it into a table in the database. Here’s our CSV file:
format_code
"alchemy"
"brawl"
"commander"
"duel"
"explorer"
"future"
"gladiator"
"historic"
"historicbrawl"
"legacy"
"modern"
"oldschool"
"pauper"
"paupercommander"
"penny"
"pioneer"
"premodern"
"standard"
"vintage"
To load this seed file, use dbt seed. This will create a table called known_format_codes. Future dbt run calls that involve references to seed tables will transparently bring these seed files into the database. With this lookup table at our disposal, we can rewrite our schema test to be more generic:
with
known_formats as (
    select format_code as known_code
    from {{ ref('known_format_codes') }}
),
found_formats as (
    select distinct
        jsonb_object_keys(card->'legalities') as found_code
    from
        {{ ref('raw_cards') }}
)
select found_code, known_code
from found_formats
left join known_formats
    on known_formats.known_code = found_formats.found_code
where known_code is null
(Go ahead and put this in tests/staging/no_new_format_codes.sql, so that dbt test will always run it.)
We’ve traded having to update the WHERE clause of our schema test SQL for having to update a CSV file. Ideally, we’d like our Jinja for loop to pull from this table as well.
Sadly, that’s not going to happen. To understand why, we have to dig a little deeper into the nuts and bolts of dbt. At its core, dbt is a SQL orchestration tool. It generates SQL code from a combination of source SQL code, user-provided templating (hello, for loops!) and some internal know-how. This is known as the compilation phase. If you want, you can get dbt to just compile your SQL code with the dbt compile command. Check the contents of the target/ directory afterwards to see what it’s done.
Compilation happens without connectivity to the database. That means we cannot know what is in the database as we’re building out the SQL to change what is in the database. That’s actually a feature. With a feedback loop from the database back into the dbt compilation routines, we could end up in some very difficult situations that defy reason.
Upon further reflection, what we are trying to do–have a table’s contents drive the columnar definition of another table–is a bad idea. All we really wanted to do was have a single definition of our list of formats. For that, we can use dbt variables.
We can define variables in our dbt_project.yml manifest and then use them from our Jinja templates. This allows us to put the list of formats in one place (the manifest) and then use it from both the model SQL and the schema test.
Variables go under the vars: section of dbt_project.yml:
# other dbt_project.yml stuff, like models: and tests: ...

vars:
  scryfall:
    known_formats:
      - alchemy
      - brawl
      - commander
      - duel
      - explorer
      - future
      - gladiator
      - historic
      - historicbrawl
      - legacy
      - modern
      - oldschool
      - pauper
      - paupercommander
      - penny
      - pioneer
      - premodern
      - standard
      - vintage
We now have a global list of formats. Let’s revisit our schema test, only this time we will programmatically build up the WHERE clause:
with found_formats as (
    select distinct
        jsonb_object_keys(card->'legalities') as format
    from {{ ref('raw_cards') }}
)
select format
from found_formats
where format not in (
    {% for format in var('known_formats') %}
    '{{ format }}'
    {% if not loop.last %},{% endif %}
    {% endfor %}
)
Since this is a Jinja variable, we can also use it in our model SQL:
select
    -- other fields from our staged_cards query...

    -- legalities
    {% for format in var('known_formats') %}
    card->'legalities'->>'{{ format }}' = 'legal'
        as legal_in_{{ format }}
    {% if not loop.last %},{% endif %}
    {% endfor %}
from
    {{ ref('raw_cards') }}
Et voila! When a new format comes out, the pipeline halts thanks to our schema test; we update the known_formats variable and run it again. We no longer need our seed CSV file, so we can remove it outright.
This same pattern shows up a half dozen more times in this particular data set. Each time it does, we can solve it the same way:
- Build a list of known values in a dbt variable,
- Write a schema test to look for unknown values,
- Modify the staged_cards model to add new columns.
In a few cases, we will even use steps 1 and 2 (without step 3) to verify that we aren’t ignoring keys in the JSON objects we deal with.
For example, there is a prices object that holds current market rates for printed cards in a variety of currencies and variations. Some Magic cards have a printing treatment called “foiling” that makes parts of the card semi-reflective. These flashy cards are always more rare than their non-foil versions, and therefore command higher prices on the open market. To ensure that we are capturing as much data from Scryfall as possible, we may want to validate that new currencies or finishes haven’t been added since we last wrote our price ingestion code!
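Such a test would follow exactly the pattern above. Here is a rough sketch, assuming we define a known_price_keys variable in dbt_project.yml alongside known_formats; the variable name is ours, and the actual set of price keys should be taken from Scryfall’s documentation:

with found_prices as (
    select distinct
        jsonb_object_keys(card->'prices') as price_key
    from {{ ref('raw_cards') }}
)
select price_key
from found_prices
where price_key not in (
    {% for key in var('known_price_keys') %}
    '{{ key }}'
    {% if not loop.last %},{% endif %}
    {% endfor %}
)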
In this installment, we’ve managed to do some pretty amazing stuff with data transformation. We’ve built an extensible, single source-of-truth means of catching new input data. We used the power of Jinja flow control constructs to make wide Boolean columns. Hopefully, we’ve gained a better understanding of how Jinja and dbt work together, and a glimpse at what is possible in other data transformation scenarios.
Next up, we’ll finish off our exploration of the Scryfall data by extracting the set data that is embedded in the card details.
Thanks for reading this installment of the blog series Exploring Semi-Structured JSON Data Sets With dbt. If you’re struggling with ETL sprawl, trying to wrap your brain around ELT and its impact on the rest of your data operation, or just want to learn more about transforming your business using a solid SQL-based automation tool, then dbt may just be the thing you need.
In any event, you may want to read or re-read the other posts in the series. Here they are:
- Loading & Landing where we get the lab spun up, load the data, and do some clean up.
- Extracting JSON Fields in which we look at what’s in those JSON data blobs.
- Anticipating the Future by writing tests to detect new codes and schema changes.
- Re-Normalizing Data to pull a new model — card sets — out of card data.