In the last essay, we got our lab environment spun up (thanks to the wonders of Docker Compose), set up our dbt project, and wrote our first model to de-duplicate landed JSON blobs. Today, we’re going to roll up our sleeves and dive right into the JSON objects we’re getting out of Scryfall.
In this series on exploring data, we use dbt to plumb the depths of a semi-structured JSON data set with the help of PostgreSQL’s JSONB datatype and related functions and operators. Throughout the series we tackle everything from simple field cardinality, to future-proofing ingestion pipelines, to multi-value columnar re-assembly.
If you found us via search, and want a little background and context, you may want to read the whole series from the beginning:
- Loading & Landing where we get the lab spun up, load the data, and do some clean up.
- Extracting JSON Fields in which we look at what’s in those JSON data blobs.
- Anticipating the Future by writing tests to detect new codes and schema changes.
- Re-Normalizing Data to pull a new model — card sets — out of card data.
What’s in those JSON blobs?
As I try to wrap my brain around a new wild and woolly data set, I find myself running small experiments in something like pgAdmin or DataGrip. Those small experiments either fizzle out quickly, or evolve into larger and more complete queries, eventually becoming permanent fixtures of my dbt pipeline. I’ll be following the same path as we work through exploring the Scryfall data.
First up, let’s pick a loaded blob at random and look at the JSON payload, in all of its glory and splendor:
select data from our_stg.scryfall_data limit 1

That’s a lot of stuff. The first thing I notice, however, is that our returned value is a JSON object, not a scalar, and definitely not a list. That means we can extract just the top-level key names and look at those:
select distinct
jsonb_object_keys(data) as key
from our_stg.scryfall_data

There are only five keys, which is quite manageable. Let’s see what the cardinality of those metadata keys (i.e. everything except the payload itself) looks like.
select distinct data->'has_more'
from our_stg.scryfall_data

As is probably obvious in retrospect, the has_more field is a Boolean and only ever takes on two values: true or false. What about the object key?
select distinct data->'object'
from our_stg.scryfall_data

Well, at least we’re dealing with homogeneous blobs; every single one of them is an encoded list of other things.
select distinct (data->'total_cards')::integer as value
from our_stg.scryfall_data
order by value desc

Now we’re getting somewhere. While we are dealing exclusively with lists, not all of the lists are the same length. Some of these blobs hold close to a thousand card definitions, while others are barely lists at all.
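If you would rather measure the list lengths directly than trust the total_cards field, here is a quick sketch (it assumes the data key really is an array, which we confirm below):
select jsonb_array_length(data->'data') as cards_in_blob
from our_stg.scryfall_data
order by cards_in_blob desc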
Feel free to run the same analysis on the next_page field, but it will have a high cardinality, since each blob represents a single page of search results from the Scryfall REST API.
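If you do want a quick answer without scrolling through every value, a hedged sketch of that check might look like this (expect the distinct count to be close to the number of blobs):
select count(distinct data->>'next_page') as distinct_next_pages,
       count(*) as blobs
from our_stg.scryfall_data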
It’s time to look at the data top-level key, which we’re pretty sure will be a list. We can verify that with the jsonb_typeof function:
select distinct jsonb_typeof(data->'data')
from our_stg.scryfall_data
Note: if PostgreSQL spits back an error about a json function not existing, check that you are calling the jsonb_ variant, instead of the json_ variant. PostgreSQL matches up functions based on their name and their argument types. Since json_ functions are only defined for json type inputs, and jsonb_ functions are only defined for jsonb type inputs, this error is way more common than you would think.
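To see the mismatch for yourself, here is a tiny sketch using a made-up literal value rather than our real data:
-- fails with an error like "function json_typeof(jsonb) does not exist":
-- select json_typeof('{"a": 1}'::jsonb);

-- the jsonb_ variant matches the argument type and works fine:
select jsonb_typeof('{"a": 1}'::jsonb);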

Everything’s an array, which agrees with our understanding of the top-level object field, and gives us our next step: unnesting the arrays into result sets!
select jsonb_array_elements(data->'data')
from our_stg.scryfall_data

That’s not a bad-looking table. Each row represents a single card, and the artificial boundaries between the blobs have disappeared. If we want to, we can include lineage as well:
select filename as source_file_name,
       sha1sum as source_file_sha1sum,
       loaded_at as source_file_loaded_at,
       jsonb_array_elements(data->'data') as card
from our_stg.scryfall_data

The unnesting behavior of the jsonb_array_elements function acts like a sort of inner join, duplicating the singleton fields of each blob in scryfall_data once for each element of the data array in each JSON payload.
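If that “sort of inner join” behavior feels abstract, here is a tiny, self-contained sketch using a made-up blob; the scalar column repeats once per array element:
select t.label, jsonb_array_elements(t.payload) as element
from (values ('blob-1', '[1, 2, 3]'::jsonb)) as t(label, payload);
-- => three rows, each with label = 'blob-1'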
If we use a common table expression, we find we can even pull things out of the unnested JSON card object and store those as new columns. Let’s pull out three identifying bits of information for each card: its Print ID, its Oracle ID, and its name:
with cards as (
    select filename, sha1sum, loaded_at,
           jsonb_array_elements(data->'data') as card
    from our_stg.scryfall_data
)
select card->>'id' as print_id,
card->>'oracle_id' as oracle_id,
card->>'name' as name,
filename as source_file_name,
sha1sum as source_file_sha1sum,
loaded_at as source_file_loaded_at,
card
from cards
The ->> there is not a typo; it extracts the key on the right from the object on the left and coerces it to a textual value. That is, a->b pulls the value of the b key out of a as a JSONB object, whereas a->>b pulls the value of the b key from a as TEXT.
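A quick illustration with a throwaway literal may make the distinction clearer:
select ('{"b": "hello"}'::jsonb)->'b'  as as_jsonb,   -- "hello" (still JSONB, quotes and all)
       ('{"b": "hello"}'::jsonb)->>'b' as as_text;    -- hello (plain text)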
Now we have a comprehensive enough (and complicated enough!) query to persist into our dbt pipeline. We’ll call this one raw_cards, by putting the latest incarnation of the SQL into models/staging/raw_cards.sql and running a specific, targeted dbt run again (more on this in a little bit).
→ dbt run --select staging.raw_cards
17:02:51 Running with dbt=1.1.0
17:02:51 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging
17:02:51 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0
metrics
17:02:51
17:02:52 Concurrency: 4 threads (target='dev')
17:02:52
17:02:52 1 of 1 START view model our_stg.raw_cards ...................................... [RUN]
17:02:52 1 of 1 OK created view model our_stg.raw_cards ................................. [CREATE VIEW in 0.06s]
17:02:52
17:02:52 Finished running 1 view model in 0.25s.
17:02:52
17:02:52 Completed successfully
17:02:52
17:02:52 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
This creates another view in the our_stg schema:

A small note on inter-model references. If you try to issue a full dbt run at this point, there is a good chance your raw_cards view will go missing. Here’s why: when dbt looks at the two SQL files that define our models, it sees nothing that links them in any dependent way. Unfortunately, they are dependent; raw_cards uses scryfall_data in its common table expression. dbt, unaware of this fact, attempts to create both views at the same time. It’s a race condition. If raw_cards is created first, then dbt will drop the scryfall_data view (with CASCADE) and obliterate the raw_cards view it just created.
To fix this, we need to teach dbt about this dependency, and we do that by swapping out the literal “schema.table” reference with some dbt special sauce: {{ ref('model-name') }}:
with cards as (
    select filename, sha1sum, loaded_at,
           jsonb_array_elements(data->'data') as card
    from {{ ref('scryfall_data') }}
)
select card->>'id' as print_id,
card->>'oracle_id' as oracle_id,
card->>'name' as name,
filename as source_file_name,
sha1sum as source_file_sha1sum,
loaded_at as source_file_loaded_at,
card
from cards
We’ll get more into the why later on in the series, so just trust me on this trick for now.
One last thing before we take a break for today: the de-duplicated blobs view is just that: a view. If someone else were to come along and start loading in data, or worse, delete some entries, our analysis would immediately reflect that. After all, we haven’t actually copied the fresh blobs into a separate table. We’re just filtering with a named query!
Luckily, dbt is just as able to build tables as views; it just happens to err on the side of more compute (longer queries) instead of increased storage cost. In dbt parlance, this is called the model materialization, and there are four options:
- view, the default
- table, built via a create table as select … query (CTAS)
- ephemeral, which we won’t be covering until later
- incremental, which we won’t be covering at all
There are many ways to change the materialization of models, but the easiest is to manually override it on a model-by-model basis. We do this with some Jinja (the dbt templating language) in the SQL file that defines the model. Up at the top of the models/staging/scryfall_data.sql file, before everything else, put this little bit of code:
{{
config(
materialized='table'
)
}}
The {{ starts Jinja mode, which }} ends. While in that mode, config() is a call that overrides configuration for the current model. In our case, we’re changing how the model is materialized, switching from view (the default) to table. Now, when we dbt run:
→ dbt run
17:10:12 Running with dbt=1.1.0
17:10:12 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging
17:10:12 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0
metrics
17:10:12
17:10:12 Concurrency: 4 threads (target='dev')
17:10:12
17:10:12 1 of 2 START table model our_stg.scryfall_data ................................. [RUN]
17:10:29 1 of 2 OK created table model our_stg.scryfall_data ............................ [SELECT 378 in 16.95s]
17:10:29 2 of 2 START view model our_stg.raw_cards ...................................... [RUN]
17:10:29 2 of 2 OK created view model our_stg.raw_cards ................................. [CREATE VIEW in 0.13s]
17:10:29
17:10:29 Finished running 1 table model, 1 view model in 17.25s.
17:10:29
17:10:29 Completed successfully
17:10:29
17:10:29 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
We can see that scryfall_data is a table, not a view, and we’re safely protected from upstream changes until we run the whole pipeline.
Let’s wrap up, since the next step is a big one. Today, we’ve gotten our de-duplicated blobs from the landing area into the staging area (as a table!), and built our first model to tease apart the aggregate card lists present in each blob. Now we have a table that contains one record per printed card in the game, and we’re even tracking data lineage back to the original files! Next up: extracting even more data from each JSON card object into a staged cards table!
Let’s pull out some card data, shall we?
The stage is set, and the actors know their lines and blocking. We have a workable dbt project, some landed JSON data, and our first round of cleaning and extraction, in the form of the raw_cards view. It’s time to extract some data from the JSON.
Whenever I’m dissecting a bit of JSON, I like to understand what the top-level keys are. The keys tell you a lot about what the designers of the other system felt was important, either from a domain perspective or from a frequency-of-reference standpoint. Once these are known, and their data types validated, we can recurse into sub-objects and run the same process. Luckily, this data set comes with some decent documentation from the fine folks over at Scryfall.
We’ll start with this exploratory SQL:
with keys as (
select jsonb_object_keys(card) as key
from our_stg.raw_cards
)
select key, count(*) as how_many
from keys group by 1

From this analysis, we can find out just how many unique keys there are (there are 82 of them), whether they are uniformly used (they are not), and just how non-uniform they are. For instance, more than half (47) of the 82 keys are found on each and every one of the 37,411 cards.
That’s actually pretty good.
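If you want to see exactly which keys show up on every card, a hedged follow-up to the query above does the trick with a having clause:
with keys as (
    select jsonb_object_keys(card) as key
    from our_stg.raw_cards
)
select key, count(*) as how_many
from keys
group by 1
having count(*) = (select count(*) from our_stg.raw_cards)
order by key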
Let’s start with the basics. Every Magic card has, at a bare minimum, the following:
- A name
- A type line, indicating what type of card it is (land, sorcery spell, artifact, etc.)
- An Oracle ID, which identifies the same card across multiple printings (i.e. reprints and re-releases)
- A Print ID, which identifies a specific printing of a given oracle card
- A casting cost (often referred to as a converted mana cost, abbreviated cmc)
- Rarity, which tracks how likely one is to find a given card in a random sealed pack of cards
We can easily stage these into a table, casting strings to text fields and numeric quantities as appropriate.
select
card->>'id' as print_id,
card->>'oracle_id' as oracle_id,
card->>'name' as name,
card->>'type_line' as type_line,
card->>'cmc' as converted_mana_cost,
card->>'rarity' as rarity_code
from
our_stg.raw_cards
We can call this model a staged card and store the SQL in models/staging/staged_cards.sql. To speed things up (as we will be querying this model a lot), let’s materialize it as a table. Here’s the full staged_cards.sql source:
{{
config(
materialized='table'
)
}}
select
card->>'id' as print_id,
card->>'oracle_id' as oracle_id,
card->>'name' as name,
card->>'type_line' as type_line,
card->>'cmc' as converted_mana_cost,
card->>'rarity' as rarity_code
from
{{ ref('raw_cards') }}
That bit at the end, where we select from a curly-brace-clad ref() call, is a little bit of dbt making life easier for us. We covered this briefly towards the end of the last post. So far, we’ve hard-coded schema and table names. This has worked, but only because we are a single data engineer, and we are aware of the underlying structure we are asking dbt to create. As soon as anyone else on our team checks out this code to try an experiment, they will either run afoul of RBAC (if we’re lucky) or trample all over our stuff (if we’re not).
The ref() Jinja call asks dbt to figure out which schema and which table implement the named model. Often the table names will match, but the schema names are almost never knowable up front. Using ref(), we don’t have to worry about either, and we can let dbt handle it.
Another great thing about using references like this is that dbt can learn about the dependencies between models. It can then use this information to determine the order of operations (who gets materialized first?) and downstream propagation (if I change table X, who needs to be re-materialized?). Using ref() has no downsides, and all the upsides. We’ll be using it from here on out.
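As a side benefit (a hedged aside; check the dbt docs for the exact selector syntax in your version), once dbt knows the dependency graph you can lean on it from the command line too. Graph selectors let you run a model and everything downstream of it:
→ dbt run --select raw_cards+
In our case that would rebuild raw_cards and then staged_cards, in dependency order.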
Alright, back to our fledgling staged_cards table. I’m assuming at this point you’ve run dbt run again to pick up the new table. If not, do so now.
If we inspect the definition of the table in PostgreSQL, we’ll see that the converted mana cost column is textual, not numeric. That’s going to make it difficult to run queries against cards based on ranges of acceptable casting costs, so we should fix that while we model. While we know that costs are always numeric, how can we be sure that the data itself is strictly numbers?
We can look at the data we have and see what we see.
select distinct card->'cmc' from our_stg.raw_cards

From the dozen and a half rows we get back, we see two things:
- Casting cost is encoded as a decimal number in JSON(B)
- The underlying data is integral
The last point makes sense if you have any familiarity with the game. With the exception of some deliberately wacky cards (which are not legal in most formats), you cannot spend half of a mana. This gives us the peace of mind that we can cast the JSON floating point number to a SQL integer. We can adjust our staged_cards model definition accordingly:
select
card->>'id' as print_id,
card->>'oracle_id' as oracle_id,
card->>'name' as name,
card->>'type_line' as type_line,
(card->'cmc')::integer as converted_mana_cost,
card->>'rarity' as rarity_code
from
{{ ref('raw_cards') }}
(The parentheses are necessary because the cast operator :: binds tighter than the JSONB object dereference operator ->; the parentheses “fix” the precedence and clarify our intent.)
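If you want to see the precedence trap for yourself, here is a small sketch with a made-up literal:
-- with parentheses, we cast the value we extracted (this is what the model does):
select ('{"cmc": 4.0}'::jsonb -> 'cmc')::integer as cmc;

-- without them, :: binds to the key name first, so this parses as
-- card -> ('cmc'::integer) and errors out before it ever touches the JSON:
-- select '{"cmc": 4.0}'::jsonb -> 'cmc'::integer;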
If we ever get a card object with a non-integral number, or even textual, non-numeric junk, our cast will fail noisily. Don’t believe me? Try these queries out in your favorite Postgres instance:
select 'foo'::integer;
select '4.56'::integer;
We’ll revisit this later when we implement tests on our landed sources.
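In the meantime, a quick, hedged sanity check is to ask PostgreSQL what JSON type every cmc value reports as; anything other than number would be a red flag:
select jsonb_typeof(card->'cmc') as cmc_type,
       count(*) as how_many
from our_stg.raw_cards
group by 1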
Time for some more fields! Of particular interest is color identity. In the game of Magic, there are five colors: white, blue, black, red, and green. Each represents a different ideological faction in the game, and each has its own strengths and weaknesses. Blue, for example, is really good at card draw (for the blue player) and forced discard (for their opponents). Each card is associated with zero or more colors. Some cards have two or more colors associated with them; for instance a blue/red card is both red and blue.
When querying the pool of available cards, we will often want to restrict our results to cards based on these identities. Because colors can be combined, we need to be able to individually query each component of the color identity.
Scryfall stores this identity as a list of color codes, taken from the first initial of the color name and using ‘U’ for blue (black gets to be ‘B’):
select distinct card->'color_identity'
from our_stg.raw_cards
limit 15
What we need to do is split that apart into five separate columns, each asserting whether the card in question is considered “in” a given color. I want to be able to write:
select * from cards
where is_in_blue = true;
And find all blue cards.
To do that, we need a new PostgreSQL operator for list membership. That operator’s name is ?:
select '["a", "b", "c"]'::jsonb ? 'a' as has_a;
The above query returns true, because the JSONB list on the left contains the string a as an element. Our new select query for color identity is then:
select
card->>'id' as print_id,
card->>'oracle_id' as oracle_id,
card->>'name' as name,
card->>'type_line' as type_line,
(card->'cmc')::integer as converted_mana_cost,
card->>'rarity' as rarity_code,
card->'color_identity' ? 'W' as is_in_white,
card->'color_identity' ? 'U' as is_in_blue,
card->'color_identity' ? 'B' as is_in_black,
card->'color_identity' ? 'R' as is_in_red,
card->'color_identity' ? 'G' as is_in_green
from
{{ ref('raw_cards') }}
Making this an intermediate model allows us to do things like this later on in the modeling pipeline:
select
*,
not(is_in_white or
is_in_blue or
is_in_black or
is_in_red or
is_in_green) as is_colorless
from
{{ ref('staged_cards') }}
This captures the aspect of being colorless, that is, having no color identity. Artifacts, which are not usually aligned with any faction or ideology (they are tools), are almost always colorless. This allows us to easily query for those.
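As a concrete example of the payoff, once staged_cards is materialized, finding cheap colorless cards is a one-liner (a sketch to run directly against the warehouse, rather than inside a model):
select name, converted_mana_cost
from our_stg.staged_cards
where not (is_in_white or is_in_blue or is_in_black or is_in_red or is_in_green)
  and converted_mana_cost <= 2
order by name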
We’ve made some progress in our quest to flatten the Scryfall data into a relational model. We now have a staged_cards model in dbt, which is properly understood to be dependent on the raw_cards view we created at the outset of this task. We have also made the source data more useful by providing some additional querying handles (the is_in_X fields) to help players find cards that match their wants and needs.
A lot of the changes we made in this post rely on the underlying data itself having a given shape. We expect that there will be no sixth color added to the game any time soon, for instance. We have also implicitly leaked an assumption from our source data via the rarity_code field. In the next installment, we’ll leverage dbt’s excellent testing facilities to codify these assertions and tighten up the model.
Thanks for reading this installment of the blog series Exploring Semi-Structured JSON Data Sets With dbt. If you’re struggling with ETL sprawl, trying to wrap your brain around ELT and its impact on the rest of your data operation, or just want to learn more about transforming your business using a solid SQL-based automation tool, then dbt may just be the thing you need.
In any event, you may want to read or re-read the other posts in the series. Here they are:
- Loading & Landing where we get the lab spun up, load the data, and do some clean up.
- Extracting JSON Fields in which we look at what’s in those JSON data blobs.
- Anticipating the Future by writing tests to detect new codes and schema changes.
- Re-Normalizing Data to pull a new model — card sets — out of card data.