Exploring Data With dbt: Extracting JSON Fields

James Hunt

In the last essay, we got our lab environment spun up (thanks to the wonders of Docker Compose), set up our dbt project, and wrote our first model to de-duplicate landed JSON blobs. Today, we’re going to roll up our sleeves and dive right into the JSON objects we’re getting out of Scryfall.

In this series on exploring data, we use dbt to plumb the depths of a semi-structured JSON data set with the help of PostgreSQL’s JSONB datatype and related functions and operators. Throughout the series we tackle everything from simple field cardinality, to future-proofing ingestion pipelines, to multi-value columnar re-assembly.

If you found us via search, and want a little background and context, you may want to read the whole series from the beginning:

  1. Loading & Landing where we get the lab spun up, load the data, and do some clean up.
  2. Extracting JSON Fields in which we look at what’s in those JSON data blobs.
  3. Anticipating the Future by writing tests to detect new codes and schema changes.
  4. Re-Normalizing Data to pull a new model — card sets — out of card data.

What’s in those JSON blobs?

As I try to wrap my brain around a new wild and woolly data set, I find myself running small experiments in something like pgAdmin or DataGrip. Those small experiments either fizzle out quickly, or evolve into larger and more complete queries, eventually becoming permanent fixtures of my dbt pipeline. I’ll be following the same path as we work through exploring the Scryfall data.

First up, let’s pick a loaded blob at random and look at the JSON payload, in all of its glory and splendor:

select data from our_stg.scryfall_data limit 1
Results of interrogating a single JSON blob for its internal structure

That’s a lot of stuff. The first thing I notice, however, is that our returned value is a JSON object, not a scalar, and definitely not a list. That means we can extract just the top-level key names and look at those:

select distinct
  jsonb_object_keys(data) as key
  from our_stg.scryfall_data
The five distinct top-level keys from all JSON blobs, everywhere

There are only five keys, which is quite manageable. Let’s see what the cardinality of each of those metadata keys (i.e., everything except the data payload itself) looks like.

select distinct data->'has_more'
  from our_stg.scryfall_data
Results of running the cardinality SQL query against the 'has_more' top-level key

As is probably obvious in retrospect, the has_more field is a Boolean and only ever takes on two values: true or false. What about the object key?

select distinct data->'object'
  from our_stg.scryfall_data
Results of running the cardinality SQL query against the 'object' top-level key

Well, at least we’re dealing with homogeneous blobs; every single one of them is an encoded list of other things.

select distinct (data->'total_cards')::integer as value
  from our_stg.scryfall_data
 order by value desc
Results of running the cardinality SQL query against the 'total_cards' top-level key

Now we’re getting somewhere. While we are dealing exclusively with lists, not all of the lists are the same length. Some of these blobs hold close to a thousand card definitions, while others are barely lists at all.

Feel free to run the same analysis on the next_page field, but it will have a high cardinality since each blob represents a single page of search results from the Scryfall REST API.
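If you would rather not page through a long list of URLs, a quick count tells the same story (a small sketch; count(distinct …) simply skips any blobs where next_page is absent):

select count(distinct data->>'next_page') as distinct_next_pages,
       count(*)                           as total_blobs
  from our_stg.scryfall_data;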

It’s time to look at the data top-level key, which we’re pretty sure will be a list. We can verify that with the jsonb_typeof function:

select distinct jsonb_typeof(data->'data')
  from our_stg.scryfall_data

Note: if PostgreSQL spits back an error about a json function not existing, check that you are calling the jsonb_ variant, instead of the json_ variant. PostgreSQL matches up functions based on their name and their argument types. Since json_ functions are only defined for json-typed inputs, and jsonb_ functions are only defined for jsonb-typed inputs, this error is way more common than you would think.

Results of running the type-of SQL query against the 'data' top-level key
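As an aside, here is what the mismatch described in that note looks like in practice (a quick sketch; the exact error text may differ slightly between PostgreSQL versions):

-- fails, because json_typeof() only accepts json and our column is jsonb:
--   ERROR:  function json_typeof(jsonb) does not exist
select json_typeof(data->'data') from our_stg.scryfall_data limit 1;

-- the jsonb_ variant works as expected
select jsonb_typeof(data->'data') from our_stg.scryfall_data limit 1;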

Everything’s an array, which agrees with our understanding of the top-level object field, and gives us our next step: unnesting the arrays into result sets!

select jsonb_array_elements(data->'data')
  from our_stg.scryfall_data
Results of unnesting the 'data' array, across all de-duplicated blobs

That’s not a bad-looking table. Each row represents a single card, and the artificial boundaries between the blobs have disappeared. If we want to, we can include lineage as well:

select filename  as source_file_name,
       sha1sum   as source_file_sha1sum,
       loaded_at as source_file_loaded_at,
       jsonb_array_elements(data->'data')
         as card
  from our_stg.scryfall_data
Results of adding blob lineage to the unnested card data

The unnesting behavior of the jsonb_array_elements function acts like a sort of inner join, duplicating the singleton fields of each blob in scryfall_data once for each element of the data array in each JSON payload.
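A toy example makes that join-like behavior easier to see (a sketch; the blob here is a made-up literal, not real Scryfall data):

-- one scalar field plus a three-element array yields three rows,
-- with the scalar repeated on each one
select blob->>'file'                       as file,
       jsonb_array_elements(blob->'cards') as card
  from (select '{"file": "a.json", "cards": [1, 2, 3]}'::jsonb as blob) as t;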

If we use a common table expression, we find we can even pull things out of the unnested JSON card object and store those as new columns. Let’s pull out three identifying bits of information for each card: its Print ID, its Oracle ID, and its name:

with cards as (
  select filename, sha1sum, loaded_at,
         jsonb_array_elements(data->'data')
           as card
    from our_stg.scryfall_data
)
select card->>'id'        as print_id,
       card->>'oracle_id' as oracle_id,
       card->>'name'      as name,

       filename  as source_file_name,
       sha1sum   as source_file_sha1sum,
       loaded_at as source_file_loaded_at,
       card
  from cards

The ->> there is not a typo; it extracts the key on the right from the object on the left and coerces the result to a textual value. That is, a->b pulls the value of the b key out of a as a JSONB value, whereas a->>b pulls the value of the b key from a as TEXT.
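If the distinction feels abstract, a one-off query against a literal object makes it visible (a tiny sketch; the object here is just an example value):

-- ->  keeps the JSONB type: the result still carries its JSON quoting
-- ->> coerces the value to plain text
select ('{"name": "Llanowar Elves"}'::jsonb)->'name'  as as_jsonb,
       ('{"name": "Llanowar Elves"}'::jsonb)->>'name' as as_text;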

Now we have a comprehensive enough (and complicated enough!) query to persist into our dbt pipeline. We’ll call this one raw_cards by putting the latest incarnation of the SQL into models/staging/raw_cards.sql and running a specific, targeted dbt run again (more on this in a little bit).

→  dbt run --select staging.raw_cards
17:02:51  Running with dbt=1.1.0
17:02:51  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging

17:02:51  Found 2 models, 0 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0
metrics
17:02:51
17:02:52  Concurrency: 4 threads (target='dev')
17:02:52
17:02:52  1 of 1 START view model our_stg.raw_cards ...................................... [RUN]
17:02:52  1 of 1 OK created view model our_stg.raw_cards ................................. [CREATE VIEW in 0.06s]
17:02:52
17:02:52  Finished running 1 view model in 0.25s.
17:02:52
17:02:52  Completed successfully
17:02:52
17:02:52  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

This creates another view in the our_stg schema:

pgAdmin showing our new 'raw_cards' view

A small note on inter-model references. If you try to issue a full dbt run at this point, there is a good chance your raw_cards view will go missing. Here’s why: when dbt looks at the two SQL files we’ve told it define our models, it sees nothing that links them in any dependent way. Unfortunately, they are dependent; raw_cards selects from scryfall_data in its common table expression. dbt, unaware of this fact, attempts to create both views at the same time. It’s a race condition: if raw_cards is created first, dbt will then drop and re-create the scryfall_data view (with CASCADE) and obliterate the raw_cards view it just created.

To fix this, we need to teach dbt about this dependency, and we do that by swapping out the literal “schema.table” reference with some dbt special sauce: {{ ref('model-name') }}:

with cards as (
  select filename, sha1sum, loaded_at,
         jsonb_array_elements(data->'data')
           as card
    from {{ ref('scryfall_data') }}
)
select card->>'id'        as print_id,
       card->>'oracle_id' as oracle_id,
       card->>'name'      as name,

       filename  as source_file_name,
       sha1sum   as source_file_sha1sum,
       loaded_at as source_file_loaded_at,
       card
  from cards

We’ll get more into the why later on in the series, so just trust me on this trick for now.

One last thing before we take a break for today: the de-duplicated blobs view is just that: a view. If someone else were to come along and start loading in data, or worse, delete some entries, our analysis would immediately reflect that. After all, we haven’t actually copied the fresh blobs into a separate table. We’re just filtering with a named query!

Luckily, dbt is just as able to build tables as views; by default it simply errs on the side of more compute (longer-running queries) instead of increased storage cost. In dbt parlance, this is called the model materialization, and there are four options:

  1. view, the default
  2. table, built via a create table as select … query (CTAS)
  3. ephemeral, which we won’t be covering until later
  4. incremental, which we won’t be covering at all

There are many ways to change the materialization of models, but the easiest is to manually override it on a model-by-model basis. We do this with some Jinja (the dbt templating language) in the SQL file that defines the model. Up at the top of the models/staging/scryfall_data.sql file, before everything else, put this little bit of code:

{{
  config(
    materialized='table'
  )
}}

The {{ starts Jinja mode, and }} ends it. While in that mode, config() is a call that overrides configuration for the current model. In our case, we’re changing how the model is materialized, switching from view (the default) to table. Now, when we dbt run:

→  dbt run
17:10:12  Running with dbt=1.1.0
17:10:12  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging

17:10:12  Found 2 models, 0 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0
metrics
17:10:12
17:10:12  Concurrency: 4 threads (target='dev')
17:10:12
17:10:12  1 of 2 START table model our_stg.scryfall_data ................................. [RUN]
17:10:29  1 of 2 OK created table model our_stg.scryfall_data ............................ [SELECT 378 in 16.95s]
17:10:29  2 of 2 START view model our_stg.raw_cards ...................................... [RUN]
17:10:29  2 of 2 OK created view model our_stg.raw_cards ................................. [CREATE VIEW in 0.13s]
17:10:29
17:10:29  Finished running 1 table model, 1 view model in 17.25s.
17:10:29
17:10:29  Completed successfully
17:10:29
17:10:29  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

We can see that scryfall_data is a table, not a view, and we’re safely protected from upstream changes until we run the whole pipeline.
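If you want to confirm the switch from the database side, the catalog will show it (a quick sketch; our_stg is the schema this lab has been using):

-- scryfall_data should now show up as BASE TABLE; raw_cards stays a VIEW
select table_name, table_type
  from information_schema.tables
 where table_schema = 'our_stg'
 order by table_name;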

Let’s wrap up, since the next step is a big one. Today, we’ve gotten our de-duplicated blobs from the landing area into the staging area (as a table!), and built our first model to tease apart the aggregate card lists present in each blob. Now we have a table that contains one record per printed card in the game, and we’re even tracking data lineage back to the original files! Next up: extracting even more data from each JSON card object into a staged cards table!

Let’s pull out some card data, shall we?

The stage is set, and the actors know their lines and blocking. We have a workable dbt project, some landed JSON data, and our first round of cleaning and extraction in the shape of the raw_cards model. It’s time to extract some data from the JSON.

Whenever I’m dissecting a bit of JSON, I like to understand what the top-level keys are. The keys tell you a lot about what the designers of the other system felt was important, either from a domain perspective or from a frequency-of-reference standpoint. Once these are known, and their data types validated, we can recurse into sub-objects and repeat the process. Luckily, this data set comes with some decent documentation from the fine folks over at Scryfall.

We’ll start with this exploratory SQL:

with keys as (
  select jsonb_object_keys(card) as key
    from our_stg.raw_cards
)

select key, count(*) as how_many
  from keys group by 1
Results of running our exploratory SQL query against card top-level keys

From this analysis, we can find out just how many unique keys there are (there are 82 of them), whether they are uniformly used (they are not), and just how non-uniform they are. For instance, more than half (47) of the 82 keys are found on each and every one of the 37,411 cards.

That’s actually pretty good.
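If you want to see exactly which keys are universal, the same CTE with a HAVING clause will list them (a small sketch; it compares each key’s count against the total number of cards):

with keys as (
  select jsonb_object_keys(card) as key
    from our_stg.raw_cards
)
select key
  from keys
 group by 1
having count(*) = (select count(*) from our_stg.raw_cards);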

Let’s start with the basics. Every Magic card has, at a bare minimum, the following: a print ID, an Oracle ID, a name, a type line, a converted mana cost, and a rarity.

We can easily stage these into a table, casting strings to text fields and numeric quantities as appropriate.

select
  card->>'id'          as print_id,
  card->>'oracle_id'   as oracle_id,
  card->>'name'        as name,
  card->>'type_line'   as type_line,
  card->>'cmc'         as converted_mana_cost,
  card->>'rarity'      as rarity_code
from
  our_stg.raw_cards

We can call this model staged_cards and store the SQL in models/staging/staged_cards.sql. To speed things up (as we will be querying this model a lot), let’s materialize it as a table. Here’s the full staged_cards.sql source:

{{
  config(
    materialized='table'
  )
}}

select
  card->>'id'          as print_id,
  card->>'oracle_id'   as oracle_id,
  card->>'name'        as name,
  card->>'type_line'   as type_line,
  card->>'cmc'         as converted_mana_cost,
  card->>'rarity'      as rarity_code
from
  {{ ref('raw_cards') }}

That bit at the end where we select from a curly-brace clad ref() call is a little bit of dbt making life easier for us. We covered this briefly towards the end of the last post. So far, we’ve hard-coded schema and table names. This has worked, but only because we are a single data engineer, and we are aware of the underlying structure we are asking dbt to create. As soon as anyone else on our team checks out this code to try an experiment, they will either run afoul of RBAC (if we’re lucky) or trample all over our stuff (if we’re not).

The ref() Jinja call asks dbt to figure out what schema and what table implements the named model. Often the table names will match, but the schema names are almost never knowable. Using ref(), we don’t have to worry about either, and we can let dbt handle it.

Another great thing about using references like this is that dbt can then learn about the dependencies between models. It can then use this information in determining order of operations (who gets materialized first?), and downstream propagation (if I change table X, who needs to be re-materialized?). Using ref() has no downsides, and all the upsides. We’ll be using it from here on out.

Alright, back to our fledgling staged_cards table. I’m assuming at this point you’ve run dbt run again to pick up the new table. If not, do so now.
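If you only want to build this one model, the targeted form from earlier works here as well (assuming the same staging path as before):

→  dbt run --select staging.staged_cards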

If we inspect the definition of the table in PostgreSQL, we’ll see that the converted mana cost column is textual, not numeric. That’s going to make it difficult to run queries against cards based on ranges of acceptable casting costs, so we should fix that while we model. While we know that costs are always numeric, how can we be sure that the data itself is strictly numbers?
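That column definition is easy to check straight from the catalog (a quick sketch; the schema and table names are the ones dbt has been creating in this lab):

-- converted_mana_cost comes back as text at this point
select column_name, data_type
  from information_schema.columns
 where table_schema = 'our_stg'
   and table_name   = 'staged_cards'
 order by ordinal_position;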

We can look at the data we have and see what we see.

select distinct card->'cmc' from our_stg.raw_cards
Results of running our converted mana cost cardinality query against our staged card data

From the dozen and a half rows we get back, we see two things:

  1. Casting cost is encoded as a decimal number in JSON(B)
  2. The underlying data is integral

The last point makes sense if you have any familiarity with the game. With the exception of some deliberately wacky cards (which are not legal in most formats), you cannot spend half of a mana. This gives us the peace of mind to cast the JSON floating-point number to a SQL integer. We can adjust our staged_cards model definition accordingly:

select
  card->>'id'             as print_id,
  card->>'oracle_id'      as oracle_id,
  card->>'name'           as name,
  card->>'type_line'      as type_line,
  (card->'cmc')::integer  as converted_mana_cost,
  card->>'rarity'         as rarity_code
from
  {{ ref('raw_cards') }}

(The parentheses are necessary because the cast operator, ::, binds tighter than the JSONB dereference operator, ->; the parentheses “fix” the precedence and clarify our intent.)

If we ever get a card object with a non-integral number, or even textual, non-numeric junk, our cast will fail noisily. Don’t believe me? Try these queries out in your favorite Postgres instance:

select 'foo'::integer;
select '4.56'::integer;

We’ll revisit this later when we implement tests on our landed sources.

Time for some more fields! Of particular interest is color identity. In the game of Magic, there are five colors: white, blue, black, red, and green. Each represents a different ideological faction in the game, and each has its own strengths and weaknesses. Blue, for example, is really good at card draw (for the blue player) and forced discard (for their opponents). Each card is associated with zero or more colors. Some cards have two or more colors associated with them; for instance a blue/red card is both red and blue.

When querying the pool of available cards, we will often want to restrict our results to cards based on these identities. Because colors can be combined, we need to be able to individually query each component of the color identity.

Scryfall stores this identity as a list of color codes, taken from the first initial of the color name and using ‘U’ for blue (black gets to be ‘B’):

select distinct card->'color_identity'
  from our_stg.raw_cards
 limit 15
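Before splitting anything apart, it is worth a quick look at how many colors cards actually carry (a small sketch; jsonb_array_length simply counts the elements of each identity list):

select jsonb_array_length(card->'color_identity') as colors,
       count(*)                                   as how_many
  from our_stg.raw_cards
 group by 1
 order by 1;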

What we need to do is split that apart into five separate columns, each asserting whether the card in question is considered “in” a given color. I want to be able to write:

select * from cards
 where is_in_blue = true;

And find all blue cards.

To do that, we need a new PostgreSQL operator for list membership: the ? operator, which tests whether a string exists as an element of a JSONB array (or as a top-level key of a JSONB object):

select '["a", "b", "c"]'::jsonb ? 'a' as has_a;

The above query returns true, because the JSONB list on the left contains the string a as an element. Our new select query for color identity is then:

select
  card->>'id'             as print_id,
  card->>'oracle_id'      as oracle_id,
  card->>'name'           as name,
  card->>'type_line'      as type_line,
  (card->'cmc')::integer  as converted_mana_cost,
  card->>'rarity'         as rarity_code,

  card->'color_identity' ? 'W' as is_in_white,
  card->'color_identity' ? 'U' as is_in_blue,
  card->'color_identity' ? 'B' as is_in_black,
  card->'color_identity' ? 'R' as is_in_red,
  card->'color_identity' ? 'G' as is_in_green
from
  {{ ref('raw_cards') }}

Making this an intermediate model allows us to do things like this later on in the modeling pipeline:

select
  *,
  not(is_in_white or
      is_in_blue  or
      is_in_black or
      is_in_red   or
      is_in_green) as is_colorless
from
  {{ ref('staged_cards') }}

This captures the aspect of being colorless, that is, having no color identity at all. Artifacts, which are tools rather than instruments of any particular faction or ideology, are almost always colorless, and this new column makes them easy to query for.
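For instance, once staged_cards is rebuilt with the new columns, a quick sanity check like this should come back full of colorless cards (a sketch; the LIKE filter on the type line is a rough way to single out artifacts):

select name, type_line
  from our_stg.staged_cards
 where not (is_in_white or is_in_blue or is_in_black or is_in_red or is_in_green)
   and type_line like '%Artifact%'
 limit 20;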

We’ve made some progress in our quest to flatten the Scryfall data into a relational model. We now have a staged_cards model in dbt, which is properly understood to be dependent on the raw_cards view we created at the outset of this task. We have also made the source data more useful by providing some additional querying handles (the is_in_X fields) to help players find cards that match their wants and needs.

A lot of the changes we made in this post rely on the underlying data itself having a given shape. We expect that there will be no sixth color added to the game any time soon, for instance. We have also implicitly leaked an assumption from our source data, via the rarity_code field. In the next installment, we’ll leverage dbt’s excellent testing facilities to codify these assertions and tighten up the model.

Thanks for reading this installment of the blog series Exploring Semi-Structured JSON Data Sets With dbt. If you’re struggling with ETL sprawl, trying to wrap your brain around ELT and its impact on the rest of your data operation, or just want to learn more about transforming your business using a solid SQL-based automation tool, then dbt may just be the thing you need.

In any event, you may want to read or re-read the other posts in the series. Here they are:

  1. Loading & Landing where we get the lab spun up, load the data, and do some clean up.
  2. Extracting JSON Fields in which we look at what’s in those JSON data blobs.
  3. Anticipating the Future by writing tests to detect new codes and schema changes.
  4. Re-Normalizing Data to pull a new model — card sets — out of card data.