dbt is fantastic. It’s mostly SQL, with enough convention and some templating thrown in to cover the gaps for things like dynamic column families, open-ended data-driven joins, etc.

To show off the sheer power and flexibility that dbt brings the data modeler, I want to explore some semi-structured data, and build up the dbt models and pipeline we need to flatten it into a relational, queryable model.

The data in question is from Scryfall, a search service built on top of the collectible trading card game Magic: the Gathering. I know–it’s nerdy–but bear with me because it is an approachable enough data set with some surprisingly deep domain-specific rules and logic. Not too unlike the internal systems and processes of data customers and employers.

The Scryfall Data Up Close

We’re going to dig into the JSON structure of the data that Scryfall exports, but first, let’s cover a bit of background about the game itself, and its most important concept: the card.

Screenshot of the scryfall.com home page and primary search interface

In Magic: the Gathering, two or more players battle it out by casting spells and summoning mythical, fantastical creatures. Spells, and the resources used to pay for them, are represented by cards. Wizards of the Coast, the Hasbro subsidiary that invented the game, has printed something like 25,000 distinct cards in the last 30 years. A far cry from the 52-card bicycle deck we all play Solitaire with.

Each card has a bunch of fixed, always-present attributes; things like card name, type of card (is it a spell, a land, a creature, etc.), casting cost, etc. As we dig deeper into the data, and build out our relational model, we’ll become more familiar with these attributes. Luckily, Scryfall has fairly extensive documentation, so you won’t have to be intimately familiar with every corner of the 200-page comprehensive rulebook.

A laboratory befitting a mad scientist

Before we can explore, we need a lab. For maximum reach on this exercise, we’re going to target a PostgreSQL instance running in a local Docker container. We’ll use Docker Compose to manage volumes, port forwarding, and environment variables so that we can focus on the data.

Here’s the code: https://gitlab.com/vivanti/blog/exploring-semi-structured-data-with-dbt.

To use it, clone the repository, enter the cloned directory, and issue the docker-compose up command:

→ docker-compose up
Starting scryfall-db     ... done
Starting scryfall-loader ... done
Attaching to scryfall-loader, scryfall-db
scryfall-db |
scryfall-db | PostgreSQL Database directory appears to contain a database; Skipping initialization
scryfall-db |
scryfall-db | 2022-07-19 14:29:45.838 UTC [1] LOG:  starting PostgreSQL 14.3 (Debian 14.3-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
scryfall-db | 2022-07-19 14:29:45.838 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
scryfall-db | 2022-07-19 14:29:45.838 UTC [1] LOG:  listening on IPv6 address "::", port 5432
scryfall-db | 2022-07-19 14:29:45.842 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
scryfall-db | 2022-07-19 14:29:45.859 UTC [26] LOG:  database system was shut down at 2022-07-19 14:29:41 UTC
scryfall-db | 2022-07-19 14:29:45.879 UTC [1] LOG:  database system is ready to accept connections
scryfall-loader | loaded /srv/scryfall/10E/cards1.json [29d956b2f7baacc5334629ff658d40cebdb97631]
scryfall-loader | loaded /srv/scryfall/10E/cards2.json [bb501917119832d6d6de9408f49476d11e1457bf]
scryfall-loader | loaded /srv/scryfall/10E/cards3.json [323aca28bd6899b174de102fb247cdfc02293987]

... etc ...

scryfall-loader | loaded /srv/scryfall/PLS/cards1.json [c22b1a0a04a7c7f7b2798797b39806975cf90f0b]
scryfall-loader | loaded /srv/scryfall/PRES/cards1.json [1360ed23e42057ca4e993cd417b737a1fcc77926]
scryfall-loader exited with code 0

You can validate it by pointing your favorite SQL workbench tool at 127.0.0.1:5432.

pgAdmin 4, connected to localhost on port 5432
pgAdmin 4 – A SQL workbench program for PostgreSQL databases

Landing the Semi-Structured JSON Data in the Database

When the Docker machinery spins up our RDBMS, it has a little bit of extra magic to get the JSON blobs from Scryfall into a table, scryfall.landed_json_blobs, with minimal fuss.

The table consists of four fields:

  1. The name of the file that the JSON blob came from
  2. A SHA-1 checksum of the contents of the file
  3. The JSON blob itself, as a JSONB field
  4. The date and time the blob was landed into the table

The Perl script that does the loading is barebones, but it does the job. You can read all 39 lines of it, if you like.

That checksum field in particular is handy because it keeps the loader from loading the same file multiple times. If a given file hasn’t been modified since the last time the loader ran, it will be skipped. This makes are loader marginally idempotent. We can blindly “load” all of the files, as many times as we want, and not worry about duplicates.

One thing the loader does not do is clear out this table when it runs. That’s by design, since we may want to additively load some subset of the full card pool. Wizards of the Coast release about four new sets of cards a year, so we’ll be iteratively building up our database over time.

Since we’re not clearing out the table, and we’re only idempotent for exact file matches–checksums are quite finicky after all–we have to deal with one final wrinkle in our landing process: what happens if Scryfall changes historical files?

This is not just possible, but it has happened many times. As the game evolves and cards get more complicated, Scryfall has to amend the format of their database. In 2011, Wizards introduced the world to dual-faced cards of the Innistrad block. Instead of having a normal card back, these cards had names and art and rules and such on both sides of the cardboard. Scryfall had to modify their card object structure to account for this.

The easiest way to de-duplicate the landed blobs is with a view, and we’re going to use dbt to manage that view and its lifecycle.

Note: you can start up a new dbt project with the dbt init command, and I’d normally recommend that wholeheartedly. However, in the interest of editorial continuity, I’m going to build out the files and directories by hand.

Let’s start with a blank(-ish) dbt project:

$ mkdir ~/code/scryfall
$ cd ~/code/scryfall
$ mkdir -p {models,tests,seeds}/staging
$ vim dbt_project.yml

When I say vim, please substitute your editor of choice, be that emacs, vscode, Notepad, whatever. Here’s what you’re going to put in your dbt_project.yml file:

name: scryfall
version: 1.0.0
config-version: 2
profile: localdev

model-paths: [models]
test-paths:  [tests]
seed-paths:  [seeds]

models:
  scryfall:
    staging:
      +schema: stg

seeds:
  scryfall:
    staging:
      +schema: stg

That will keep our staging database objects in a schema called something like some_prefix_stg — that’s going to be critical as we invite other data engineers to experiment with our transformations.

Next, we need to define that localdev profile in our ~/.dbt/profiles.yml file. It should look like this:

# note: there may be other stuff in this file; that’s okay
localdev:
  target: dev
  outputs:
    dev:
      type: postgres
      threads: 4
      host: 127.0.0.1
      port: 5432
      user: postgres
      pass: explore!
      dbname: explore
      schema: our

Now, the fun part: the view that will de-duplicate our landed blobs:

with
partitioned as (
  select
    sha1sum,
    row_number() over(partition by sha1sum
                      order by loaded_at desc
                     ) as row_number
   from scryfall.landed_json_blobs
),

most_recent as (
  select sha1sum
  from partitioned
  where row_number = 1
)

select blobs.*
  from scryfall.landed_json_blobs blobs,
       most_recent
 where blobs.sha1sum = most_recent.sha1sum

You can run that in your SQL tool of choice and it should return 378 rows.

pgAdmin showing the results of the de-duplicating query
Here’s what I see when I run that query

We’ll use this result set as the basis for the rest of our cleaning, validation, and transformation pipeline, so we will want to keep it around. To do that, we can drop the SQL query, unmodified, into the file models/staging/scryfall_data.sql, and issue a dbt run:

→ dbt run
15:53:12  Running with dbt=1.1.0
15:53:12  Partial parse save file not found. Starting full parse.
15:53:13  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- seeds.scryfall.staging

15:53:13  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
15:53:13
15:53:13  Concurrency: 4 threads (target='dev')
15:53:13
15:53:13  1 of 1 START view model our_stg.scryfall_data .................................. [RUN]
15:53:13  1 of 1 OK created view model our_stg.scryfall_data ............................. [CREATE VIEW in 0.14s]
15:53:13
15:53:13  Finished running 1 view model in 0.42s.
15:53:13
15:53:13  Completed successfully
15:53:13
15:53:13  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

It is safe to ignore that WARNING about seeds.scryfall.staging configuration not being used yet. We’ll get to that later in the series.

Once this completes successfully, we can see a new view in the database, under the our_stg schema:

pgAdmin showing our brand new schema and view
Note: you may have to right-click on the explore database and select “Refresh” to get the new schema to appear

That’s it. Today, we’ve gotten our hands on a real, complicated JSON data set, we’ve managed to set up a Docker environment for exploring that data, and used dbt to set up a view that de-duplicates landed JSON blobs, with a preference for only the freshest of data. That’s impressive!

Next time, we’ll start digging into those JSON blobs, using the full power and awesomeness of the built-in PostgreSQL JSONB functions and operators. We’ll learn a little about the game, its cards, and we’ll get build our first staging table to house cleaned card data.

If You Liked That, You’ll Love These:

Building a new kind of consultancy

Building a new kind of consultancy

At Vivanti, we’re building a new type of cloud consultancy. One based on trust and empowerment. We are looking for savvy, technical people who want to forge their own path in the industry.