Sometimes, we get data in its natural, normalized form. More frequently, however, we get denormalized data, where some attribute or entity that exists in its own right has been embedded in related data. When modeling for enterprise concepts (nouns, not verbs), we will often want to renormalize.

In our case, the Scryfall data has been denormalized, and we need to renormalize it. That’s what we’re going to do today, now that we have more or less finished modeling our card data.

Extracting Set Data

Magic cards are distributed in named sets. Sets are usually centered around a particular theme, event, or fictional plane of existence. For example, later this year, Wizards is releasing the Dominaria United set, which will take place on the plane of Dominaria and explore some of the lore and history of one of the game’s oldest environs.

Sets intersect with the legality stuff we dealt with in the last installment. Standard-format Magic is defined by the rotating roster of sets that are Standard-legal. Some sets, like the Modern Horizons sets (MH1 and MH2), are explicitly not Standard-legal; they go directly into the Modern card pool without disrupting Standard tournament play.
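How many sets are we talking about? Since our raw_cards model (from earlier in this series) stores each card as a card jsonb value, a quick exploratory count gives us the lay of the land. This is just a sketch for poking around, not part of the project:

select
  count(distinct card->>'set') as set_count
from
  {{ ref('raw_cards') }}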

Excluding supplemental and promotional sets, there are over 140 different sets. All of that information is embedded in our card details, and we’d like to extract those bits of data and create a new model. We’ll start by adding the set attributes to our staged_cards model.

select
  -- all the other card fields ...
  card->>'set' as set_code
from
  {{ ref('raw_cards') }}

We can also extract the set data directly from our input table, in a new model:

select distinct
  card->>'set'                  as code,
  card->>'set_name'             as name,
  card->>'set_type'             as type,
  (card->>'released_at')::date  as release_date
from
  {{ ref('raw_cards') }}

We’ll put this SQL in models/staging/staged_sets.sql.
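To see the renormalization pay off, here’s the kind of join this enables downstream. This is a sketch: it assumes staged_cards exposes a name column alongside the set_code we just added.

select
  c.name         as card_name,
  s.name         as set_name,
  s.release_date
from
  {{ ref('staged_cards') }} c
  join {{ ref('staged_sets') }} s on s.code = c.set_code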

Behind the scenes, we’re fairly certain that Scryfall keeps its set data in third normal form, and is expanding it out for us in our extract. Indeed, their documentation supports this theory. However, we can’t just take that at face value: it may be true today, but who knows what tomorrow brings.

That means we’re back to writing tests.

Our select query for the staged_sets model is built on the premise that all set attributes specified in the card JSON data are identical for the same set code. That is, MIR is always the Mirage expansion set, and every card in MIR was released on October 8th, 1996. We can write a simple data test (a standalone query in tests/ that fails if it ever returns rows) to verify that:

with duplicates as (
  select code
    from {{ ref('staged_sets') }}
   group by 1 having count(*) > 1
)
select code, name, type, release_date
  from {{ ref('staged_sets') }}
 where code in (select code from duplicates)

Surprisingly, this kicks out more than a few duplicate sets! Clearly, our assumptions were wrong. Let’s go through them.

(Figure: results stored from our failed 'dbt test' run, showing which set codes had multiple distinct release dates.)
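If you’d rather reproduce that finding than squint at a screenshot, a diagnostic query along these lines (a sketch, not part of the project) shows which attribute is actually drifting for each offending code:

select
  code,
  count(distinct name)         as distinct_names,
  count(distinct type)         as distinct_types,
  count(distinct release_date) as distinct_release_dates
from
  {{ ref('staged_sets') }}
group by 1
having count(*) > 1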

We’re dealing with four problematic sets. Two of them (ALA and IKO) are statistical outliers: if you look at the data, only one card in each has a different (later) release date. I think we can go with the majority opinion on those two.

The other two are very peculiar sets. With Mythic Edition (MED), Wizards started courting the higher end of their player base with product specifically designed to be expensive. The first Mythic Edition was sold as a box of 24 sealed draft boosters assembled from the Guilds of Ravnica set (GRN), alongside 8 new, mythic-rarity (i.e. incredibly rare) cards that were exclusive to the MED set. That box MSRP’d at ~$250, a steep premium over normal draft booster boxes, which retailed for ~$150 and included 30% more cards.

They ran that experiment twice more, once each for the Ravnica Allegiance (RNA) and War of the Spark (WAR) sets. Each time, the eight additional cards were also marked with the Mythic Edition set symbol. In effect, Wizards released a single 24-card set in three installments, on three different dates. I think we can just pick the earliest date and be done with it.

(Wizards concluded the experiment, and announced that all future such “premium product” would go through a separate vehicle known as “Secret Lair”, so we shouldn’t see those in mainstream card sets in the future.)

That leaves PRES, a rolling catch-all set (much like Mythic Edition) that was first launched in 2007. That’s what Scryfall says, at least.
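If you’re curious, you can eyeball PRES’s history with a quick one-off query (again, just a sketch):

select distinct
  (card->>'released_at')::date as release_date
from
  {{ ref('raw_cards') }}
where
  card->>'set' = 'pres'
order by
  release_date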

In all four of these specific cases, we can note the oddity, and then use the earliest date on record. We can do that by separating the well-behaved sets from the problematic ones, providing the special-case logic, and then combining them back in with a union operation.

Here’s our newly improved staged_sets model query:

select distinct
  card->>'set'                  as code,
  card->>'set_name'             as name,
  card->>'set_type'             as type,
  (card->>'released_at')::date  as release_date
from
  {{ ref('raw_cards') }}
where
  card->>'set' not in (
    'med',  -- 3 Mythic Edition sets treated as one
    'pres', -- Rolling "catch-all" set for presales
    'iko',  -- Zilortha was re-released later
    'ala'   -- Rafiq was re-released later
  )

union

select
  card->>'set'                     as code,
  card->>'set_name'                as name,
  card->>'set_type'                as type,
  min((card->>'released_at')::date)  as release_date
from
  {{ ref('raw_cards') }}
where
  card->>'set' in (
    'med',  -- 3 Mythic Edition sets treated as one
    'pres', -- Rolling "catch-all" set for presales
    'iko',  -- Zilortha was re-released later
    'ala'   -- Rafiq was re-released later
  )
group by 1, 2, 3
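One nit before we move on: the special-case list is now spelled out twice, and the two copies can drift apart. If that bothers you, dbt’s Jinja layer lets us define the list once at the top of the model. Here’s a sketch of one way to do it, using a hypothetical odd_sets variable:

{% set odd_sets = "('med', 'pres', 'iko', 'ala')" %}

select distinct
  card->>'set'                  as code,
  card->>'set_name'             as name,
  card->>'set_type'             as type,
  (card->>'released_at')::date  as release_date
from
  {{ ref('raw_cards') }}
where
  card->>'set' not in {{ odd_sets }}

union

select
  card->>'set'                       as code,
  card->>'set_name'                  as name,
  card->>'set_type'                  as type,
  min((card->>'released_at')::date)  as release_date
from
  {{ ref('raw_cards') }}
where
  card->>'set' in {{ odd_sets }}
group by 1, 2, 3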

Now, when we run our test, we pass with flying colors! We’ve successfully handled the known special cases today, without leaving the door open for misapplication of the fix to future problems. Win!

All things considered, we’ve done a pretty thorough workup on this data set, learned a bit about the possibilities of JSON embedded in PostgreSQL (although you can do similar things in Snowflake, too), and used most of dbt’s features to get there. What’s next? That’s up to you; what data set will you go spelunking in?

In the next (and final!) essay in this series, we’re going to look at our final model, cards, and talk about some of the more interesting aspects of it from a query-writer’s perspective: why we have the columns we do, why they’re named the way they are, and so on.
