Transform Your Data with dbt and SQL

Whenever you work with data, use SQL, as Codd intended.

Data is rarely in the form you need it. Dates come in far too many formats, codes differ between systems, and handling people’s identities is hard. Even if all of your data arrived in your warehouse in nice, clean, and well-behaved form, your work would not yet be done.  You’d still need to tease apart facts to get the right grain. You’d still need to extract the dimensions you want.

Besides, even with the cleanest data you can reasonably ask for, you’re going to have to do some transformation work. And for that, you need SQL, augmented with the power of dbt.

What is dbt?

Agile DevOps methodology comes to data engineering

dbt is the Swiss-Army chainsaw of data manipulation and transformation. It puts tools and practices from the world of Agile Software Engineering into the hands of the data engineer. With dbt, we can start with nothing but the raw data and perform every other transformation we need. At the end, we get a nice, clean fact/dimensional model fit for an analyst armed with a BI tool.

In Vivanti’s opinionated architecture for data warehousing, dbt takes a front and center role as the primary engine of change.

We’ll be at Coalesce 2022!

dbt Labs is hosting Coalesce 2022 – the analytics engineering conference – in New Orleans, Louisiana, on October 17-21. Join Vivanti at the Hyatt Regency for a full week of conference talks, hallway tracks, swag grabs, and expo floor conversations.

Are you going to be at Coalesce? Fill out the form below and we’ll meet up for coffee (or tea, if you prefer) to talk shop: discuss what you’re doing with dbt, chat about challenges and successes, and generally nerd out about SQL transformation.

The great thing about dbt is…

Well, actually, there are five really great things vying for the top spot:

It is (just) SQL with templating

SQL is the language of data. The fact that we wasted decades pretending otherwise – concocting strange ETL mini-languages, XML pipeline formats, and a whole swath of tools around avoiding SQL – is regrettable. But now, armed with nothing but a Python interpreter (which we’re sure your data team already has) and dbt, we can get back to manipulating data with SQL, as Codd intended.

dbt is more than just SQL, though! With the embedded Jinja templating language, analytics engineers can write conditional statements and looping constructs. With Jinja, we can build variable-column models, package up reusable SQL in dbt macros, and effectively extend SQL as we see fit. dbt brings us all of this without ever having to leave SQL behind.
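
To make that concrete, here is a minimal sketch of a dbt model that uses a Jinja loop to pivot payments into one column per payment method. The payment_methods list and the stg_payments model are hypothetical stand-ins for your own project:

    -- models/payments_by_method.sql (hypothetical model and column names)
    {% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

    select
        order_id,
        {% for method in payment_methods %}
        -- One aggregated column per entry in the payment_methods list
        sum(case when payment_method = '{{ method }}' then amount else 0 end)
            as {{ method }}_amount{% if not loop.last %},{% endif %}
        {% endfor %}
    from {{ ref('stg_payments') }}
    group by order_id

dbt compiles the template down to plain SQL before it ever reaches the warehouse, so the templating never takes you out of the SQL world.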

Collaboration

Teamwork makes the dream work. At least, that’s what Stephanie in sales always tells us. With dbt and a version control system like Git, data engineering teams can collaborate seamlessly, and data pipeline development proceeds more like a software engineering project than ever before.

From feature branches to pull requests – from code review to merge – dbt gives data engineers the tools they need to adapt to quickly changing business demands.

Collaboration is as much about not getting in each other’s way as it is about working together.  Through clever use of schemas and naming conventions, multiple data engineers can develop different features in parallel, without stepping on each other’s toes. With dbt, nothing stands in the way of data pipeline experimentation.

Tests, tests, and more tests

Transformation is, at its core, about making a call.  Executive decisions.  How do we map these codes and identifiers from one system to another?  What’s the ‘canonical’ form of a street address? What is the source of truth for customer demographic data? Analytics engineers make decisions like these regularly, but they never do so in a vacuum.  There’s always a premise and a set of assumptions underlying each and every data modification.

Wouldn’t it be great to surface those assumptions, and validate them with each and every run of the data pipeline? That’s precisely what dbt tests do. Is customer status always going to be either ‘active’ or ‘inactive’? Write a test! Sales figures from nightly batch loads should never be negative, right? Write a test!
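
A ‘singular’ dbt test, for example, is nothing more than a select statement that returns the rows violating your assumption – if any rows come back, the test fails. Here is a minimal sketch, with hypothetical model and column names (fct_sales, sale_amount):

    -- tests/assert_no_negative_sales.sql (hypothetical names)
    -- dbt flags this test as failed if the query returns any rows.
    select *
    from {{ ref('fct_sales') }}
    where sale_amount < 0

Simpler checks – like the ‘active’ / ‘inactive’ status above – can be declared as built-in generic tests (accepted_values, not_null, unique) in a model’s YAML file, without writing any SQL at all.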

Data pipelines with integrated tests catch more silly warehousing errors than pipelines without.  dbt makes testing so effortless you almost have to go out of your way to not test.

Transparency & repeatability

The biggest problem with ETL tools is that they don’t think in SQL.  The second biggest problem is that they run business logic outside the confines of the database system, with little oversight and almost no visibility into what they have done and why.

This leads almost effortlessly into the next big benefit: repeatability.  In the bad old days, a failed ETL run had to be reprocessed from the transferred files, and if you were careful, you could usually find and drop the bad data the ETL tool had inserted at the culmination of its run.

In dbt, on the other hand, we model transformations as a set of SQL ‘select’ statements that iteratively refine the model.  Sure, some of them are complicated and scary – we once wrote a 5,000-line case statement – but they are non-destructive.  Data analysts and engineers can always trace back through the lineage of transformations to determine where a particular problem emerged. Was it source data? A transformation built on an assumption that is no longer valid?

With dbt’s modeling ethos, we can easily find out.
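
As an illustration, each refinement step can be as simple as the sketch below: a non-destructive select that reads the previous model through ref(), which is exactly what makes the lineage traceable. The model and column names are hypothetical, and the qualify clause assumes a Snowflake-style warehouse:

    -- models/int_orders_deduplicated.sql (hypothetical)
    -- Keeps only the most recently loaded record per order. Nothing upstream
    -- is modified, so earlier steps stay available for inspection.
    select *
    from {{ ref('stg_orders') }}
    qualify row_number() over (
        partition by order_id
        order by loaded_at desc
    ) = 1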

It even writes the documentation for you!

Producing quality documentation that is both accessible and accurate is hard.  Having to update cross-references, maintain model dependencies, and track columnar data types makes it all the more difficult.  dbt’s ability to generate an entire static site, based on the dbt code you write, is huge. The fact that you can easily host that site on Amazon S3 or Google Cloud Storage is also nice.

As far as a bare bones data dictionary goes, what dbt gives you out of the box with little effort is supremely usable.  What’s more, it only takes a little bit of effort to provide overarching architectural notes on the why of a particular model.

Data Modeling

Turning data → information → knowledge [→ insight]

Modeling is the process of taking raw data – put into the warehouse as-is by a tool like Fivetran – and incrementally cleaning, contextualizing, categorizing, and classifying it.  Modeling turns data into information and, ultimately, knowledge.  At Vivanti, we use a zonal discipline when we model data.

Landing

Data is brought into the warehouse with minimal modification, often as JSON (semi-structured) or CSVs (structured).  We use Fivetran for this.

Staging

Landed data is tagged with data lineage markers and staged, away from the landing zone, allowing other load processes to resume.

Historicization

Staged data is filtered and only new data is brought into the pre-modeling phase.  This is often referred to as “change data capture”.

Cleaning

Textual data gets scrubbed, date formats are normalized, and values are homogenized.  Consistency is enforced at this stage, field by field.

Modeling

Data from multiple systems is reconciled and normalized.  It is then aggregated and split back out to form the fact / dimension tables and views.

Some modeling is mechanical and completely disconnected from the nature of the data – the staging and historicization zones barely care what the data actually describes.  Once we get to the cleaning zone, however, it takes a deep understanding of the underpinning business processes to model properly.
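
For a flavor of what a cleaning-zone model might look like, here is a hedged sketch – the model and column names are hypothetical, and try_to_date assumes a Snowflake warehouse:

    -- models/cleaning/cln_customers.sql (hypothetical)
    select
        customer_id,
        -- Homogenize values and normalize formats, one field at a time
        upper(trim(country_code))                  as country_code,
        try_to_date(signup_date_raw, 'YYYY-MM-DD') as signup_date,
        case
            when lower(status_raw) in ('active', 'a', '1') then 'active'
            else 'inactive'
        end                                        as status
    from {{ ref('stg_customers') }}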

Vivanti leverages dbt to model with confidence, and deploy with certainty.

We model every zone from staging onward with dbt.  This gives you all the benefits of dbt – testing, collaboration, reproducibility – anywhere and everywhere you’re changing data into information.  We model with confidence.

At deployment, we use dbt, combined with a version control system and a continuous integration / continuous delivery (CI/CD) pipeline tool, to build out environments.  Version control, like Git, keeps track of all the different versions of our dbt code and provides distribution via ‘git clone.’  CI/CD – whether you use Azure DevOps, Jenkins, GitLab Runner, or GitHub Actions – handles the repetitive execution of the dbt code with fidelity and single-mindedness.

Every change to the data pipeline is first run in a dev environment, by the analytics engineer responsible for shepherding the change through to production.  The cycle of edit→compile→test is extremely tight.  This helps work out the bugs and kinks quickly, before involving peers and test environments.

Next up is the test environment, which mimics production data and system configuration as far as it makes sense.  We’ll never reproduce production with 100% fidelity, but we can get close.  Tests are carried out automatically as code changes are pushed to the central git repository, and the results of these tests inform code review and merge processes.

Once the code is merged, it gets deployed to production.  This continuous delivery of new features ensures that teams don’t get bogged down dealing with large and unruly changesets.  Issues are caught sooner, while the development process is still fresh in the mind of the analytics engineer who can best fix them.

dbt enables large-scale Agile and DevOps practices for the data world.

Leave your ETL tools behind

Instead, transform your data inside the warehouse, using SQL and dbt

ETL had a good run, but now we’ve got cheap storage on an unprecedented scale, and commodity compute resources at our fingertips.  Database engines have advanced to the point where we don’t need the likes of Informatica, Talend, or SSIS.

Snowflake is more than capable of grinding through our transformative SQL queries, conjuring up new storage as needed and charging us only for the usage.  With incremental materialization, even large, billion-row tables can be managed cost-effectively through a dbt model.
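
A quick sketch of what that looks like in practice – stg_events, event_id, and event_at are hypothetical names:

    -- models/fct_events.sql (hypothetical)
    {{ config(materialized='incremental', unique_key='event_id') }}

    select
        event_id,
        user_id,
        event_at,
        page_url
    from {{ ref('stg_events') }}
    {% if is_incremental() %}
    -- On incremental runs, only process rows newer than what's already loaded
    where event_at > (select max(event_at) from {{ this }})
    {% endif %}

The first run builds the full table; every run after that scans and merges only the new slice.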

Vivanti can help you move off of your current ETL toolchain, porting your transformations over to more holistic, transparent, and reproducible dbt pipelines.  Wherever your data is going, we can get it there with dbt.

What do we find when people drop ETL in favor of ELT and dbt?

Speed

For starters, pipelines run faster.  dbt is a lightweight SQL compiler, so nearly all of a model’s execution time is spent actually running SQL in the warehouse.  Every ETL tool is ultimately bound by that same SQL execution floor, but heavier tools pile their own processing on top of it – which is why we find that dbt pipelines are just quicker.

Flexibility

ETL is a commitment.  Since most ETL tools discard their inputs and working state after every run, you don’t have many opportunities to “get it right.” With ELT, you keep everything, so you can make small tweaks and adjustments as needed.

Ownership

Analytics engineers absolutely love dbt: it’s powerful enough to be useful, but not intrusive enough to be annoying.  We find people feel more connected to their dbt transformations, because when you own something, you take greater care of it.

Targeted Proof-of-Concept

Sometimes, you just need to show that the whole thing works as advertised.

The promises of the cloud data platform – especially enabled by the power of dbt – are compelling. But it can be difficult to see just how you get from where you are today to where you want to go. With a Vivanti targeted Proof-of-Concept, we present a time-boxed, scope- and cost-controlled engagement. The focus? Delivering an executive business use case for a larger, more comprehensive, and more impactful project.

If you’re unsure about how well dbt will transform your data operations, get in touch and we can discuss a targeted PoC to help reduce that doubt.

We’ll run your data pipelines

Free up your team to focus on driving innovation and uncovering new insight.

If you’re not big on internal software development and delivery, chances are you don’t just have a continuous integration / continuous delivery workflow engine sitting around.  That’s ok.  We have years of experience owning and operating CI/CD solutions, from Jenkins and Concourse to GitHub Actions and GitLab runners.

We can run the platform for your data pipelines, and tend to the care and feeding of the pipeline jobs themselves, freeing you up to allocate effort and attention where it needs to be: solving new business problems.

Our managed data pipeline offering includes everything from deployment and configuration to monitoring and optimization.

We watch:

    • Job success / failure
      When pipelines fail, we figure out why and fix the problem.
    • Job freshness
      We expect data pipelines to run continuously, or at least regularly.
      When that doesn’t happen, we get notified so we can correct it.
    • Job health metrics
      We gauge the health of pipeline jobs from multiple perspectives: everything from execution lag (is this pipeline taking longer than normal?) to recovery time (how long, on average, does it take for a job to succeed after failing?) to mean time between failures.

Using metrics collected from job runs, we can optimize the slowest parts of your data pipeline, speeding up ingestion, transformation, and modeling efforts.

We are cloud automation junkies

Automation is in our bones. We’ve automated everything – from code review tests, to entire infrastructure deployments and platform scale-outs – managing thousands of machines and hundreds of thousands of applications.

Your data pipelines are in expert hands, under watchful eyes.