Shoutout to Jenna Jordan for this blog idea.
If you haven’t watched the Microsoft Excel world championships, it rules. I’m a spreadsheet enjoyer to my core. In the stats class I took in grad school, I found myself using the tried-and-true and janky-as-hell version control method of naming each of my worksheet iterations “analysis_final”, “analysis_final_final”, “analysis_final_v2” and so on until we reached the promised land of “analysis_final_you_dumb_bitch_do_not_edit_this_again”.
It’s nice to see echoes of my spreadsheet heritage when it comes to dbt’s model versions—because in using model versions, you do actually add a “_v2” or whatever number you’re on to the end of the model. It’s a better feature than my janky-ass homegrown version control of statistics homework, but still, it’s nice to see nods to the heritage of many of us analysts.
Join me down the rabbit hole today while we learn model versions—the dbt way.
Join me in scenario land
You belong to the world’s most fabulously bleeding-edge and well-organized data team. When you have new ideas on how to do things a little differently, your boss lets you try them out! Your boss shares your enjoyment of experimentation and of always trying to push the envelope. You’ve set up a well-organized multi-project (or multi-group) dbt architecture. Within this architecture, there are several models that are produced by one group, and consumed by another group in their downstream models.
Since you’re well-versed in best practices around interfaces between multiple groups or projects, you know that any model produced by one group (or project) and consumed by another should be contracted. It’s the polite thing to do. And, you are an avid reader of this here Substack, so you’ve read Model Contracts for Dummies. You know that in a cross-functional data workflow, you need to be considerate of people on other teams using your models, and contracts are one way to be considerate and avoid breaking schema changes.
You’re a busy bee! Some of your tables are a bit more finicky than others, though. In particular, you’ve been keeping an eye on your fct_orders table lately.
fct_orders was built by the core project, and is consumed by the finance project. You made the model’s access level public to allow the finance project to build on top of it. You enforced contracts and constraints on this model, because the finance team depends on it for accurate reporting of sales, and all-important revenue. Things are going well so far, and the finance team hasn’t been surprised by any changes in the model’s schema, or surprised by any sneaky null values where they don’t want them.
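For context, the contracted v1 setup might look something like this in the model’s YAML. This is a hedged sketch using dbt’s contract and constraint syntax; the exact column list beyond what’s discussed in this post is my own assumption:

```yaml
models:
  - name: fct_orders
    access: public               # lets the finance project ref this model
    config:
      materialized: table
      contract: {enforced: true} # schema changes now break the build, loudly
    columns:
      - name: order_id
        data_type: int
        constraints:
          - type: not_null       # no sneaky nulls for the finance team
      - name: order_status
        data_type: varchar
      - name: order_type
        data_type: varchar
```

With the contract enforced, any build whose columns or types don’t match this spec fails before it ever lands in the warehouse.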
Until this week hit. You got some bad news.
Your order management system (y’know—the source system that feeds fct_orders?) is pushing a pretty major software update. It’s not going to record two columns anymore—order_type and order_status. It’s axing those columns and introducing order_progress instead. You can see why they made the change—it’ll allow you to have a more granular view of where your orders are as they are being fulfilled.
But you enforced contracts on your original fct_orders model! If any columns get dropped, the contract will be violated, and the table won’t build. Shit!!
You’re fine—introduce a fresh version of fct_orders.
Here’s how you’re going to avoid a data outage. First—we’re going to assume that updating fct_orders to reflect the 2 columns getting dropped and 1 column getting added is as easy as changing what’s in the select statement of your query. This may not always be the case. But we’re going to assume it’s that easy for now.
We’re also going to assume that this source system change means that order_type and order_status are simply not going to receive any new values. This means that when the system change goes live, order_type and order_status can still exist on fct_orders. They will just be stale, legacy columns. We’ll also assume that the addition of order_progress is a simple column addition, and that without changing our transformation flow, we will simply miss information in order_progress and nothing else will get messed up.
Now that we’ve got our scenario assumptions out of the way, let’s fix this data change.
We’re going to copy all of the SQL involved in creating fct_orders, and paste it into a new SQL file called fct_orders_v2.sql. We’re going to put that SQL file in the same folder as the original fct_orders. We’ll take some time to modify the SQL for fct_orders_v2.
In this case, we’re just going to add _sunsetted as a tag to the order_status and order_type columns, and then add order_progress to our select statement. Before we move on to properly configuring this model version, we’ll run a dbt build and preview our model to make sure that it’s actually playing nice with the rest of the DAG, and that the SQL is returning results we expect.
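A sketch of what fct_orders_v2.sql could look like. The staging model name and the extra columns here are invented for illustration; your real model is surely bigger:

```sql
-- fct_orders_v2.sql (hypothetical shape)
with orders as (
    select * from {{ ref('stg_orders') }}  -- assumed upstream staging model
)

select
    order_id,
    customer_id,
    order_status as order_status_sunsetted,  -- stale legacy column
    order_type as order_type_sunsetted,      -- stale legacy column
    order_progress,                          -- new column from the source update
    order_total
from orders
```

Then something like `dbt build --select fct_orders_v2` confirms it builds alongside the rest of the DAG before you touch any YAML.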
If we’re feeling really fancy, we can even run an audit_helper comparison between our original fct_orders model and fct_orders_v2. Audit helper will definitely return some differences because we added a new column and sunsetted two other ones—but it might be a nice way to make sure we’re not accidentally causing other problems.
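If you want to try that audit_helper spot-check, the package’s compare_relations macro looks roughly like this, assuming audit_helper is already installed via your packages.yml. You’d run it as an analysis or straight from the IDE:

```sql
-- compare_fct_orders_versions.sql
{% set old_relation = ref('fct_orders') %}
{% set new_relation = ref('fct_orders_v2') %}

{{ audit_helper.compare_relations(
    a_relation=old_relation,
    b_relation=new_relation,
    primary_key="order_id"
) }}
```

It returns a row-level summary of what matches between the two relations, which is handy for confirming the only differences are the ones you meant to make.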
Alright, we ran a successful dbt build and now we have a fct_orders_v2 table in our warehouse. If you push fct_orders_v2 to prod, it’ll build right alongside fct_orders; they’ll just have different names. But what about ticking over all downstream usage to the new version?? What happens to the refs to fct_orders downstream? Well—one question at a time. First off, let’s talk about the sunset period for fct_orders.
We do not want to disrupt the finance team this week. The change of the source system came at the same time as month-close. They need some time before they are ready to rejigger their downstream work to fit this schema change in fct_orders. We’re going to use some YAML to enable changes to the ref behavior, and to formalize which version of the model is the most up to date.
Friendship ended with SQL, now YAML is my best friend
We are not going to make a new YAML file for fct_orders_v2. We’re going to head into the original YAML file for fct_orders and add versioning specifications. Take a look at the YAML spec below. I copied it from dbt’s documentation and modified it to fit this scenario. I didn’t write more of a YAML spec for this model beyond the example because Notion doesn’t support codegen and I hate freehanding YAML.
```yaml
models:
  - name: fct_orders
    # Note this!! If you ref fct_orders without specifying the version,
    # it'll resolve to version 1.
    latest_version: 1
    # Note!! This is for the deprecation process below.
    deprecation_date: some_date
    config:
      materialized: table
      contract: {enforced: true}
    columns:
      - name: order_id
        description: This is the primary key
        data_type: int
      - name: order_status
        description: when the order is done
        data_type: varchar
      - name: order_type
        description: what they ordered
        data_type: varchar
    # Declare the versions, and highlight the diffs.
    # This is where we make dbt aware of the presence of two versions!
    versions:
      - v: 1
        # Matches what's above -- nothing more needed
      - v: 2
        # Removed a column -- this is the breaking change!
        columns:
          # This means: use the 'columns' list from above, but exclude
          # order_status and order_type
          - include: all
            exclude: [order_status, order_type]
          # I added order_progress down here because it's specific to v2.
          # This type of example isn't in the docs though so I'm speculating a bit.
          - name: order_progress
            description: better order status column
            data_type: varchar
```
This part is where we need to pay attention to what we name our models. While you’ve been dealing with this change, you named your new version of fct_orders fct_orders_v2. Next, you need to decide when “v2” will be the latest version. This process is gonna involve a bit of manual work from you here, but I promise it’s not too bad.
Here’s a sample deprecation process for fct_orders:
1. In the YAML file above, keep “latest_version” as 1. Don’t change your original fct_orders filename yet.
2. Specify a deprecation_date underneath the model’s name. This will alert people running that model that it’s flagged for deprecation, and it’ll do so in dbt’s logs when they run fct_orders in their own dev environment.
3. Over-communicate with the finance team and let them know how much time they have before you will be introducing the new version of fct_orders. Give them at least 2 weeks, just to be nice.
4. When the deprecation_date arrives, head back to the fct_orders.yml file. Change “latest_version” to “2”. Change the name of your original fct_orders.sql file to fct_orders_v1.sql. Your fct_orders_v2.sql file can stay as is, or you can remove the _v2.
5. Remove the deprecation_date from the fct_orders YAML—you’re now referencing an up-to-date version.
Anyone using fct_orders downstream will not need to change their ref when the deprecation date arrives. As long as you have gone back and changed the latest version to “2”—any references to fct_orders will resolve to that latest version. Folks who love to be specific about it can add a version pin in their ref; check out the docs for how to do that.
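For reference, a version pin in a ref looks like this, using dbt’s versioned-ref syntax. The downstream model name here is hypothetical:

```sql
-- in the finance project's fct_revenue.sql, say
select *
from {{ ref('fct_orders', v=1) }}  -- stay pinned to v1 during the migration window
```

An unversioned `{{ ref('fct_orders') }}` resolves to whatever latest_version says, which is exactly why the switchover step above is just a YAML change.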
Boom! Now you have successfully sunsetted an old model, and provided your downstream users with a migration pathway!
Now, is this a perfectly automated process that lets you avoid having to talk to stakeholders about your model’s versions?? No, my love, we have to talk to the people who use our data. No amount of technology will solve that.
If I have any beef with data professionals it’s that it seems to be a frequent sentiment that enablement and communication are problems to be solved with software, and that’s just never been the case. Developing good relationships with your coworkers is actually going to solve your communication and enablement problems. I really wish my fellow data pros would lose our allergy towards just fucking talking to people.
Anyways. Not to comment-bait, but I’d like to know if any of y’all have tried out model versions and if your deprecation process looks anything like what I’ve outlined above. This is a fairly new feature in dbt Cloud & Core, so I haven’t heard a ton from “the field”, as it were.
Thank you for reading, and for subscribing. If you haven’t subbed and you want my newsletters in your inbox, please subscribe! It feels like a reward to me, so if you want to give me dopamine, smash the subscribe button.
See you next week. I’m off to take a walk to try to shake a brutal cold. Spring arrived on my doorstep and said “do you want suffering?” And apparently I said yes.