Welcome to another edition of “Faith takes you on the ride while she learns how to do her job.” Today, we learn about a handful of dbt’s model governance features.
Working for a high profile company like dbt Labs means I have to stay abreast of the product’s updates and functionalities, and learn how to actually use them. This is a pretty daunting task, because dbt ships a LOT. But generally, I am trying to be more of a “next right thing” type of person instead of a “holy shit I can’t believe how much I have to learn to be good at this job” sort of person.
The next right thing these last few weeks has been learning how to use dbt’s model governance features so that I could teach them to one of my training groups.
First off though—why bother with any of these model governance features? Who gives a shit, why can’t I just export my Looker dashboard to a csv and mess around with it from there? Well, you could definitely do that. You could also encourage everyone in your org to do just the same and then boom, you have self service data culture! Right guys? Guys?
You know I’m being facetious at this point. “Self-service data” is a poorly defined buzzphrase anyways, and a free-for-all on your company’s data is probably going to put you in frustrating situations like the ones I described in my semantic layer for dummies post. Realistically, there are a number of reasons why you might want to restrict access to (or the ability to build on top of) some of your dbt models.
Why might I want to restrict model access?
dbt encourages a modular, iterative workflow in data. If this type of workflow is your vibe, then you are probably used to building things like staging models that take care of data cleanup tasks (renaming columns, cleaning up text, concatenating fields, this kind of thing). You’re also used to doing stuff like having intermediate models whose transformations and calculations get used as a building block elsewhere in your project.
If someone is new to your data ecosystem (or maybe they just don’t work in it much!), they might not really understand the difference between int_orders and fct_orders. Sure, sure, the difference may seem obvious to you, but can we please mature past assuming shit that is obvious to you is obvious to other people? If it’s your goal to get folks in your org to interact with your data on their own more often, it’s not a bad idea to give them guardrails. That can look like preventing them from ref’ing intermediate models when they should be ref’ing the final fact table, or whatever mature model you are okay with someone building on top of.
Model access is not the same as database permissions. dbt can facilitate database permissions with the grants config if you want help with that. Managing model access is more about putting guardrails around what people are allowed to build off of either within or between dbt projects.
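For contrast, here’s roughly what the database-permissions side looks like. This is a minimal sketch of the grants config in a model’s YAML. The reporter role is a made-up name, so swap in whatever roles actually exist in your warehouse.

```yaml
models:
  - name: fct_orders
    config:
      grants:
        # 'reporter' is a hypothetical warehouse role, used here only for illustration
        select: ['reporter']
```

That controls who can query the built table in the warehouse. It says nothing about which dbt models are allowed to ref fct_orders, which is the part that groups and access handle.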
How can I restrict model access?
Wanna take a wild guess? Why yes, you read my mind, it is in fact going to be YAML!!! Our BFF!!!
Look, I don’t like YAML much myself. But the good news is, you truly never have to write YAML from scratch, and frankly, you never should. It’s a rookie mistake to try. And getting model groups (which enable model ownership & access controls) set up is really not so bad. It comes in just 3 easy steps.
Declare groups
Groups are simple to declare: they go in a .yml file underneath a groups key. Check out this documentation if you want to try it. Then, swear to me on all you hold dear that you will just copy and paste the code snippet in the docs. Then delete the stuff you don’t need and fill in the stuff you do need. Please don’t be a hero and try writing YAML from scratch.
As far as where you put your group declarations, that can be in any YAML file as long as it’s in your models/ folder. Where you put this groups declaration should probably match your overall folder organization. For example, if you have business domain-level folders underneath your models folder, it would not be a bad idea to have a groups.yml file directly underneath your models folder and define groups to match with your business domain-level folders.
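If you want a picture of what you’re aiming for, here’s a minimal sketch of a groups file that matches two business-domain folders. The file name, group names, and owner details are all placeholders I made up, so treat it as a shape to copy rather than gospel.

```yaml
# models/groups.yml (the file name is only a suggestion)
groups:
  - name: finance
    owner:
      # placeholder owner details
      name: Finance Data Team
      email: finance-data@example.com
  - name: ops
    owner:
      name: Ops Data Team
      email: ops-data@example.com
```

Each group gets a name and an owner block. The group name is what you’ll point models at later.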
When you define a group in dbt, you can also define group owners! That’s as easy as adding an owner: key underneath the group’s name, and then nested under that, the name and email of the owner. This bit I really like, because I find data operations go a lot smoother if you have clear subject owners of each piece of your data. That way, no one is expected to know everything, but they are expected to be experts in the data that they own.
Apply groups
This is even easier than declaring groups. Have some docs, as a treat.
Remember materializations? With materializations, you can assign entire folders a materialization, assign a model’s materialization in its YAML, or in the model’s SQL file itself. Model groups play the same way! If your folder structure matches the group structure you want to use, I would just apply groups at a folder level in your dbt_project.yml file. It looks like this:
models:
  marts:
    finance:
      +group: finance
Additionally, just like materializations, the most specific config is always going to win. If you have a fct_orders model in your finance folder and you assigned the models in the finance folder the finance group, then fct_orders will be in the finance group. But, if you then go into the fct_orders YAML and assign it the ops group, it will now be in the ops group.
dbt is a lot like humor. Greater specificity pretty much always wins.
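Here’s a sketch of that model-level override, assuming the folder-level finance assignment from above is already in your dbt_project.yml and that an ops group has been declared somewhere in your project. The file name is hypothetical.

```yaml
# models/marts/finance/_finance_models.yml (hypothetical file name)
models:
  - name: fct_orders
    config:
      # more specific than the folder-level +group: finance, so this one wins
      group: ops
```

Every other model in the finance folder stays in the finance group. Only fct_orders moves.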
Modify access if you need to
When you assign models to groups in dbt, they keep dbt’s default access setting of “protected”. In dbt-ese, “protected” means that other models in the same project can ref the model, no matter which group they belong to. For many of your models, you will definitely want this to be true. However, you may want to prevent people from building on top of a foundational set of models when a more mature dimensional model is available. In that case, you’ll want to use private access instead. That way, only models in the same group as the private model can ref it.
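As a rough sketch, locking down an intermediate model could look like the following. I’m assuming a finance group already exists, and I’m putting access under config. Depending on your dbt version, access may need to sit as a top-level key next to name instead, so check the docs for the version you’re on.

```yaml
models:
  - name: int_orders
    config:
      group: finance
      # only models that are also in the finance group can ref int_orders
      access: private
```

With that in place, a model outside the finance group that tries to ref int_orders should fail when dbt parses the project, which is exactly the guardrail you were after.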
Model access may often track with group assignment, but you can definitely mix access levels within a group. If your models are mature and ready to be shared with other groups or projects, they should be protected or public. But if you’re working on a new model, it’s probably a good idea to set its access to private to prevent any other project resources from relying on it (unless you intend for them to do so).
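To make the mixing concrete, here’s a sketch of one group holding a mature model and a work-in-progress one. The fct_orders_wip model is a hypothetical name for illustration, and the same caveat applies about where the access key lives in your dbt version.

```yaml
models:
  - name: fct_orders
    config:
      group: finance
      # mature and ready to share beyond this group and project
      access: public
  - name: fct_orders_wip  # hypothetical in-progress model
    config:
      group: finance
      # keep it inside the finance group until it's ready
      access: private
```

Same group, same owner, two different answers to “can I build on this?”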
It’s wild that I’ve already spent nearly 1300 words and I only got through model groups and model access. The important things to remember are:
Groups are collections of nodes in a DAG.
Groups can be assigned an owner, and you can specify the owner’s name, email, slack, and github.
Assigning models to groups is done very similarly to materializations: either do it in your dbt_project.yml, a model’s YAML, or a model’s SQL file.
After you’ve assigned models to a group, you can change their access depending on what you want. The default is protected, which means all models in the same project can ref the protected model.
Tune in for the next dummies post coming soon. If you also stick around for my non-dbt related writing, I appreciate you. It doesn’t always feel right to promo those posts on LinkedIn, so I am grateful for those of you that read them anyways.
Subscribe, tell your friends, you know the drill. Thanks for reading.