I am deeply boring and I love to write things down
Is it boring to say that documentation is my favorite part of working in dbt?
Honestly, probably. Sometimes I look at myself in the mirror and shake my finger and ask “Why are you the way that you are?” And I don’t have a good answer. It’s a complicated set of circumstances that turned me into a killjoy internet clown who just looooveeeeesss a paper trail.
You can’t have too many paper trails in your life. I keep a work journal at the end of most days. It keeps me organized about what I did that day and what I need to get done the next day. My fiancé and I have a family Notion instance where we store documentation around our house and our wedding. I also store my Substack posts in Notion until I’m ready to throw them onto the internet in front of God and everybody.
I label everything in my email inbox. You will not catch me trying to find a flight confirmation last minute (well, most of the time anyways). That shit got labeled and filed away with a date and a destination as soon as I booked the flight.
I also have a folder in my car where I keep records of every maintenance visit my car has ever had. I am in no way smart enough to remember the last time I got an oil change, so the sticker on my windshield and the receipt from my last oil change in the car maintenance folder are my saving grace.
I’ve also had the same spreadsheet for about 10 years now that I use to budget after every paycheck. You cannot convince me that any of these milquetoast budgeting apps will ever top my spreadsheet. I am specifically looking to shade Rocket Money right now. All the fancy UI in the world won’t change the fact that you somehow consistently think me transferring money from checking to savings is the same as spending that money. It’s literally the opposite. If Rocket Money has no haters, I am dead.
Maintaining and managing the sheer amount of detail that you have to as a functioning adult will either devour you or make you a fiendish documenter. I am that fiendish documenter. Look at me. I’m not smart enough to just remember things. I spent most of 2014 on Tumblr. Any brain cells I had went entirely to Tumblr lore. I have to spend my remaining brain energy on writing everything down so I can enjoy this one wild, precious life instead of drowning in The Tasks. God save us from The Tasks!
I’m huge on documenting my life. Imagine my excitement when I learned how baked into the process documentation is when you work in dbt.
This week’s not really going to be that technical. I’m going to point out where documentation should go in your project, sing the praises of the almighty doc block, and dream about some standards in documenting your project that I think should be the law of the land.
The only difference between developing and screwing around is writing it down
I think the Mythbusters are always relevant.
First things first. If you are not at the very least adding one-liner descriptions to all of your models and columns, who hurt you? It’s really not that hard to do. Check out this docs snippet on where to add descriptions in your model YAML files. I also copied and pasted the example YAML from the docs site below here, because I’m so nice.
version: 2

models:
  - name: events
    description: This table contains clickstream events from the marketing website
    columns:
      - name: event_id
        description: This is a unique identifier for the event
        tests:
          - unique
          - not_null
      - name: user_type
        quote: true
        description: The user who performed the event
        tests:
          - not_null
Models and columns both support a description field. You should use it. I promise it will make anyone who onboards to your data stack so goddamn happy. I promise you that your column names are not self-explanatory, even if you think they are. I’m sure you did a great job naming your model and its columns, but please, think of your neighborhood dummies such as myself who have looked at 9,000 columns of data today and just want to know exactly what your user_type is supposed to be cataloging.
“But Faith!” you moan. “I don’t want to write the same description over and over again every time user_type shows up throughout my data warehouse. That’s so annoying and boring.”
Oh, you sweet summer child. How has no one told you about docs blocks?
Docs blocks, my everything
You don’t have to repeat yourself much in dbt. You don’t even have to write a YAML file from scratch most of the time. If you find yourself repeating the same description multiple times, you need to drop that description in a reusable doc block definition, and then call it wherever the repeating column occurs. Here’s a docs link with a snippet you can copy to define your own doc block.
If you’re familiar with Jinja and macros in dbt, you’ll love doc blocks. You can put whatever markdown you’d like in them, and then just call them throughout your project in a description field whenever they need to be used. Defining one looks like this:
{% docs table_events %}
# This table contains clickstream events from the marketing website.
The events in this table are recorded by [Snowplow](<http://github.com/snowplow/snowplow>) and piped into the warehouse on an hourly basis. The following pages of the marketing site are tracked:
- /
- /about
- /team
- /contact-us
{% enddocs %}
Easy, right? Just add the docs and enddocs tags inside of a .md file, and put whatever you want in between the tags. You can use markdown between the tags, too. Then, when you want to call the doc block, you’ll put it in a description field in your YAML files, like this:
version: 2

models:
  - name: events
    description: '{{ doc("table_events") }}'
    columns:
      - name: event_id
        description: This is a unique identifier for the event
        tests:
          - unique
          - not_null
The eagle-eyed among you might notice that I used the dbt docs example of a doc block for a model description instead of a column one. You can use doc blocks for both in exactly the same way. But I bring up the model one intentionally. When you’re building new models in dbt, you should use a doc block to make a long-form model description and use markdown with it. Not only will you have rich documentation around your model, it’ll also be easier to read because you added a markdown description! Science tells us that your stakeholders will actually read your model descriptions if you use some markdown in there.
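For the column case, here’s a rough sketch of what reuse could look like. It assumes you’ve already defined a docs block named user_type in a .md file in your project (same docs and enddocs tags as above, just with a column description between them). The sessions model is made up for illustration; swap in whatever models actually share the column.

```yaml
version: 2

models:
  - name: events
    columns:
      - name: user_type
        # Assumes a docs block named "user_type" exists in a .md file in this project
        description: '{{ doc("user_type") }}'

  # Hypothetical second model that carries the same column
  - name: sessions
    columns:
      - name: user_type
        # Same docs block, so the description is written once and reused
        description: '{{ doc("user_type") }}'
```

Change the text inside the docs block once, and every column that calls it picks up the new description the next time the docs are generated.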
What’s in a good description?
Columns are the easier ones to start with here. A good column description should state the obvious, and point out anything tricksy. I know you think it should be obvious what “event_id” means, but write out the obvious anyways. Are there ever occasions where an event_id might be null, even if you’ve set a data test in dbt to catch that? Do you know why that happens? Maybe add that to your documentation, then.
Maybe something like this: “Unique identifier for a user event on our website. Occasionally null when pipeline XYZ fails. Failed pipeline XYZ fixes fall to johnnyPipelineFixer@company.com.”
Isn’t that nicer than just thinking the column is self-explanatory? Point out the obvious, but let us in on any quirks of the column that you’ve discovered while you’re knee-deep in it. You spilling data secrets in the documentation will always pay dividends.
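If it helps to see it in place, here’s one way that description might sit in a model YAML file. This is only a sketch: it reuses the events model and event_id column from the earlier example, and the pipeline name and email address are the same placeholders as in the sentence above.

```yaml
version: 2

models:
  - name: events
    columns:
      - name: event_id
        # The ">" folds these lines into a single paragraph
        description: >
          Unique identifier for a user event on our website.
          Occasionally null when pipeline XYZ fails.
          Failed pipeline XYZ fixes fall to johnnyPipelineFixer@company.com.
        tests:
          - unique
          # This test still catches the nulls; the description explains why they happen
          - not_null
```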
When it comes to documenting full-blown models, I think the following template is good to follow for marts-ready models.
{% docs my_new_marts_model %}
# Description and Motivation
In this section, I take a few sentences to describe what the model is doing, what question(s) it's trying to answer, and why I built it.
# Known Limitations
Here, I'll write (if applicable) any limitations I know this model has.
No model will be perfect because our underlying data rarely is perfect.
In known limitations, I'll note any edge cases that could throw a user off,
and I'll note any particular operation I would avoid using this model for.
# Business Stakeholder
Here, I'll list the business stakeholder's name and email address. This should be the person who will actually be using this model's data. Maybe it's to make decisions, maybe it's to monitor progress.
Either way, the business stakeholder should be the person with the most
business context around this model.
# Technical Stakeholder
This is usually the same person who built the model. Include their name and email address.
This is the person who knows the most about the model and can answer technical questions about it.
{% enddocs %}
I would just template this on my net-new models at the marts layer and fill it out each time I built a new model. I don’t really know if it was earth-shatteringly helpful to my teammates (I hope it was), but it was definitely that helpful to me. When I had to go back to these models, I had an easy reference to put myself back in the shoes of my old self when I wrote the model in the first place.
It’s good for your brain to document things well. It forces you to think slowly and deeply about what you’re doing and why you’re doing it, and to articulate that in a concise description. I know we can’t solve the whole “everything’s an MVP, throw it together and as long as it works it’s fine if it’s a mess” attitude that’s so rampant in the startup world. But we can start fighting it by insisting on just a little bit longer of a wait time for a model you’re making for someone, and taking that wait time to write some kickass documentation.
My last boss frequently told me that I needed to slow down to speed up, and it was good advice. I started taking longer to deliver on data projects because I got really meticulous about documentation, good data testing, and good data validation. This meant that I delivered a high-quality PR on the first try, and didn’t have much to fix in it when it went to peer review.
It also meant I felt as familiar with our data as I did with some of my own friends. Documenting so relentlessly (and gently forcing my teammates to do the same) was a big driver of that familiarity.
What’s stopping you from buddying up to your data and giving it the gift of good documentation? Don’t tell me you’re lazy. Developers love to say that as if it’s a flex. I don’t think inscrutable code that you were too lazy to document is a flex. Take care of yourself, and your codebase. Join me in maniacally documenting your life and your data. I think you’ll like the result.
See you next week. Don’t forget to subscribe. Thanks for reading.