I feel the need to say that the title of the post is a pop culture reference. The book series in the picture above felt quite popular when I was a kid. I saw a post on LinkedIn a while back where some data bro was trying to advertise his service, and it was called something like Data Engineering for Dummies. The recipient of said advertising got extremely offended that the advertiser would dare call him a dummy.
I am sad to say that the pace of events that make me feel old on the internet is accelerating. I don’t think the advertising recipient understood that it was a joke and a reference to the “For Dummies” book series. At the risk of being too millennial-core on main, I really am not sure the Zoomers know about this reference book series.
Furthermore, can we agree that in the year of our lord {{ current_year }}¹, understanding the limits of your own intelligence is a good thing? It’s not self-deprecating to call yourself a dummy every now and then. I do this a lot, and I’m not fishing for compliments when I do it. I am well aware that I’m fairly intelligent. But calling myself a dummy every so often is a good reminder that I always have more to learn, and that blind confidence in my current intelligence level is a recipe for failure. There’s always more to learn.
I’ve been learning a lot at my new job, and my favorite thing I’ve learned about lately came up while doing a deep dive on dbt’s package ecosystem. Today, I bring you a review of the dbt_artifacts package, developed and maintained by Brooklyn Data. And yeah, I’m sticking with the “for dummies” title. I think I’ve earned the right to come across as a bit outdated.
Let’s get into it. Subscribe first if you haven’t.
dbt Packages with Models
As a dbt practitioner at my last job, I fell into a trap that I feel like probably befalls a lot of us builder-types. We think, hey, why would I use something prebuilt when I could just build the thing myself? Nothing wrong with this sentiment at all; I think it comes from a place of wanting to understand the process of building the thing. It might be a model to address a certain business need, or a slick bit of SQL to perform an analysis the CEO just asked for, or even a damn data pipeline. Python really does go BRR and it’s fun to understand how it works.
Generally, I think wanting to understand stuff and avoid black boxes is a great sentiment. I’m very mistrustful of black boxes myself. I feel like that’s why I haven’t been super bullish on the whole AI revolution: mainly because I don’t understand it, and I have a hard time trusting things I don’t understand².
Sadly, this just isn’t that efficient of a way to live—insisting that you build all the things you need to accomplish your data goals yourself. I eventually got over this with respect to common dbt tasks like writing YAML files, base models, and certain SQL functions like generating date spines. I stopped worrying and learned to love macros.
My mega-bestie out of the dbt package ecosystem was codegen. I know dbt_utils gets a lot of shine, and she’s very useful too. She has a lot to offer—date_spine in particular is a very useful function that saved my ass on a particularly tricky time series build. But codegen? She’s a damn star.
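Quick aside for the curious: a date spine is a minimal sketch away. Drop something like this in a model and dbt_utils gives you one row per day (the dates below are placeholders, and note the end date is exclusive):

```sql
-- one row per day between two placeholder dates (end date is exclusive)
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2023-01-01' as date)",
    end_date="cast('2024-01-01' as date)"
) }}
```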
I don’t know anyone who particularly loves writing YAML, and if they do, they’re just like my friends in college who loved organic chemistry. Weirdos. Entitled to their weird opinion, but I don’t relate to it at all. Writing YAML before codegen was absolutely the bane of my existence.
But after codegen? All I’m doing is playing fill in the blank like I’m a damn mad libs enthusiast, not a data practitioner.
If you are not using generate_model_yaml and generate_source whenever you’re doing dbt setup tasks, you are missing out. These macros are queens. They’ll format your source files for you, generate YAML skeletons for each of your models, and all you have to do after they do their thing is copy/paste the code output and fill in descriptions. It’s really not that bad at all.
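If you’ve never run them, here’s roughly what the invocations look like from the command line. The schema, database, and model names below are placeholders, so swap in your own:

```shell
# generate source YAML for everything in a schema (names are placeholders)
dbt run-operation generate_source --args '{"schema_name": "jaffle_shop", "database_name": "raw"}'

# generate a YAML skeleton for a model, ready to copy/paste and fill in
dbt run-operation generate_model_yaml --args '{"model_names": ["stg_customers"]}'
```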
Figuring out codegen was a total game changer for me. If you’re lucky enough to be squarely in the analytics engineering space, that might seem a bit elementary to you. But for me, I was straddling the analyst/AE space, and had much more to do besides optimize our dbt project. So figuring out codegen felt like a huge win. I stopped spending so much brain space on getting my YAML formatting right. I just used codegen, and started to think more about my project design instead.
Codegen totally rocked my world. But in ramping up to my role as a technical instructor, I’ve learned that there are packages in dbt that have prebuilt models. That’s some galaxy brain shit right there.
You might be asking yourself, hey bitch, why would I want a package that has prebuilt models? How could a package possibly know what models I need for my dbt project? My business is UNIQUE AND QUIRKY AND NOT LIKE THE OTHER BUSINESSES!!!
Well babe, I’m sure that’s true for a lot of your modeling needs. But if you use data from SaaS tools like Facebook Ads, Google Ads, Snowplow, Jira, Asana, or anything like that, I’m afraid your business is like the other businesses. The structure of the data you analyze from those sources is gonna be more or less the same from company to company—it’s just the content of the data that will vary. If you poke around the package hub, you’ll find a ton of content (especially under Fivetran’s section) that premodels data from those common source systems.
Furthermore, your dbt project generates a shit ton of metadata that everybody and their brother has built paid services off of, for better or for worse. You probably have good reason to be interested in that metadata. It can tell you stuff like how long your models are taking to run, what tests you have applied to how many models, which of your models are materialized using what strategy, and more.
I’m a fan of data catalogs myself, but you don’t necessarily need one to take advantage of your project’s metadata. You can use a dbt package to start dipping your toes into these models and figuring out if and how that metadata might be useful to you in managing your project. I think it’s a great idea to have some sense of your project’s metadata. It’s your first line of defense against unnecessary model proliferation to have a sense of how many models you have and what they’re doing.
dbt_artifacts is a truly sick-ass package. It comes with prebuilt models against your project’s metadata that you can use to start understanding your project’s health on a deeper level.
Using dbt_artifacts
First off, it’s important to be aware that dbt_artifacts is a package built and maintained by Brooklyn Data Co and hosted on the dbt package hub. That means that any bugs that may or may not surface in the package are not dbt Labs’s problem—those issues should be opened with BDC. Just a gentle little reminder that using contributions to an open source project means you need to be aware of who built the thing.
Setting up dbt_artifacts is easy as hell though, so let’s talk about it. Like with most dbt packages, installing it is beyond simple. You just copy the installation code snippet from the package hub and pop it into your packages.yml file. If dbt Cloud doesn’t automatically run a dbt deps for you after that, make sure you run one.
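For reference, the snippet looks something like this. The version below is just an example pin; grab whatever the package hub currently lists:

```yaml
# packages.yml
packages:
  - package: brooklyn-data/dbt_artifacts
    version: 2.6.2  # example pin, check the package hub for the latest
```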
Next up, you need to add an on-run-end hook to your dbt_project.yml file. If you’ve never worked with these, don’t be scared. Hooks are just SQL statements (or macro calls) that dbt runs at set points in an invocation; an on-run-end hook fires after your dbt run or dbt build finishes.
So for the dbt_artifacts package, you need to add the on-run-end hook so that the package can actually upload your run results to your cloud data platform; its prebuilt models are then built on top of that uploaded metadata. Once you’ve added your hook, you’re ready to materialize the models this package premakes for you: go ahead and run a dbt build.
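Per the package’s README, the hook goes in dbt_project.yml and looks like this:

```yaml
# dbt_project.yml
# upload run results to your warehouse at the end of every invocation
on-run-end:
  - "{{ dbt_artifacts.upload_results(results) }}"
```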
Once your dbt build completes, you should check your run logs to make sure that the models this package premakes for you were actually materialized.
Once they’ve been successfully materialized, they play by the same rules as any other models in your project. You can ref them and build on top of them! Now you can do shit like count up all your views vs. tables vs. incrementals and assess yourself as a dbt dev—are you leaning on one mat strategy over another? Should you change it up for any reason?
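As a sketch, a query along these lines would do that count. The dim_dbt__current_models model ships with the package, but double-check the column names against your own generated docs before you trust me:

```sql
-- model count per materialization strategy
-- (dim_dbt__current_models ships with dbt_artifacts; verify the
-- materialization column name against your project's docs)
select
    materialization,
    count(*) as model_count
from {{ ref('dim_dbt__current_models') }}
group by materialization
order by model_count desc
```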
You can also get a sense of your test usage from this package. It will help you see how often you’re using tests and whether you’re favoring any type of test over another. It’s a nice, data-informed way to assess your testing strategy and make sure it’s well rounded.
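Something like this rough cut against fct_dbt__test_executions could get you started. The node_id and status columns are my read of the package docs, so verify locally before leaning on it:

```sql
-- how often each test runs and how often it passes
-- (node_id and status are assumed column names; verify locally)
select
    node_id,
    count(*) as times_run,
    sum(case when status = 'pass' then 1 else 0 end) as passes
from {{ ref('fct_dbt__test_executions') }}
group by node_id
order by times_run desc
```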
The last nice piece of this package I’ll recommend you take a look at is the invocations model. It lets you know how and when dbt is being invoked in your project, both at a dev and a production level. You can also assess how often your cloud jobs are getting run from this table. That info is also available in dbt Cloud, but maybe you want those stats elsewhere too. You can have them in your data warehouse thanks to this package.
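A rough sketch of that, assuming the run_started_at and dbt_command columns documented for fct_dbt__invocations (the date_trunc syntax here is Snowflake-flavored, so adjust for your platform):

```sql
-- invocations per day, split by command (run, build, test, ...)
select
    date_trunc('day', run_started_at) as run_day,
    dbt_command,
    count(*) as invocations
from {{ ref('fct_dbt__invocations') }}
group by 1, 2
order by run_day desc
```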
dbt’s metadata is famous for being quite rich. And honestly, if you’re at the same level of technicality as me (moderate), you might find it a bit intimidating to take advantage of. Metadata-based packages help.
So what are you waiting for? Check out the dbt_artifacts package. There are more options beyond this one—dbt_project_evaluator and parts of dbt Explorer, to name a couple. But for those of us who don’t like that much of a black box and just want flexible yet still prebuilt options to explore our project’s metadata, this is a terrific option.
Ride on, metadata weirdos. Don’t forget to subscribe.
1. That’s a little Jinja joke for all my dbt heads out there. The girls that get it, get it.
2. Hahah ouch, too much telling on myself for a post actually about data.