Building a data stack has easily been my favorite professional accomplishment so far. Eventually, my data engineer homie and I are going to write about the process a bit more at length. But for now, let a bitch reminisce.
The latitude to experiment and create process from the ground up, to see some things work and others need changing, was amazing. It was also a massive level up for me in terms of my technical ability. I spent 2 years at that analyst job, and sometimes I feel a bit insecure that it was “only” 2 years. While I certainly would not call myself an analytical expert, those 2 years were worth much more than their time on the calendar.
I had to understand the process to spin up a data stack. I had to be responsible for much of the transformation-to-business-presentation-layer part of our data pipelines. This gave me experience, insight, and practice with technical skills that I don’t think I would have gotten in a similarly leveled position at a different organization. I don’t take for granted how rare and wonderful that experience was.
At this job, once we got our data stack set up and running, it was time to do the actual gritty work of analytics. In the early days, this meant I was building in dbt every day. We didn’t have a good presentation layer right after getting our stack set up. Our source tables were (as source tables so often are!) not in a helpful configuration for analytics work.
My workflow would often look like this:
1. Receive requirements for a report from business stakeholders.
2. Discuss and refine what success looked like to those stakeholders. (This part took forever. It’s easily the hardest part of analytics work. Doing this process a zillion times gave me TONS of opinions on it, but I have no idea how portable they are to other industries.)
3. Decide whether to build a fresh model in dbt to meet the report needs, or modify an existing model. Sometimes this meant adding a column; sometimes it meant building a view on top of another couple of models.
4. Make modifications to existing models in dbt, or build a fresh model. (There’s a sketch of what a fresh model looks like right after this list.)
5. Open a pull request, write it up according to our version of dbt’s PR template, prepare any spreadsheets to validate that my work matched what stakeholders were expecting, and wait for the CI job to pass to make sure my work wasn’t going to break prod. (footnote: vendors who make CI/PR check/validation tools, I DON’T WORK THERE ANYMORE and I am not sending you my coworkers’ way. sorry!)
6. Once my CI job passed, wait for another data teammate to review my PR and make any modifications they suggested.
7. Once my PR was approved, merge it and wait for the run-on-merge CD job to complete, which would actually get my work into our production database.
8. Construct the report in Looker, then send it off to my data teammate and the business stakeholder for review. Each of them received a review template to make the process easier.
9. Once all reviews were passed and any revisions were made, we considered the report “shipped” and marked it as “production-ready” in our data catalog.
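For step 4, “a fresh model” sounds fancier than it is: it’s a SQL file with a SELECT in it, saved in your models folder. A minimal sketch (the model and column names here are made up):

```sql
-- models/marts/fct_weekly_signups.sql (hypothetical model)
-- dbt will wrap this SELECT in the DDL needed to build a table or view
select
    date_trunc('week', created_at) as signup_week,
    count(*) as signup_count
from {{ ref('stg_users') }}  -- ref() points at another model in the project
group by 1
```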
Throughout this process, I interacted with “jobs” in dbt at least twice: once for CI and once for CD, and sometimes more often if I needed to re-trigger a job or wait for a scheduled job to finish. I wasn’t responsible for configuring or scheduling those jobs, but I did consider it my responsibility to understand why they failed when they did.
Sometimes, my CI job (the one checking that my work wasn’t going to break prod) would fail. When this happened, I would read the logs and figure out why. I knew what to look for to tell whether the failure had to do with my work, or with a bug in the codebase that was someone else’s domain. This wasn’t too bad: I just looked for the red text in the logs and the FAILURE tag. Usually, the failure came down to a model failing to build or a model failing to pass tests, which made attribution simple: if it was my model that failed, the failure was my fault!
If the job failure was my fault, I knew how to address it. That usually meant going back into dbt to figure out why the model wouldn’t build or why the tests failed. If it had to do with something I couldn’t control (i.e., someone else’s models, a model I hadn’t touched in my PR, or a database error), my job was to open an issue with our data engineer to log the bug, and then get on with merging my PR.
If a CD job failed, I would follow the same process. Additionally, if the failure of the CD job meant that my model failed to get built, I would also trigger a re-build of the model I was working on as well as its ancestors, if that was relevant.
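If the ancestor part sounds mysterious: that’s just dbt’s graph selection syntax, where a plus sign in front of a model name selects the model and everything upstream of it (the model name here is made up):

```shell
# rebuild fct_weekly_signups plus all of its ancestors
dbt build --select +fct_weekly_signups
```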
This workflow was pretty good for us as a team. I was conversant enough in job logs and how jobs worked that whenever a job failed, I wasn’t constantly going to my data engineer and adding more tasks to his plate. Sometimes I had to. But when I did, it was in an organized manner, and it put a task on his plate that actually made sense for him to do, as opposed to him fixing my mistakes for me.
I also enjoyed this workflow because I felt a strong sense of ownership over my work and the business domains that I touched. I also felt empowered to understand how data was moving through our organization, and to understand how and why it could fail. I educated business stakeholders about these potential failure points as it became relevant, with the goal again being to share the responsibility for our data stack’s health. (footnote: everyone is responsible for data quality. I get SaaS tools might help with that but man it’s always weird seeing them touted as a replacement for…I don’t know….paying your people to take the time to educate your non-data teammates? I mean, the answer is usually both tools and time, but still).
The empowering feeling of this workflow is why I think it’s so important for analysts who interact with dbt, even in limited ways, to know what the hell is happening when their work gets deployed. If you’re an analytics engineer or data engineer, then yeah, dude, you should probably know more than the basics about jobs. But my heart beats for the analyst who is up-skilling technically, so today I write for her. Today, she and I go on a journey to understand how deployment works in dbt. Welcome to dbt Deployment for dummies.
Why do we need jobs?
I want you to know I agonized over whether I should put some definitions of jobs and environments first or the “why” behind jobs and environments. I went with the “why” first. My default state when learning most technical things is: “hey bitch—why does anyone give a shit about this?” It feels motivating.
So, okay. dbt lets you make stuff. You get to make stuff SO STUPIDLY EASILY: all you have to do is write SQL queries. dbt will then take those queries and run them against your data warehouse, wrapping them in the appropriate data definition & manipulation language so that your warehouse actually creates tables and views out of them. You can obviously make more than just tables and views, but those are the most common bits of “stuff” that dbt lets you make.
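To make that concrete: you write the SELECT, and dbt generates the surrounding DDL for you. Roughly (the schema and model names here are made up, and the exact statement varies by warehouse and materialization):

```sql
-- what you write in models/orders.sql
select * from {{ ref('stg_orders') }}

-- roughly what dbt runs against the warehouse for a view materialization
create view analytics.orders as (
    select * from analytics.stg_orders
);
```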
You have to put your stuff somewhere. That somewhere is gonna be in your data warehouse. If you want other people to see your stuff, or maybe you want your stuff to get read by a BI tool and turned into pretty charts, you need to put your stuff somewhere where people know it’s “done”. And, you need to do that on a regular basis. Presumably you don’t want to get up at all hours of the day to do that sort of thing. You can do it automatically, though! That’s what jobs are for.
What are jobs? Why do I always see them in association with environments?
The place where your “done” stuff goes is your production environment. In order to refresh your “done” stuff with any new data, you need to run your SQL queries against your source data on some regular cadence to grab and transform new data. That regular running is a job. Jobs and environments are Taylor Swift and bodysuits. Can’t have one without the other!
Jobs tell dbt how and how often to run your transformation code to transform newly-acquired data. Jobs must have somewhere to put the results of that transformation code—that’s environments! Your production environment tells dbt where to put the results of your jobs.
What does dbt need to run jobs successfully?
To run jobs successfully in dbt, you need:
A valid production environment definition and connection (where to put your stuff)
A set of dbt commands to put in a job’s execution settings (how to build your stuff, which stuff to build, and how often to build it)
To define your production environment (or, where your “done stuff” lives), you need connection details to the database and schema(s) that you want your work to go to. In dbt Cloud, this is done in the “Deploy” section of your IDE. These docs are pretty helpful. Connection details are finicky and idiosyncratic from database to database, so you should search dbt’s docs with your data warehouse in the search terms for guides around hooking up different warehouses.
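For what it’s worth, if you’re on dbt Core instead of dbt Cloud, the rough equivalent of an environment is a target in profiles.yml. A sketch, with every name and connection detail made up (and, as promised, finicky and idiosyncratic by warehouse):

```yaml
# profiles.yml: dbt Core's rough equivalent of dbt Cloud environments
my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: analyst
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: warehouse
      schema: dbt_scratch     # personal scratch space for development
    prod:
      type: postgres
      host: warehouse.internal.example.com
      user: dbt_prod
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: warehouse
      schema: analytics       # where your "done" stuff lives
```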
To tell dbt how to build your stuff, you need valid SQL that is saved in your models folder and passes a dbt run in your IDE. When you build your job, in the “execution settings” (in dbt Cloud anyways) you’ll be able to type any dbt commands you want.
You should probably have at least one job where your command is simply dbt build, scheduled to refresh all your assets once a day or maybe every other day. But there may be models that you want refreshed more often, because you get new data many, many, many times a day. If those models all live in the same folder, you can use selection syntax and have your job run dbt build --select path/to/frequent_update_folder.
A second, more frequent job that only runs a subset of your models is helpful in that scenario: that way you only spend compute on the models that actually need frequent updates, instead of on your whole dang project.
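Concretely, the execution settings for that pair of jobs might look like this (the folder path is the made-up one from above):

```shell
# job 1: the daily full build of the whole project
dbt build

# job 2: a more frequent build of just the fast-moving models
dbt build --select path/to/frequent_update_folder
```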
Finally, you also need to tell dbt when to run these jobs. In dbt Cloud, the selections are pretty simple. You either tell dbt specific hours to run your job on (9 am! 11:15 am! etc) or specific intervals to run your job on (every 10 hours). You can also tell dbt to run a job when a separate job in your project or a different project has finished—but that’s a bridge too far for today.
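Under the hood, these schedules are cron expressions, and dbt Cloud will also take a custom cron string if the simple pickers don’t cut it. Standard cron syntax, for the curious:

```shell
# minute hour day-of-month month day-of-week
# e.g., run at 11:15 am, Monday through Friday:
15 11 * * 1-5
```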
How do jobs fail?
A job executes a series of steps. Below is the typical sequence of steps for a job in dbt Cloud:
1. Clone your git repository
2. Connect to your data warehouse
3. Invoke dbt deps (you know, the one that gets your packages installed)
4. Any dbt commands you tell the job to run!
Those are at least 4 possible failure points. If a job fails any one of these steps (with 2 notable exceptions), then the whole job fails.
A failure in step 1 or step 2 probably has something to do with dbt’s connection to your git provider or your data warehouse, respectively. You’ll want to get a look at those settings in your production environment if your job fails there.
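If you’re working in dbt Core, or you just want a sanity check, there’s a built-in command for exactly this kind of connection triage:

```shell
# verifies your installation, profile, and warehouse connection
dbt debug
```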
A failure in step 3 means there’s an issue with one or more of your packages. This one is easier to troubleshoot in the cloud IDE: head back in there and run dbt deps to get a look at the logs. I’d always start by checking your package versions against your dbt version; a conflict between those can cause you a big headache.
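That version pinning lives in packages.yml, where you can give each package a range instead of a single pinned version (dbt_utils is a real package; the range here is just an example):

```yaml
# packages.yml: pin packages to ranges compatible with your dbt version
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
```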
A failure in step 4 means one or more of your dbt commands isn’t running successfully. This is where I ran into most of my problems. Typically, the errors associated with those failures will point to a specific problematic model. In that case, get yourself back to the IDE and into that specific model. Pull up your handy dandy debugging errors guide from dbt’s docs and find the section relevant to you.
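One habit that saved me time here: reproduce the failure on just the problem model before touching anything else (the model name is made up again):

```shell
# build only the suspect model and run its tests
dbt build --select fct_weekly_signups
```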
Software isn’t perfect, queen. Errors are gonna happen, so embrace reading the logs. You’ll get quick at it so much faster than you think.
You betta work, bitch
Data, data, data, I cannot make bricks without clay!!
If data is supposed to tell you how to run your business, you better make sure it’s up to date, and that the jobs keeping it up to date aren’t failing. Maybe the data helps you understand the circumstances under which customers churn from your product, like distinguishing between cost-conscious customers and customers whose entire team was laid off, so there’s no one left to spend the money anyways.
Making sure your data is up to date is as easy as pushing a few buttons in dbt Cloud. And that’s all there is to it! Jobs always run perfectly, transformation logic always builds new tables successfully, and source systems are never changed significantly enough to break your pipelines.
Hahahahahahahahahahaha oh god I almost got myself with the cringe of that one. A girl can dream. At least you don’t have to suffer mansplaining from Sherlock Holmes-ass tech bros who are high on their own fumes. You and I can be dummies together.
Thanks for reading—this was a long one this week! Tell your friends, don’t forget to subscribe, all of the above. See you next week.
I think we might need your help figuring out how to curate better pipelines lol