Podcast

How Slack fully automates deploys and anomaly detection with Z-scores

This week we’re joined by Sean McIlroy from Slack’s Release Engineering team to learn about how they’ve fully automated their deployment process. This conversation covers Slack’s original release process, key changes Sean’s team has made, and the latest challenges they’re working on today.

Timestamps

  • (1:34): The Release Engineering team
  • (2:13): How the monolith has served Slack
  • (3:24): How the deployment process used to work
  • (6:23): The complexity of the deploy itself
  • (7:39): Early ideas for improving the deployment process
  • (9:07): Why anomaly detection is challenging
  • (10:32): What a Z-score is
  • (13:23): Managing noise with Z-scores
  • (16:49): Presenting this information to people that need it
  • (19:54): Taking humans out of the process
  • (23:13): Handling rollbacks
  • (25:27): Not overloading developers with information
  • (28:26): Handling large deployments

Listen to this episode on Spotify, Apple Podcasts, Pocket Casts, Overcast, or wherever you listen to podcasts.

Transcript

Abi Noda: This is the Engineering Enablement Podcast. I’m your host, Abi Noda. Sean, thanks so much for your time today. Excited to dive in.

 

Sean: Yeah, for sure.

 

Abi Noda: This conversation today is based around this really interesting article you published on the Slack Engineering blog about how you all automate deploys, but before diving into that topic, just share with listeners a little bit about your team within Slack and what you focus on.

 

Sean: Sure. So we’re known as the Release Engineering team. We’re a team of about five, and we own various pieces of the infrastructure around building and deploying Slack’s webapp codebase. And most of Slack runs on this monolith we just call The Webapp. It’s a gigabytes-large monorepo with hundreds of developers working on it every day, so getting this thing built and continually deployed has its own unique set of challenges, so we have a team dedicated mostly to focusing on just it.

 

Abi Noda: I’m curious about the monolith journey. A lot of organizations at Slack’s scale have hit a point where they’re kind of moving away from the monolith, or decomposing it, or instituting kind of hard boundaries within the monolith, so just high-level: have there been conversations around that, or has the monolith been serving you guys well? And what do you think worked well for you all to succeed with the monolith, aside from the things we’re going to talk about today?

 

Sean: Yeah. I mean, I can’t fully speak to how those conversations have happened at higher points in the organization or whatever, but it’s one of those things you’ll hear mentioned every now and then, but never sort of taken that seriously, because I think our developers are just productive within the monolith, and it serves us well. And I mean, in some ways, operating it is a little bit more difficult because of the scale of the single application, but in some ways, it allows us to just focus on product more, right? Instead of having a bunch of engineering teams supporting a bunch of different applications and sort of replicating that work and stuff, we can have just a few smaller, dedicated teams focused on the infrastructure side. So at least from my perspective in the organization, that’s how it looks.

 

Abi Noda: Back when I worked at GitHub, we had a monolith as well for most of like github.com, and one of our big challenges with the monolith was the release process because of how many engineers are working in this one code base and the code base needs to get deployed all at once. That’s really what we’re here to talk about today, is how your team at Slack has been able to automate and innovate on that process, but I want to start sort of pre all the amazing work you’ve recently done. Tell me a little bit about how deploying worked before the recent tooling that you’ve built.

 

Sean: Sure. So we had this Deploy Commander Schedule. So of the hundreds of developers that worked on the system, they would receive this DC training, Deploy Commander training, and would get these two-hour shifts that went mostly around the clock. We’d hand over to APAC in the evening and stuff, and you would come on, take a handoff from the previous DC, and your job would be to manually walk the Webapp through its deployment, watching this dashboard with hundreds of graphs on it, and doing some manual testing and things along the way. So when you reach dog food, make sure you can still send messages, stuff like that, right? Yeah, it was just all very manual, but our team managed that schedule, so that was fun.

 

Abi Noda: And what were some of the pros and cons of this approach? Was this a stressful … Was it like being on call, or was it fairly calm? What worked well? What didn’t?

 

Sean: I mean, it worked, I guess, technically, right? The Webapp got released would be a pro, but it was definitely stressful. Of course, we’re Slack, everything happens in Slack channels and stuff. We have UIs for our deployment stuff, but we would hang out in the deploys channel a lot, and you could just see these developers, it’s maybe their first time doing it, or they only do it every few months, and of those hundreds of graphs, one line goes up and to the right, and they don’t know how serious it is or if that’s normal, and it would just freak them out a little bit. You could tell they would sort of not know what to do with that graph, ask a lot of questions.

 

It would take longer for incidents to get started and stuff like that, and that wasn’t always the case. We had some real veteran engineers that would do great and stuff, but you could really tell it was a mix of experiences there, and it was completely reasonable: if you hadn’t done it before, deploying something this large, that affects this much of the company and this many of your teammates and stuff, was an uncomfortable process, I think.

 

Abi Noda: And who were the deploy commanders? Is this volunteer, or was every product engineer at some point required to do a rotation?

 

Sean: I believe it was volunteer. The rotation was, I think, around 150 people out of the 400 or 500 developers in the pool. So yeah, we would sort of go out and do campaigns like, “Hey, come get the training. Help deploy.”

 

Abi Noda: We’ve talked about how this is the Webapp, the one big monolithic code base, but you talk about how there’s hundreds of charts. Share with listeners a little bit of context about the complexity of the deploy itself. We were talking earlier about how it actually gets deployed to different clusters and places. Share a little bit more about the scope and complexity of where the deploy actually lands and how it lands.

 

Sean: Sure. So we think of it on the deploy side as a single deploy, right? Like we say, “Deploy to dog food,” or, “Deploy to 3% of production in AZ2,” or whatever sort of place we’re on, but behind the scenes, that’s going into different clusters that handle different aspects of the application. So our async job workers run on the same code base and will receive that deploy at the same time as the files cluster, which is scaled differently and sort of handles the files endpoints and stuff like that, right? So those infrastructure details are handled by other reliability engineering teams that are specifically focused on scaling that infrastructure. So from our perspective, it’s just that we sort of bucket them more broadly.

 

Abi Noda: Yeah. So you had this deploy commander process, and you started looking into, “What can your team, Release Engineering, do to give these deploy commanders better signals, make their job a little easier, a little less stressful?” What were some of the early things that you looked into?

 

Sean: Yeah. So the joke was just we were going to have just a single UI with like a red or a green button for go, no go, and that’s all deploy commanders would watch, right? So that was a joke, but sort of still an abstract goal, to have something that could say like, “Hey, things are bad or things aren’t bad,” and they don’t have to be expert on a million different things. So at that point, the team went out and started looking into, “How do we even decide that?” Right?

 

“These humans are having a hard time telling whether things are bad or not, so how have people solved this problem at other companies?” Asking internal data science teams, asking internal SRE teams: has anybody seen anybody solve these kinds of problems, and how did they solve them? I think it was my teammate, Harrison, who actually came in with Z-scores, and it ended up being a very simple way to implement this … well, a functionally simple way to implement this, and so we decided to start that way, with sort of the simplest algorithm or statistical method that we found from the outset, and see how it worked, and if we needed more, we would add more.

 

Abi Noda: So, of course, as we talked about before the show, we want to really unpack what is the Z-score, how have you all implemented it, how does it get used? But before we even go into that, share, why is anomaly detection a hard problem at Slack, or was it, before you solved it?

 

Sean: Yeah, I think anomaly detection is a hard problem in general because it feels like a very human problem, or at least a domain for humans to solve, because most of the time, when you’re thinking about alerting or watching graphs, you’re setting static thresholds, right? You’re saying, “This needs to be above 10 errors per minute for five minutes, and then page a human,” and then the human will decide how bad it is, but it’s too static, right? You’re not asking, “Is this anomalous?” You’re sort of asking, “Is this objectively good or bad?” when you’re setting those hard thresholds, and it results in a lot of noise and a lot of tuning over time. With a system as large as the Webapp, that’s a whole lot of noise and a whole lot of tuning all the time to get hundreds of graphs in line. So rather than asking, “Is this objectively a good or bad amount of errors?”, if you change the question a little bit to, “Is this deployment making things different, or better, or worse than a previous deployment?”, so instead of that objectively good or bad, “Is there a difference?”, then you can use statistical methods to figure out if some behavior is anomalous without worrying about those objective lines that you’re drawing in the sand.

 

Abi Noda: It does, and I think you’ve given a great layman’s description of what a Z-score is. So now, let’s move into more of the technical side. So explain to listeners what is a Z-score, for folks who don’t have a stats background, and how exactly did you implement it within your systems?

 

Sean: Yeah. So the not fun description of a Z-score is it’s the data point you’re worried about minus the mean of all the data points, divided by the standard deviation. So what that tells you, what that little equation gives you is the size of a spike in the graph, and so we pull the last three hours of data, every single data point, and you calculate the mean and the standard deviation, and then for the past five minutes, we calculate the Z-score, filling in the data point in the equation, and, yeah, we use sort of that threshold of like … The Z-score is the badness, right? It’s the size of the spike.

 

So a Z-score of three, for example, is typically, it depends on the shape of your data, actually, but it’s typically around the 99.7th percentile, I think, of data. So you have something that is extremely bad at a Z-score of three, which is what we use as our threshold for a lot of things.
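To make the mechanics concrete, here is a minimal sketch in Python of the check Sean describes: compute the mean and standard deviation of a trailing window (roughly three hours of per-minute data points, in this example), then score the most recent few minutes against it and flag anything at or above a Z-score of three. It’s illustrative only, not Slack’s actual code, and the window sizes and sample data are assumptions.

```python
# Minimal Z-score anomaly check (illustrative sketch, not Slack's code).
from statistics import mean, stdev

def z_scores(window: list[float], recent: list[float]) -> list[float]:
    """Z-score of each recent data point against a trailing window."""
    mu = mean(window)
    sigma = stdev(window)
    if sigma == 0:  # a perfectly flat graph can't produce a spike
        return [0.0 for _ in recent]
    return [(x - mu) / sigma for x in recent]

def is_anomalous(window: list[float], recent: list[float], threshold: float = 3.0) -> bool:
    """True if any recent point spikes beyond the Z-score threshold."""
    return any(abs(z) >= threshold for z in z_scores(window, recent))

# Example: ~3 hours of per-minute error counts hovering around 5-7,
# followed by a 5-minute spike to roughly 40 errors per minute.
history = [5.0, 6.0, 4.0, 5.0, 7.0] * 36
last_five_minutes = [6.0, 8.0, 40.0, 38.0, 35.0]
print(is_anomalous(history, last_five_minutes))  # True
```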

 

Abi Noda: So the Z-score gives you this number, and from what I understand, you’re tracking the Z-score in real time for a deployment that’s gone out, and then with the Z-score, you still do have an interpretation step as well, like thresholds on, “What’s a Z-score that may need attention?” So explain how you come up with those thresholds.

 

Sean: To be perfectly honest, we just had to play with it a little bit. I think it was looking at the data and looking at, “Okay, we see a big spike. What was the Z-score for that spike?”, or for maybe some past incidents, things like that, and the ballpark looked like three would be a good number that’s sort of an objectively bad, large spike, a definite change in your behavior that’s not just one more error per minute or something like that. And so we started there. And then, the original release of the release bot was just sending messages into our team’s Slack channel, and so we were able to see what it was thinking without interrupting our deploy commanders, and that was a good way to sort of watch it live, and we, in helping deploy commanders, could know, “Okay, the bot detected a Z-score of three here, and we see that it was right. There was something objectively bad happening,” and so we were able to build a little confidence over the course of a few weeks, a few months, and that number has just been our magic number. It’s worked out for us, so that threshold of three.

 

Abi Noda: For listeners without a stats background, I believe Z-scores can really be any value, but a typical rule of thumb is that it’s not likely that they’ll go above three or below negative three, and so when we refer to a Z-score of three, that’s really at the highest end of values we would see as far as identifying an anomaly. One of the challenges with the static threshold approach of how many errors or things are being reported, as you said, is that there can be an interpretation step that’s challenging for a human and probably a lot of false positives. What have you seen with Z-scores? Are there still a lot of false positives, as far as the Z-score telling you that there’s a problem or anomaly, but it was just noise? Has that improved with this approach?

 

Sean: It’s definitely much better. We actually … And I also mentioned this in the blog post, we found that the Z-scores were almost always worth looking at, even if one alarmed and then sort of came back down to baseline fairly quickly afterwards. We were grateful to have had the early alarm, to know, “Hey, this graph spiked. Let’s get some eyes on it before bad things start happening.” But our static thresholds were too noisy. We were still monitoring thresholds in the background, because, well, we figured people had put a lot of work into those, and we were paging humans based on that.

 

And so it was a good objective definition of badness in the system, but those we did find too noisy, to the point where we were thinking about completely getting rid of them. And right before we pulled the trigger on that and got rid of the thresholds, we moved to a dynamic threshold system. We wanted to honor the thresholds unless it was obvious that a threshold just wasn’t very good, that it wasn’t set correctly and was going to be noisy, and so we decided to take the average of the data over a longer span of time. So for Z-scores, we look at the past three hours, and then for these thresholds, we decided to look at the past six hours, and we take the average of that, whatever it’s been sitting at. So if it’s been sitting very close to the line or frequently bouncing above the line, the average might be above the static threshold, and we would use that average instead of our static threshold if the average was higher. So we look at the average and the static threshold, take whichever one is higher, and then use that to alert.

 

We still found that that was a little noisy. With the six-hour range, we were going back really far, but you might have something that happens at 2:00 PM every day or whatever, right? You have some spike at 2:00 PM every day that does affect the average, but not enough to bring your dynamic threshold value up to where you need it to be, and so we wanted more representative data points to include in the data set. So if today’s Wednesday, we grab Wednesday from now, six hours into the past, then we look at Tuesday, the same six hours into the past, and then we look at a week ago, so last Wednesday, the same six hours into the past, and then we take the average of all of that and run the same algorithm. And that’s how we dealt with tamping down on the noisiness of those thresholds.
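As a rough illustration of the dynamic-threshold logic described above: average the metric over the same six-hour window today, yesterday, and a week ago, then alert on whichever is higher, that average or the hand-tuned static threshold. This is a hedged sketch under those assumptions; the `fetch_series` interface is a made-up stand-in for whatever metrics backend you query.

```python
# Dynamic threshold sketch (illustrative only; fetch_series is hypothetical).
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable

# Assumed interface: returns the metric's data points for a time range.
MetricFetcher = Callable[[datetime, datetime], list[float]]

def dynamic_threshold(
    fetch_series: MetricFetcher,
    static_threshold: float,
    now: datetime,
    window: timedelta = timedelta(hours=6),
) -> float:
    """Max of the static threshold and the average over representative
    windows: the last six hours today, yesterday, and a week ago."""
    samples: list[float] = []
    for days_back in (0, 1, 7):
        end = now - timedelta(days=days_back)
        samples.extend(fetch_series(end - window, end))
    if not samples:
        return static_threshold
    return max(static_threshold, mean(samples))

def breaches(value: float, fetch_series: MetricFetcher,
             static_threshold: float, now: datetime) -> bool:
    """True if the current value exceeds the effective (dynamic) threshold."""
    return value > dynamic_threshold(fetch_series, static_threshold, now)
```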

 

Abi Noda: Really interesting, the approach you’ve taken to these dynamic thresholds. I want to transition from the technicalities of how you calculate anomalies to, “How do you actually use this information and present it back to the humans for people to act on, or be informed?”

 

Sean: We have the Release Web UI that you can go to see the status of everything, see the status of the current deployment, and see open events. So you can see we have renderings of the charts that we’re monitoring, and we’ll show any open events, stuff like that. That is not how most of our users interact with the system, though. We, again, go through Slack for almost everything. So in our deploys channel, we will start a new thread anytime there’s a grouping of events, and so it’ll communicate out like a breach in PHP errors, and then if a bunch of other things are happening at the same time, it’ll thread underneath that message, so we’re not totally spamming the channel, and we try to include as much useful information in there as we can.

 

I think at the time I published the blog, there were three status circles we would put next to it, white, blue, or red, that had different meanings. So white was a threshold breach, and so it’s like, “Eh, we saw something. We don’t really think it’s that important,” right? And blue was a single Z-score breach, so, “Hey, a graph really did spike, but we don’t know what’s going on.”

 

Red is multiple Z-score breaches, so, “That’s a lot of graphs just spiking. Somebody please do something.” And so automation stops. We send the message. We’ve started including more context in there too, right?

 

It’s got links to the graphs in Grafana. If it’s certain categories of errors, it will come with links to our log indexing and search, broken down by the kinds of errors and the deploy steps we’re seeing and stuff, so it’ll take you directly to, hopefully, the error messages that can get things going. Recently, we’ve added a few extra levels. We’ve added more cardinality to our metrics. So early on, we were monitoring everything at the system scale, so 500s are up, but we didn’t know if they were up in deploy step one or deploy step five. 500s were just up. And so since then, we’ve added both deploy tier, so that’s like staging, dog food, canary, production, and AZ, because we deploy in … We’re in AWS, and so that’s how we bracket our stuff, and that allows us to see very specifically that things are alerting, and are they alerting within the bounds of the deployment? Right?

 

Is it the things that we just deployed that are actually alerting? So we’ve introduced some additional levels of … There’s a siren, and then there’s a flaming elbow emoji, that are sort of severe and very severe tiers, based on a combination of the severity of the alerts and how confident we are that the alert was caused by the deployment, or is within the scope of the deployment.
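For readers following along, here’s a rough sketch of how events might map onto the white, blue, and red circles Sean describes (a threshold breach, a single Z-score breach, and multiple Z-score breaches). The data model and names are invented for illustration, and the newer siren and flaming-elbow tiers, which also factor in confidence that the alert sits within the deployment’s scope, are omitted.

```python
# Status-circle mapping sketch (names and data model are illustrative).
from dataclasses import dataclass

@dataclass
class Event:
    metric: str
    z_score_breach: bool     # the metric breached the Z-score check
    threshold_breach: bool   # the metric breached a (dynamic) threshold

def status_for(events: list[Event]) -> str:
    z_breaches = [e for e in events if e.z_score_breach]
    if len(z_breaches) > 1:
        return "red"    # multiple Z-score breaches: stop, somebody please do something
    if len(z_breaches) == 1:
        return "blue"   # one graph really did spike, cause unclear
    if any(e.threshold_breach for e in events):
        return "white"  # threshold breach only: noted, probably not that important
    return "ok"

print(status_for([Event("php_errors", True, True), Event("5xx", True, False)]))  # red
```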

 

Abi Noda: So you’ve built these fairly sophisticated UIs and Slack integrations to get the data to humans, implemented fairly sophisticated Z-scores and anomaly detection. From what I understand, at some point, you sort of A/B tested the release bot against the humans to see if the humans could be taken out of the process. So share with listeners a little bit about how you tested that and what the outcome was.

 

Sean: Yeah. So, like I had mentioned, we let the bot run for quite a while just talking to our channel, and anytime we made tunings or changes or anything, we were bringing more of it into our channel, and we fairly quickly started piping those into the main deploys channel because it was doing so well versus humans, right? This thing will alert as soon as it sees an issue, and we have fantastic engineers, but you’re just not going to compete with a computer as far as vigilance and timing go. And so the bot would always be sending messages before the human came in and said, “Hey, this graph’s weird. What do I do?”, or, “Hey, I’m going to call an incident because of this.” Right?

 

We would always get those bot signals first. And so we just kept things going directly to the deploys channel, or we started sending things directly to the deploys channel, and that way, the person responsible for the deploys was also getting the signal earlier. It just sort of felt irresponsible to be hiding it in the background in our team channel. I think at that point, it was just a matter of the humans hanging out with the bot. The humans, our DCs, also started relying on the bot more and more, right? It would be sort of, let the deploy run and just step back, go get some coffee and stuff, and it’s like, “Why are we scheduling people for this anymore, instead of just letting the bot push the button?”

 

Right? The humans aren’t adding much at this point when they’re just fully trusting it to do the monitoring for them, so let’s use it.

 

Abi Noda: So you started letting the bot press the button and took humans out of the process. I mean, it must’ve felt great for your team, right? Mission accomplished and more?

 

Sean: Yeah. I mean, that was a big step. Our team ended up sort of taking on a little bit more of those deploy commander responsibilities if the bot was acting up or if an incident did need to be called and stuff, but because we didn’t need somebody sitting there, watching it 24/7 anymore, we didn’t need to sit there. The bot, at this point, will stop deployments, and then just page our team if it thinks it needs help. And then from here, it’s allowed us to focus on more deploy safety work.

 

Because we’ve automated it, because we have more control of the process in our code, we’ve started doing things like automatically rolling back if we detect there’s an issue, or … Well, that’s the big one. It’s mostly rollback work. And so we’ve been working for a while, trying to encourage people to hotfix less and roll back more, and so we started doing things around that, where the bot, when it detects an issue, will send in Slack like, “Hey, I stopped deploys,” and if you want it to roll back, there’s a rollback now button actually in Slack that you can just push.

 

And so then, I went to an engineering all-hands and showed it to everybody, and said it’s safe to push. We encourage you to push it. Please push it. So that was fun. I think I sounded like a crazy person in front of everybody, asking them to push the big red button or whatever.

 

Abi Noda: And share with listeners, why are you encouraging developers to roll back?

 

Sean: It’s generally the safer, faster path. When things are going wrong, I think we, as humans, want to fix our mistake, and we just want it to be better, and you want to show people that you can fix the problem that you just caused or whatever, right? And so we have a blameless culture, but you still feel that, and I think people just want to get their fix in, and get it out, and it’s just it’s done and it’s over, and they took responsibility for it, and got it out. But the problem with that is that you have to identify the issue. You need to make the code change.

 

You need to wait for a build to happen, and then we need to deploy forward, which, we can emergency deploy, but we don’t necessarily like doing that, so then you have to deploy forward and go through all the deploy steps. Our rollback, on the other hand, is instantaneous. It’s our fastest deploy. It takes just a minute or two to roll prod back. And so rather than making a change under duress, that’s going to take longer, that might not even work.

 

I think I’ve definitely been in that situation before where you hotfix something, and it’s still broken because you didn’t do it right. And so it’s like, “Let’s just clear the table, get it fixed. We can revert your PR, work on it when you’re not stressed out.” I understand the impulse to sort of hotfix, but it’s just not the right call the majority of the time. And so we want to give our developers as much flexibility as they need, and we want to not put up too many barriers between them and the tools, or require a bunch of sign-offs. If something needs to be hotfixed, they need to be able to do it.

 

We’ve been really working on ways to kind of nudge people towards rollbacks without being … So throw up a big button in Slack that says, “Roll back now.” Make it easy and obvious, right? And then, also, at this point, if the bot is sure, or as sure as it can be, that it wants to roll back, it’ll just do it automatically. And if it’s a little less sure, it’ll just send the button.
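As a sketch of that decision flow, stop deploys first, then either roll back automatically or post the button, here’s what it might look like. The confidence score, thresholds, and callback names are invented for illustration; the real release bot’s logic is more involved.

```python
# Auto-rollback vs. "Roll back now" button (illustrative decision sketch).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Anomaly:
    max_z_score: float              # how big the spike was
    deploy_scope_confidence: float  # 0..1: confidence the alert is tied to this deploy

def handle_anomaly(
    anomaly: Anomaly,
    stop_deploys: Callable[[], None],
    roll_back: Callable[[], None],
    post_rollback_button: Callable[[], None],
) -> None:
    """Always stop the pipeline; roll back automatically only when confident."""
    stop_deploys()
    if anomaly.max_z_score >= 3.0 and anomaly.deploy_scope_confidence >= 0.9:
        roll_back()               # rollback is the fastest, safest deploy
    else:
        post_rollback_button()    # let a human make the call with one click

# Example wiring with stub callbacks:
handle_anomaly(
    Anomaly(max_z_score=4.2, deploy_scope_confidence=0.95),
    stop_deploys=lambda: print("deploys stopped"),
    roll_back=lambda: print("rolling back"),
    post_rollback_button=lambda: print("posted rollback button"),
)
```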

 

Abi Noda: So rollbacks are one example of a post-deploy-commander-world problem that you guys are focused on solving. Just a couple other examples you shared with me earlier that I’d love to talk about: one of the problems or challenges you mentioned was trying to find the right balance between giving developers good information and context, but not interrupting them and bombarding them too much. So share a little bit about that challenge right now.

 

Sean: Yeah, sure. I think most developers would rather not think about deployments as much as possible, which is reasonable, right? They want to get this task done, merge their code, move on to the next task, right? And if there’s a big tail where hours or days later you’re having to deal with this old change, that’s not good. You want them to be able to just focus on the thing that they want to focus on and need to focus on to do their jobs.

 

And so there’s this hard balance of, you want to give developers as much insight into their deploy as possible, right? They might have specific questions. They might want to track a specific change as it goes out because it’s risky for some reason, or it’s their first change ever, or whatever, right? You want to be able to give them all that context as easily as possible, but make it ignorable. If they don’t want to think about it, they don’t have to think about it, so you don’t want to be annoying on the other hand, right?

 

That’s sort of a constant struggle for our team, I think, and I think we really try to be sensitive to that, because we really don’t want them to have to think about it. And so I mentioned we have the UI, and everywhere in the UI, it’s like, this build is going out, and then you can click on the build. It shows you a list of PRs with the code owners, with links to GitHub for that PR, and everything is a link to the build. We do our builds in Jenkins, so everything’s right there, organized, easy to find for every deployment. And then also, like I said, unless you’re really tracking it because you want to watch your change go out, you’re probably not looking at the UI.

 

But developers still want to know, “Did it just go out?” They want to have some vague awareness that it happened, and so we’ll DM them in Slack through the releasebot and say, “Your PR, blah, blah, blah, with a link to it, is going out to canary 100%,” kind of thing. But also, for example, a recent ticket I just made based on some developer feedback, which we haven’t implemented yet but I want to implement, is to thread the messages in people’s DMs, because they’re like, “I don’t care about every step. Tell me that it’s deploying. That’s fine.” Because we have a commercial environment and a gov environment, it’s like, “You’re sending me 15 messages over the course of a few hours about this deployment.”

 

It’s like, “I want to know it’s there, but stop pinging me.” So it’s like, “Okay, let’s thread these messages so we’re not annoying people, but hey, it started.” They’ll see that first one, and then otherwise, unless they want to go look at the thread, they can ignore it.
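For anyone building a similar bot, the threading behavior Sean describes maps naturally onto the `thread_ts` parameter of Slack’s `chat.postMessage` API. Here’s a small sketch using the Python `slack_sdk`; the channel ID, PR link, and step names are placeholders, and the real releasebot is of course more elaborate.

```python
# Threaded deploy notifications in a DM (sketch using the public Slack SDK).
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def notify_deploy_steps(dm_channel: str, pr_link: str, steps: list[str]) -> None:
    # One top-level message so the developer gets a single ping...
    first = client.chat_postMessage(
        channel=dm_channel,
        text=f"Your PR {pr_link} has started deploying.",
    )
    thread_ts = first["ts"]
    # ...and every subsequent step goes into that message's thread.
    for step in steps:
        client.chat_postMessage(
            channel=dm_channel,
            text=f"Your PR {pr_link} is now at {step}.",
            thread_ts=thread_ts,
        )

# notify_deploy_steps("D0123456789", "https://github.com/example/webapp/pull/123",
#                     ["staging", "dog food", "canary 100%", "production"])
```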

 

Abi Noda: That’s awesome, and a great bit of advice for anyone who builds Slack bots and apps, like we do here at DX. Another really interesting challenge you brought up was that with automated deploys, there’s a lot more deploys going out, but these sequential releases, or sequential sets of builds that are released, lead to a little bit of a traffic jam sometimes, and you’re working on addressing that. So share with listeners a better description of the problem than what I just stated, and how you guys are thinking about solving that.

 

Sean: Sure. So this was another thing I mentioned just because the theme of this podcast is developer experience and productivity, and I thought it might be useful, but it’s something that we’re just starting to really think about and haven’t fully implemented yet. We are seeing an increased number of PRs per day, which is a good thing. We want our developers to be more productive, so win for Slack, but it’s making our deployments too large. We, as the deployment team, do not like large deployments, because when the bot stops and says, “I detected a problem with this deployment,” and you’re looking at, “Okay, what did we just deploy?”, I want to look at a list of three changes, not a list of 50 changes, because if there’s a list of 50 changes, I’m probably having to pull 50 people into an incident channel and say, “Hey, can you verify that your PR isn’t the thing that …” And then half the people are annoyed because they’re like, “If I made a front-end change and this is a database problem, what are you talking about?”

 

And it’s really disruptive and it takes longer to fix the problem, and so it’s like hurting the reliability of Slack. We need to keep these as small as possible. And we originally addressed this by speeding up deployments, and we didn’t change anything about the monitoring as part of that effort. That’s actually a lot of our work that we did last year. We went in and put basically tracing into all aspects of the deploy, and we were able to tell where we were spending time doing things. So there were things like we tell a server to deploy, and then that server would spend, I think a minute and a half downloading a Docker image, and then it would drain, and then it would warm up, and then it would be ready to go, right?

 

But we started pre-downloading the images, right? So when an earlier stage starts deploying, we download the image everywhere, and now we don’t have to waste that time, so we made a lot of optimizations like that, that cut about 20 minutes of just dead time out of our deploys. So we’re at our maximum speed. We can’t go any faster without directly affecting the safety of the deployment. And as our developers become more productive, we’re only doing one deploy at a time, so during our peak time of day, it’s not uncommon for us to have 20, 30 … I’ve seen as high as 50 PRs building up behind the outgoing deployment, and yeah, we don’t want that. We can’t speed it up, so we have to increase the throughput of the pipeline, basically. We have to increase the number of deploys that we can have on the pipeline at a time, so we’re considering, at least, that deploy number one moves to dog food, then deploy number two can be put on staging, and so we can deploy four artifacts at once. The benefits there are that we’re pulling builds off of the queue faster, so we’re able to get more deploys per day and smaller deployments. Another really cool benefit is that the buckets aren’t equally sized, so staging is a smaller deploy step. Dog food, canary, they’re all smaller.

 

They only take a few minutes, but production is a lot larger and a lot slower, and because newer builds are waiting for older builds to go out, that means they spend longer in staging and the other pre-production steps. So we’re not only going to be able to deploy more builds per day and smaller builds, but those builds are going to spend longer in pre-production, just hanging out, being monitored there. A build will hang out in a pre-production step for 15 minutes instead of three minutes, for example, stuff like that. So we see a lot of benefits there.
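To illustrate the throughput idea, here is a toy model, not Slack’s tooling, of a pipeline where each stage holds at most one build and builds advance as soon as the stage ahead frees up. With four stages, up to four artifacts are in flight at once, and anything waiting sits in the queue for less time; the stage names and build IDs are made up.

```python
# Toy model of overlapping deploys: one build per stage, advancing in lockstep.
from collections import deque

STAGES = ["staging", "dog food", "canary", "production"]

def step(pipeline: dict[str, str | None], queue: deque[str]) -> None:
    """Advance builds one stage (from last to first), then admit the next
    queued build into the first stage if it is empty."""
    for i in reversed(range(len(STAGES))):
        stage = STAGES[i]
        build = pipeline[stage]
        if build is None:
            continue
        if i + 1 == len(STAGES):
            pipeline[stage] = None                 # finished production
        elif pipeline[STAGES[i + 1]] is None:
            pipeline[STAGES[i + 1]] = build        # move up one stage
            pipeline[stage] = None
    if pipeline[STAGES[0]] is None and queue:
        pipeline[STAGES[0]] = queue.popleft()      # pull the next build off the queue

pipeline: dict[str, str | None] = {s: None for s in STAGES}
queue = deque(["build-101", "build-102", "build-103", "build-104"])
for _ in range(4):
    step(pipeline, queue)
print(pipeline)  # four builds in flight at once, one per stage
```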

 

Abi Noda: Well, I can tell this is fresh on your mind right now. Sounds like a really, really interesting problem, and I appreciate the insights you’ve shared. Sean, this has been an awesome conversation. I know, before the conversation, we were joking about how these Z-scores are simple, but maybe not so simple. I think the way you all are approaching this is really impressive, and I think listeners will find it useful as well.

 

So thanks so much for the time today. Really appreciate this conversation.

 

Sean: Yeah, definitely. Had a good time.

 

Abi Noda: Thank you so much for listening to this week’s episode. As always, you can find detailed show notes and other content on our website, getdx.com. If you enjoyed this episode, please subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Please also consider rating our show, since this helps more listeners discover our podcast. Thanks again, and I’ll see you in the next episode.