In this deep-dive episode, Brian Scanlan, Principal Systems Engineer at Intercom, describes how the company’s on-call process works. He explains how the process started and key changes they’ve made over the years, including a new volunteer model, changes to compensation, and more.
Abi: Brian, thanks so much for coming on the show, sitting down with me today to chat. I’m really excited for this.
Brian: It’s super to be here. Thanks so much for having me.
Abi: Awesome. Well I’m really excited to dive into the, to me, novel approach you’ve rolled out at Intercom, and the journey of how you implemented it. I want to start with learning about the on-call process at Intercom before the change you worked on. My first question is, how did on-call or support work before you ever had a formal on-call process? I think I saw you talking about how your CTO was the on-call team at the beginning?
Brian: Yeah. I think like any startup or any company who’s started from scratch, things just started off incredibly informally. Our CTO, Ciaran Lee, and co-founder of Intercom, he was the operations guy at the start of Intercom’s history. It was him who was in the weeds of what originally was the Heroku setup, and then later the AWS setup. Was in the databases, in the caches, everywhere, kind of keeping things running. Over time, though, Intercom was successful. The whole chat messenger thing took off. Things scaled and grew around Ciaran. Again pretty informally though at the start, but then we started building operations teams and other product teams. It was no longer just the team, the R&D team or whatever.
Individual teams all started springing up. With the strong culture of ownership in the company, the teams naturally gravitated towards doing their own on-call, solving their own problems, and figuring that out independently of each other. It definitely wasn’t structured or designed in any serious way. It was teams just taking on the work for the areas that they owned. As Intercom grew and scaled, and the number of product teams went from one to two, to three and four, et cetera, et cetera, the areas of ownership, the areas of stuff that teams would go on call for or would take responsibility for would grow with the number of teams.
There was no great plan there. It just kind of organically happened. The bad side was then, though, things were unevenly distributed. You had a bunch of core infrastructure I guess, which was owned by the team working with Ciaran as the early operations style team. Then there were some of the product teams as well, who ended up just building a bunch of stuff, and would operate it and would go on call for it. PagerDuty rotations were set up that were 24/7. People would start to get paged for things that would break. Then other teams wouldn’t as well. They would just build a bunch of stuff that really wouldn’t page and stuff.
Abi: That’s interesting. We’ll get into all the specific challenges that you guys discovered as you guys grew and scaled. I’m curious in the early days, the sort of informal on-call process. I don’t even know if you guys were calling it on-call at that point, but who was in charge of that? Was it your CTO was centrally overseeing the distribution of those responsibilities? How was that being led?
Brian: I don’t think it was centrally governed too much. Like I mentioned, setting up PagerDuty, and getting it up and running for a team. Hooking up a few alarms. Really that was kind of the extent of the organization that went into it. There wasn’t a shared or common approach to things like runbooks, or even what a good alarm looks like or anything like that. Teams were largely left to their own devices, or would just do enough. They had a lot of other things to worry about in the early stage of a startup. So teams independently would see what was working elsewhere, maybe have a few conversations with people. Set up a few alarms, and then go on call and see what happens. There was no governance function. There was maybe some common techniques and stuff that people would share, and would end up putting in place. But there was no centralized function, other than maybe a bunch of conversations in the hallway about things.
Abi: I know maybe at some point, in between just your CTO doing on-call and these product teams doing on-call, you also built sort of an operations team that was maybe supporting your CTO. What was the relationship like between that operations team, and then these product teams that started to take on more of the on-call on their own?
Brian: Yeah. That’s an interesting question, because it changed over time as well, as we were figuring out Intercom’s culture and technical environment and architecture. I guess in the beginning, to scale out the operations function, we needed a bunch of people who were more leaning towards the system side. So we built an operations team, and that was great for just getting up to speed on scaling and resilience, and making sure that we had the availability and ability to get ahead of where the company was growing. We were going through explosive growth at the time. So there needed to be a lot of work put into scaling our then MongoDB clusters, or our AWS environment and things like that. A lot of the on-call work did fall on this central operations team, who effectively took care of a bunch of our core infrastructure, our shared infrastructure.
As it happened then as well, we went through a few phases of discovery, of figuring out what architecture is best for Intercom. When I joined Intercom, it was kind of with the understanding that we were going to get away from this Ruby on Rails monolith. Start to get all grown up and professional. Build out a bunch of microservices, and get all of the amazing benefits from that. But it turned out in practice that this didn’t actually work for us. There are definitely many benefits, and it can be very appropriate in many environments to build like this and to solve many problems, but we had a relatively large Ruby on Rails monolith anyway. Then when we built out additional services for new features and things, it turned out that our teams were no more productive, in fact less productive with these services. Also at the same time, Ruby on Rails, both at the Ruby level and at the Rails level, just became a lot more scalable. We were able to run what we considered to be services that were higher throughput than we thought were possible previously in Ruby on Rails.
Over time, as our culture evolved as well, and our architecture as an output of the culture, we ended up falling back onto a Ruby on Rails monolith with a strong amount of centralized control, or I guess like a platform. Where most of our teams don’t run their own services, and don’t run a lot of independent microservices in AWS and whatever, and build on top of the Rails monolith that was well run and had great observability. Just other things that really aided both the development and the operational process of running these things. As those things changed, or as we figured out and felt out who ran what at Intercom, and what was the optimal architecture, the responsibility for on-call shifted around as well. Once we knew that we were going to maintain this centralized well-run bunch of services that our teams run on top of, we then decided to invest even more in this. Really go deeper and get more out of it, because the return on the work in those areas was so high.
So we ended up building teams to specialize in say databases or our core software, our frontend technologies, and our cloud technologies as a platform group really for the rest of the organization to build on. What ended up happening was… For a few years, teams built services, were on call for those. Then as we centralized and evolved our architecture, we ended up turning off a bunch of those services, re-implementing them in our monolith. Not necessarily centralizing all on-call forward, but more building on top of the shared components. So the responsibility of on-call shifts around to where we had product teams who were on-call for pretty much the features, but then a lot of the underlying infrastructure and stuff was on call by I guess a platform group.
Abi: I was cheering in my head as you shared all that, because we have a lot of conversations on this podcast with leaders who, they’re using Kubernetes and microservices. It’s so complex that they’re building really highly complex platforms on top of those platforms, to abstract all the complexity away from engineers. I’m personally a big fan of Rails monoliths. That’s what we had at GitHub, and a big fan of companies like Shopify as well, so that’s awesome to hear. I know I’ve seen you confess publicly that you personally love being on call. Can you share more about why that’s the case?
Brian: Sure. I mean when it comes down to it, I’m kind of a Unix systems administrator by trade. I think when I was in college, I got into running Unix systems as part of college societies. This was around, say, 1997, '98, so the idea of providing the ability for people to run webpages, and to participate in IRC servers and things like that. That was all fun. It just started off, really my first introduction to doing anything serious with computers was providing services and building stuff for people to use. I just got into a knack of enjoying fixing things, and well building things I guess. My first job at a college was in Unix technical support, and I wanted to be a Unix system admin when I grew up.
Then as you get out of the mode of just building simple stuff, or as I guess the use of computer systems has scaled and grown. I’ve had a lot of enjoyment or satisfaction, and indeed career growth through building and operating and being part of larger and larger systems. I’m not sure if I’m particularly talented at being on call, but I’ve done a lot of it. I find that’s a very good way of getting half decent at something, is to simply do it a lot. But also I enjoy the impact of on-call work. It’s usually where things start to break first, or where you can see problems. Everything, like Charity Majors says, is a socio-technical problem, and on-call is where you can uncover or start to pull some threads. See maybe where there are parts of your environments that have been under-invested in, or blind spots in terms of features that are being used in ways that you don’t expect.
I also get a lot of kick out of just fixing things for customers. It is satisfying to have an impact in work and to do things that matter. If you’re on call and on the hook for fixing the things that your customers use, that’s pretty satisfying as well. There’s a real short loop in terms of the time that you do your work and you can see the impact of it. It’s pretty compressed when you’re on call in a situation where something is broken, or nearly broken. I don’t think everyone should have the same obsession with on-call that I do, or that absolutely everybody should pin their career to it. There’s definitely more than enough different shapes of engineers, and more different types of work around, that there’s plenty of other stuff to do, but I’ve found it particularly fulfilling and useful in my own career as well.
Prior to Intercom, I was at Amazon. I ended up building a reputation for being half useful on outages, and building a reputation through being a call leader. Participating in their programs, and helping out on their availability programs as well. As a gateway into having influence in an organization, and being able to get stuff done and having customer impact, and selfishly just getting a reputation for being able to get useful work done, I found it pretty useful. So I’ve kept that going, and try to see how we can do it at Intercom. Try and grow… Not just do a great job of making sure that we do a good job of on-call, but also help grow people’s careers, and make sure that it’s a fulfilling thing for people who aren’t just me.
Abi: I completely agree with what you said about the impact, and developers being able to see the cracks in the system, the socio-technical factors like you mentioned. What’s your advice for leaders who may be struggling to get their engineers feeling excited and motivated about being on call for the same reasons you just shared? What advice would you give to leaders to convey that to engineers to get them excited?
Brian: Yeah. I think it’s really important to make sure that the work is valued. That it’s not just seen as something that people do on the side, or that is almost just part of the job. If you’re running services, you’re running stuff your customers are building on top of. The delivery of those services is the job. So the on-call work and everything that supports the on-call work really needs to be an extremely high priority for the business. That has to be recognized in terms of how leaders recognize the work, help prioritize the work. It’s a bunch of soft things, and a bunch of actual real hard prioritization decisions as well. There could be money involved as well. There’s everything from saying kudos or thank you, say going into a holiday period, and you know that there’s going to be a bunch of people on call. Or if people have particularly crummy on-call shifts, or there’s a time of instability, stepping up as a leader to give credit for that kind of work.
No one wants to see heroics, but there are times when it’s more stressful or more impactful than others when this work is happening. So as a leader, just recognizing that people can have a tough time, or that the work is real, can help. But then there’s all of the rest of the work that goes into really delivering effective services, such as building on top of solid foundations, being on top of alarms. Making sure that day-to-day operations work isn’t entirely ignored, and is balanced against product development and new feature development. As well, I think compensation matters, really paying out either in time or in dollar amounts for the time that people spend on call.
Abi: It sounds like the takeaway here for leaders is, one, if you want engineers to feel excited and motivated about being on call, recognize that work and make it feel important to the business. Number two, make it nice. Like you were saying, have solid foundations and processes. If on-call is just a chaotic and stressful experience, of course engineers aren’t going to be excited about it. Anyways, I think this is a really good transition into the next part of the conversation. I want to dive specifically into the challenges you guys started seeing. I think many listeners will relate to these, and it’ll be helpful to frame these problems before we talk about a lot of the improvements you made. So I want to first ask, and you alluded to it earlier, around what point in Intercom’s growth did these issues start becoming really apparent?
Brian: Intercom started up around 2011 or so, 2010-2011. I joined around 2014. I think I’d been at Intercom for maybe about two years in the fast-growing, spinning up lots of teams, spinning up lots of services kind of environment. We really started to see things creak around 2016-2017. We were kind of out of that early stage startup, and starting to have more established teams, and having lots of customers. Relatively far on in terms of the maturity of the environment and the company. So I guess we were probably about six or seven years old. The problems we had started to see after this explosive growth, one thing that stood out was we had too many people on call. I mentioned that I was at Amazon before. I’d seen some of the world’s largest services be run by a lot less people.
So I was looking at Intercom going, “There’s five or six people on call, all the time, every weekend. Intercom’s not that big, compared to some of these services at Amazon that really had global scale, but only maybe two or three people on call for them.” That seemed disproportionate. Another thing was that we didn’t have the same quality of on-call practiced by the teams and by the individuals who were on call. Not faulting the individuals here, but I mentioned that the on-call burden was unevenly distributed. So if you were in the operations, the core platform team, you knew that if you were on call, you were probably going to get paged that weekend. That there’s going to be some database that falls over. Maybe some customer decides to do something surprising. So you were going to carry around your laptop, and make sure you were no more than 15 minutes away from being able to get online. Taking it as seriously as an on-call shift in Amazon or whatever.
But then there were other teams who, even in office hours and during active development, they may only get paged once a month or even less often, but they were still on the hook for responding to these things. It was still going to be their phone that gets called by PagerDuty or whatever. This wasn’t that satisfying. We could see that there was an uneven experience. This uneven experience as well was a barrier to us having fluidity, in terms of how we move people around the organization. One of the things which has I think contributed to Intercom’s success is that we have been quite fluid in terms of team makeup and ownership.
We reprioritize frequently, and there’s a bunch of side effects. We do a lot of team renamings, and people move around a lot. But it’s actually one of the things that allows us to keep a lot of focus, and make sure that teams are working on the right stuff with the right level of autonomy, and not getting pulled in too many directions at different times.
Abi: One thing you talked about at the beginning was that you had too many people on call at any moment. I want to double click on that, and ask what you mean by that. Was this just because you had seven or eight product teams, and each team had to be on call? Is that why you ended up with too many people on call, in your view?
Brian: I mentioned that we had an ownership culture, and teams basically went on call for the stuff they built. When we looked at the number of individuals who were on call, it was way more than I would expect for a company of Intercom’s scale or size at the time. Comparing to my time at Amazon, there were global-scale services that had fewer people on call compared to Intercom. So the numbers just didn’t make sense. It was like Intercom, it was successful and doing well, but also we were built on top of the cloud. We were pretty resilient. At our scale and complexity, we didn’t really need that number of people to be on call. Also, the burden of work was distributed poorly or unevenly across the different teams.
Some teams would be fairly confident, or would know that they would get paged quite frequently, especially at weekends and other times. Other teams would be paged quite infrequently. This led to some teams really just not taking on-call as seriously, in terms of carrying their laptops around with them all weekend, and knowing that they had to respond quickly to things. Other teams just knew they had to get on call and get online, because if they didn’t, Intercom was down. As Intercom matured and grew, we wanted to have a more consistent experience, especially for customer-facing features.
You just don’t want it to be down to the almost chance of which team happened to get paged for a problem. Make sure that the people who get online to respond to the problems are available and ready to do so. While that was definitely one of the problems that we wanted to solve around the consistency thing, it was more… I also brought in some baggage I guess from earlier in my career, where I wanted on-call to be something that people actually enjoyed doing. It wasn’t something that just happened to be an attribute of, or something that was related to the team that they just happened to be on. We felt we could do a better job that way.
Abi: That unevenness you described between the different product teams that were on call, I imagine that’s just because some of the products were your flagship products with the most users, just the most tech. I imagine it was just a natural result of the distribution of users across those products?
Brian: Yeah, and some of the complexity in them. For example, our team that would be in charge of email delivery, they ended up, they would interact with a lot of third party services. They had some fairly complex queuing systems, and their own databases to process this work. So that ended up just being a large amount of stuff, and all that stuff can break. Whereas say a team who would have an equally important feature, like the inbox that powers Intercom, that would be largely built on top of shared components and some of our core databases. So the on-call work there might be more in terms of bugs that customers might run into, but chances are they’re going to run into them during business hours anyway. So especially during out-of-hours times, the burden or the difference in the work changes dramatically.
Abi: Yeah. Another thing I saw you write on the post you published on Intercom’s website, you said there appeared to be a general level of tolerance for unnecessary out-of-hours pages. I wanted to ask more about that. You just talked about how you wanted on-call to be a joyful, delightful experience. When you say that there was a level of tolerance for unnecessary out-of-hours pages, do you mean there was a tolerance amongst engineers, or a tolerance from leaders in the organization of the demands they were putting on the engineers?
Brian: Honestly, I think it was more the engineers. What we saw was we had a number of pages, a number of noisy pages, things like CPU alarms for databases. Maybe things that were put in place that aren’t necessarily a strong sign or indicator of customer pain, but something that might be useful to page on. So we ended up with a lot of alarms of mixed quality. Sometimes they’re useful, sometimes they’re not. What we saw in practice was, people were almost happy to get over-paged. They’d rather be paged on something just in case. Take a look at stuff, and then decide actually it’s okay, no one needs to do anything here.
So maybe we had a lack of confidence in the system, where that was kind of built in to the way that we were thinking about it, or the way we were working at times. Maybe being in a high growth environment contributes to that as well. Where you don’t necessarily know which way things are going to scale or are going to break in the future. But it was definitely, on an individual team level, we looked at the quality of the alarms, the necessity to wake somebody up for it. It kind of wasn’t there, but we also didn’t see from the engineers pushback on, or the confidence maybe to turn these things off or change the thresholds.
Abi: Yeah. That’s so interesting. It almost sounds like, to some extent it sounds like a good problem to have, but more so a double-edged sword. It sounds like you had such a great culture of ownership, that engineers cared so much that they wanted that extra sort of effort to be put in to on-call. But at the same time as an organization, as leadership, it sounds like you guys had concerns that this was unnecessary, and people would get burnt out. We’ll of course talk more about that later. I have one more question about the challenges you were facing at this time. I saw you also talk or write about the fact that, at this point in time, only that original operations team that you talked about earlier actually had any form of compensation for doing on-call. I’m curious, how did you guys arrive in that place where only that operations team had that compensation? It was just sort of an organic thing, that’s where things evolved to?
Brian: Yeah. It was quite organic. It was recognized by the team, by the manager at the time that their work was significant, the workload was significant. That we wanted to recognize it in a way which wasn’t simply time off in lieu, or to accept it as being part of the nature of the team. The additional compensation came out of that need of seeing just this constant amount of on-call work that the team did. Recognizing that through compensation seemed fair and consistent with I guess our overall principles around compensation. Recognizing how people work, and when they should be compensated for it. But it wasn’t something that we tried to solve across the entire org at the time, simply because the distribution of work probably didn’t need it, and the other managers elsewhere weren’t screaming for it either.
Abi: Well, on-call compensation of course is sort of a hot topic right now. We’ll come back to it later, but I now want to shift to talking about the changes you made, and how you made them. I’m really excited for listeners to get to hear about your approach. Before we get into the actual changes you made, I think it’s important to recognize that changes like this are hard to make at any organization. So I first want to ask you, how was the spark lit? Who said, “Let’s change it.” Was this something that came from the CTO, or was this a bunch of on-call engineers who got together? Who initiated this to begin with?
Brian: It did come from a number of sources. There wasn’t just one individual. It certainly wasn’t just me or anything like that who decided that we need to fix this problem. I guess there’s a bunch of things in our culture that led or enabled a bunch of this, and made us feel uncomfortable about the status quo at the time. One was, we wanted to have high standards with how we work with each other and treat each other. Also, Intercom did and still does have a high degree of respect for people’s work-life balance. We want people to work hard during business hours, get the work done, and consistently be able to produce high quality work, rather than do big crunches and get stuff out the door.
The idea that we would have a handful of teams or a few places in Intercom where we had this significant on-call burden that was really impacting people’s lives, it wasn’t satisfying, it wasn’t consistent with that. It wasn’t a spectacularly high bar that we saw around this part of shared work across the organization. We also wanted to be able to move people around teams pretty easily, and not have to take into account people’s desire or willingness to do 24/7 on-call, which can be pretty stressful and stuff. We wanted to decouple the work of keeping Intercom online with where we deploy engineers to work on different parts of the company. Keep maintaining the flexibility, being able to reorganize without having to think about on-call, that was a factor.
Wanting to have a high bar, and being consistent with I guess our culture as a company as a whole. Then just having a high bar to the overall quality of the work as well. We wanted people to be doing the highest quality work of their career. We wanted to be able to serve our customers well. We were seeing a bunch of inconsistencies in the engineering experience, but also the customer experience of Intercom as well. So it was clear we were going to have to do something to level up. But I guess we felt ambitious enough, and that our culture supported to do something relatively unique and centralize things. Try and solve this together as a team, as an organization, rather than having a single team to fix it.
Abi: It sounds like there were a lot of obvious reasons why improvements needed to be made, and a lot of support across the organization. I want to just dive right into the approach that you all came up with. I’ll just take some lines out of another interview you did. You decided to create a new virtual team who would take all out-of-hours on-call. That team was made up of volunteers, not conscripts, as you called it, or people who were just assigned. Engineers would rotate in and out of this virtual team after around six months. I want to first ask, how was this idea born, and how was it initially shared? What were the biggest concerns people had with this right off the bat?
Brian: Yeah. We built a team of engineers, like a working group largely made out of engineers, but with support from leadership to work on the problem, and to figure out and design a solution. The working group itself was taken from across the company, across the organization. So they had the scope to think of this at an organization level, rather than just on an individual team level. The idea of effectively turning that working group into the on-call team for Intercom kind of came naturally out of that. To be honest, we probably designed it a little bit that way to get some sort of outcome like that. The ways that made it successful I guess, the things that we started to do to put things in place was that we started small. We didn’t try and take every single alarm. We didn’t try and fix all the problems in one go.
We just incrementally took in alarms and responsibility, and rolled it out slowly. That gave us confidence, and the ability to try and learn as we went along. The initial design, I’ve looked back at the original document, we got loads of stuff wrong, but the original idea was roughly correct. The kind of problems we were solving and some of the main approaches have stayed true and have served us well. That ability to iterate and not get too stuck on the design served us well, and has ended up making it sustainable and a success over time. One of the problems that we were worried about was definitely ongoing membership of the team.
Maybe we’d staff it for six months, and then everyone would go away and the team would fall apart. That was a risk. Also, the complexity of the environment. Not everybody knows everything in any company, but especially not in Intercom, and especially not in a company that was growing and building more stuff. We didn’t know if one person would be able to do the job. That was something we would have to learn. Maybe also, there would be so much work to do that we would make the problem worse. That the person who was on call, the victim would be so overwhelmed with an incredible volume of pages. Their life would be really impacted by it, that it would be almost cruel to do that. They were all big concerns going into it.
Abi: I’m curious to ask, what was an example of an assumption you got wrong? You mentioned you went back and looked at the original design doc for this.
Brian: Yeah. When we were rolling out, we kind of had a carrot and stick approach to bringing alarms over from the teams. We suspected that maybe we’d bring some alarms over, they might be low quality, teams might… When we get a page out-of-hours, we open a high severity issue. We document what was done, we follow a runbook. We then expect during office hours for the team to follow up, and tune the alarm, or fix the problem, or take some action. We were worried that teams wouldn’t take any action. That they would tolerate the annoyance or the interruptions that our volunteer team was taking on their behalf. But what we saw in practice was that teams were not just as responsive as if they were paging their own team, but more responsive.
I think paging somebody out-of-hours who isn’t on your own team seems to have a higher degree, a higher weight or more guilt involved. So we saw teams very quickly fix up their alarms, their runbooks, and really respond very quickly to the stuff that was causing pages to come out-of-hours. That was something that we thought we were going to have to do a lot of active management, and really using the stick of sending alarms back to the teams, but in practice, we just didn’t. People really bought into it, and were happy to fix things up, or urgently fix things up to prevent people being paged out-of-hours.
Abi: A question I would personally ask in a situation like this, the volunteer model on one hand sounds really nice. On the other hand, was there concern around the people who weren’t going to volunteer? That it would be inequitable in some way, because not everyone would participate? Were there concerns around that?
Brian: Not particularly. We had a core group of people who were interested. There’s people like me, who actually like doing on-call. We also recognized that some people don’t like that work. Or some people have young families or other responsibilities to their communities, or otherwise that just makes it pretty normal to opt out, or to want to do your work between 9:00 and 6:00. That’s your work time, and then not worry too much about the state of the servers or whatever most of the time. There hasn’t been any backlash against people who don’t volunteer, nor has there been a perception of a divide between people who opt in and people who opt out.
It’s more that we recognize and encourage the work. We recognize that the work is real. It’s something that kind of boosts I guess an engineer’s reputation or contribution to the company, more so than being something that detracts from people who don’t opt into it. We have plenty of engineers. Some of our best operational engineers who during office hours are some of our best folks at hopping on problems and fixing them and stuff, but just their personal choices are that they prefer not to do that type of work out-of-hours. It hasn’t held them back in any way in progression or in other ways at Intercom.
Abi: Well we’ve talked so far at a high level about what this new design and solution looked like. I want to dive more into the details now. Earlier you talked about how having this sort of homogenous or simple monolithic tech stack is kind of an enabler for this type of a process, and that that’s intentional. I’m curious, again I mentioned earlier there’s a lot of these companies that are struggling to abstract away all the complexities of infrastructure from their engineers with tools like Kubernetes. At Intercom, you did mention there are specialized sort of systems teams. How confident are engineers in understanding the full stack, and even working with it?
Brian: Sure. I guess one thing is that, for our out-of-hours on-call team, we don’t expect people to be experts in all parts of Intercom. The idea is, you don’t have to do a huge amount of training, or have in-depth knowledge in every single data store we use, or in every single feature. We consider the on-call engineers to be first responders, to be able to apply first aid, to be able to triage. What that means in practice is, follow runbooks. Use whatever understanding of Intercom that they have to respond in a way that is fairly standardized and well understood. If they can go further than that, great, but we don’t expect them to solve every single problem. We set those expectations very, very clearly upfront that they are first responders. We will give them the documentation they need to make some progress, or make a good start, or solve the common cases, but that’s as far as they have to go.
We also then provide them with an escalation. It started off quite informally at the start. I think it was a few managers maybe, or a few leaders who were able to be paged by the on-call engineer in case something went really bad, or they weren’t able to make progress. We’ve since formalized that into an incident commander program. It’s also another volunteer out-of-hours oriented program that ensures that we’ve got a consistent response in terms of responding to customers, updating our status page, and making sure that large scale events are well managed. That’s been formalized over time, but it’s always been there as a way to help out any engineer who’s on-call, who for whatever reason just needs almost like another pair of eyes, or needs help escalating something or bringing in other people.
We also have always had in the design that you should be able to bring in people who aren’t on call. There’s going to be weird problems. You might need to bring in some subject matter expert. If something goes wrong with Elasticsearch, we’re probably going to have to bring somebody in who knows something about Elasticsearch. Or in other cases, if something’s blown up that was being worked on that day, then we should bring in that person who was working on the thing that day. While we do have, I guess, the designed process of having alarms going to an individual, and then an escalation process in there, in reality what we’ve seen is a lot of ad hoc management of who is brought in to look after an event, to escalate something weird, or to handle something that the on-call engineer can’t fix.
It can be sometimes whoever is online, whoever’s in Slack at the time of an alarm going off can help out. In other cases, we have to go to individuals to fix different things. But in the vast majority of the cases, the information that we give in terms of runbooks and the homogenous setup of how most of Intercom features are built, those two things combined are typically enough to make sure that the vast majority of alarms and problems can be resolved by one individual. So somewhat it’s accidental, it’s just the nature of our architecture, how we’ve built, what our culture is. But also setting expectations correctly and giving them ways to escalate, has de-stressed or made it a lot more practical as well for one individual to go on call.
Abi: It sounds like just having a buttoned-up process and good onboarding for on-call engineers is a big part of enabling people, as you mentioned, of varying levels of expertise to feel confident in that role. I want to also loop back to the incident commander role in a moment, but first I want to continue discussing just the overall on-call experience. You talked about how joining this team can boost an engineer’s reputation, this natural appeal and incentive to join. How do you actually logistically do this? Is there some sort of draft every six months, and do people apply? Then is there a formal onboarding or training program? Is there someone leading that? What does that intake flow look like?
Brian: In the past, it used to be that somebody would just ping me on Slack, and I’d add them to the list. It has changed. In the early days, we used to have a lot of meetings. We would do a weekly meeting of on-call. We would review every single issue. If you wanted to join the volunteer team, you would first join that meeting and see the discussions. By reviewing every single issue, and by going through what the engineer experienced, we would share with each other the common cases. You would see a lot of the common problems through the eyes of the engineers who had just responded that week to the problems. That was a big part of our onboarding process. It was kind of heavyweight, so it would take a lot of time, a weekly meeting with a bunch of engineers at it.
So we changed that up over time to have a more self-serve process, where we wrote up materials, and provided a way for people to go through a little bit of a curriculum at their own pace. We never had too much of a shadowing system, or ran too many game days. We tried a few different things out. But honestly giving people the opportunity to go fully on call when they’re comfortable, by following our self-serve training guides, I think that’s largely worked out well for us. In terms of staffing the team, first of all, we always worried that people would leave at a higher rate than people would join. But we always tend to have a bit of a wait list, and we’re happy to take in people from any part of the company as well, like if you’re a frontend engineer or a data engineer.
We think of things in a full stack way in general at Intercom, and we’re happy for people to maybe do a little bit of growth, and get their hands on different systems that they mightn’t be too much in day-to-day. As long as people are comfortable themselves, and can see the nature of the work, we’re comfortable with putting people on call in these different things. At the start, we made sure that we had people who were going to start. We didn’t announce the team, and then wait for people to show up. We’d already made sure that we had a full roster of people who were ready to go. What we have a lot of these days as well is people returning. We’ll move people out after six months. Then people will come back to us a year later, and say, “Hey. I want to do more on-call,” or whatever. So we keep it fresh with new people joining, but we’re happy to have people leave and then come back as well.
Abi: I was laughing when you shared the strategy of staffing the team before you announced that you were recruiting for the team. That’s phenomenal and inspiring that you have a wait list now. I was going to ask that question. I was wondering if the team was growing proportionately to your org, or if participation had been more flat. It sounds like it’s been a huge success. I want to ask you about that incident commander role in this process. You alluded to it and described it at a high level. How did this come to be, and how does that role work?
Brian: Sure. It didn’t come directly out of our out-of-hours on-call experience. We always had an escalation available for the engineer who was doing on-call. Honestly the workload wasn’t that high, and the ad hoc approach didn’t really have many problems. What we did experience though in Intercom’s history, as we grew and the complexity of our incidents grew, we got to the point where we had some pretty complex large scale company-wide events. Or availability style events that needed a bunch of different activities to happen inside the company, a bunch of different teams to do a lot of work. We needed somebody to really take responsibility. Be a single point of contact, and run the incident and run the event itself. I don’t think it’s that controversial a design. It’s certainly not as novel as I think the volunteer on-call. I think the idea of having a single person who’s managing and running an event, it’s pretty common in the industry.
Like I mentioned before, I participated in that at Amazon as well. The main purpose of the incident commander role at Intercom is to ensure that we’ve got the right people on the incident. We’re not just waiting around for people to show up. We’re not waiting for people to check in with whatever work they’re doing. A lot of it is just management of what’s happening. Knowing who’s doing what, making sure that people are coming back with what they’ve done on time and stuff, and that the job of communicating internally is done well. That’s something that we found hard to just get right without a bit of organization and structure in place. Updating our communications team or sales team or support teams, all that stuff is tricky when you’ve got an outage that you’re trying to manage as well.
Abi: Absolutely. Well a sort of similar technical change that you described was moving your alarms, defining them in code using Terraform modules, and having those go through peer review. Can you talk more about this, and perhaps in layman’s terms just for people who aren’t familiar with Terraform? What does this mean, and what problems did it solve?
Brian: We were dabbling with infrastructure as code back in 2017 or so. Starting to define bits of our AWS infrastructure using Terraform. Terraform has its own proprietary language, where you can use… It basically states in configuration what the desired infrastructure you want to be created is, whether it’s a bunch of EC2 hosts or SQS queues. But also, you can not just define infrastructure, you can really interact with APIs in any way, and get things into some sort of desired state. So we ended up building out, or had the idea of building out our alarms using Terraform. We’d suffered from some problems, where people were using UIs to define alarms that were paging people. It felt weird that we would have such high quality controls over the code that we would push, but then when it came to things like pretty critical alarms, there was no oversight.
People were, “You go in and do whatever you want in the UI.” There would be no oversight. There would be no review necessarily of what changes were made in there. So it was hard to keep a high bar, and to keep things consistent across our endpoints, or the things that we knew we wanted to monitor well. Also, the two vendors that we were using for alarms at the time, Datadog and AWS CloudWatch, they didn’t make it easy to be able to review at scale, I guess, the different endpoints and the different things that you wanted to have in place. By putting this stuff into code, it allows you to run Unix text processing tools against them. I want to grep for this thing. It’s going to be far faster to process.
You can programmatically generate these things as well, and make sure that they’re consistent, and they always exist. Run linting against them, and make sure that they’re up to a relevant standard. Applying a bunch of the code quality techniques and processes that we were using in our day-to-day, in terms of managing our code base against managing our critical alarms as well. So we ended up using it as a bit of a forcing function for our on-call team. We just happened to be doing this project around the same time. To get an alarm over to the volunteer on-call team, we made sure that the alarms had to be in a certain place in a GitHub repository. By forcing the peer review, and even just using the likes of code owners, we made sure that a centralized group of people were reviewing these things.
That they were quality controlling the runbooks that were attached to the alarms. The runbooks were also checked in as part of the same code base. So having everything in one place was this like forcing function for quality control. But also the act of having to move alarms from one place to another, and us giving clear guidance to our teams around what a good alarm looks like, meant that we ended up throwing out a lot of our alarms. The ones that remained were in a lot better health as a result of this. Having to push it through and get it into this GitHub repository, the least interesting part was putting them in GitHub almost. It was like the process to get them there just led to a far higher level of quality.
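As a rough illustration of the linting idea Brian describes, here is a minimal Python sketch of a check that could run in peer review once alarms live in a repository. Everything in it is an assumption made for illustration: the `Alarm` fields, the rule that only paging alarms need a runbook, and the function names are hypothetical and are not Intercom’s actual schema or tooling (which, per the conversation, is built on Terraform modules).

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Alarm:
    # Hypothetical fields for illustration; not Intercom's actual schema.
    name: str
    owner_team: str
    runbook_path: str
    pages_out_of_hours: bool


def lint_alarm(alarm: Alarm, known_runbooks: Set[str]) -> List[str]:
    """Return human-readable problems; an empty list means the alarm passes."""
    problems: List[str] = []
    if not alarm.owner_team:
        problems.append(f"{alarm.name}: missing owner team")
    # Assumed rule: only alarms that can page someone out-of-hours need a runbook.
    if alarm.pages_out_of_hours:
        if not alarm.runbook_path:
            problems.append(f"{alarm.name}: paging alarm has no runbook")
        elif alarm.runbook_path not in known_runbooks:
            problems.append(f"{alarm.name}: runbook {alarm.runbook_path} not found")
    return problems


def lint_all(alarms: Iterable[Alarm], known_runbooks: Set[str]) -> List[str]:
    """Lint every alarm and collect all problems, in input order."""
    return [p for alarm in alarms for p in lint_alarm(alarm, known_runbooks)]
```

A check like this could fail a pull request whenever `lint_all` returns a non-empty list, which is one way to get the "forcing function" effect of keeping alarms and runbooks in the same reviewed code base.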
Abi: Well, thanks for that in-depth explanation. I personally found that really interesting and insightful. Another topic that I know came up in your articles and discussions around this was the on-call compensation, which we’ve already discussed earlier. Of course this is a volunteer model that you rolled out, but you did decide to compensate engineers. I read something where you wrote that you at least at the time had decided on a flat rate for each on-call shift of €1,000 before tax in Ireland. I wanted to ask about this. First of all, is this still the compensation model? Also, with engineers who aren’t based in Ireland, is there adjustment of that based on geography? Yeah. We’d just love to learn more.
Brian: Yeah. I’ll answer the money question first. We did have adjustments for US and UK based engineers. We didn’t think too long about it, to be honest. We picked nice round numbers that just roughly made sense. I think €1,000 for an on-call shift, without it being linked to the number of pages you get, or without actually having necessarily to do work, I think it’s slightly generous compared to what I see elsewhere. I think we did want to decouple it from the number of alarms you deal with, or to not have it gated by say like an hourly rate or something like that. We wanted to recognize that if you’re on call that’s disruptive. That you need to do things like carry your laptop around, change your weekend plans, not go swimming, not hop on a plane to Barcelona. These are real things that impact your life. It’s not just the job.
The work and the stress of being on call isn’t just the moment you get paged, and fixing up the problem. It’s the knowledge that that can happen at any time, and that kind of impact on your life. So we found it personally important to recognize that. Also I think there’s kind of weird problems that can be introduced when you link on-call compensation with being paged or with time. We didn’t want that to be the thing that we wanted to optimize for. The easiest decision was to simply have a flat rate, regardless of whether you’re paged or not. It makes things simpler. You don’t have to count up a number of engagements or anything like that, but it also recognizes the overall impact that being on call has on someone’s life. I know even today, despite claiming that I like on-call, and despite having done it a lot over the last decade or so, I still don’t sleep as well when I’m on call, no matter how much practice I have with it.
Abi: Well, some people listening to this may be thinking, “Wow. That sounds like a great model.” Others may be going through trying to figure this problem out right now, and appreciate this advice. You mentioned that your approach feels generous compared to other models you’ve seen. You talked about the different ways you can sort of calculate the compensation. Can you just briefly outline some of the other approaches you’ve seen that are different than yours?
Brian: Sure. Time off in lieu, if it’s enforced I guess can be a good way of compensating. You want to make sure that people are rested, if they are engaged in on-call. Taking time off during the week, or getting half days, or whatever’s appropriate for the company. That can be a nice way. It’s a nice way, and is common enough for people to do on-call work, and effectively you get compensated with time off for it. That’s common enough. I think some of the problems are that it reduces capacity. You’re losing your in-office hours, or the time when you want people to be working together and as a team. That can be pretty disruptive. I’m not saying that we want people to be not rested or completely worn out by doing on-call.
We do want people who are on call to take adequate time off. If they have to not show up to the office, then that’s the case. But I think just doing time off on its own can have some problems, but obviously it’s cheap. So that can be a way of getting it if budgets are a problem. It’s something that maybe managers can solve for themselves. Other ways of compensating is getting compensated per outage or per event. There’s problems there around gamification. Also having to track the engagements. It is work to track how many hours people put in. You just know that there’s going to be spreadsheets, and people asking about things.
It is fair in a way to compensate people for the hours. I guess maybe if you have a culture of that, or other systems in work where you’re dealing with hourly elements or time-based elements to your work, or shift-based elements already, then I can see how that makes sense for many companies. But that’s not something that we had any experience of, or any systems to work with. I think those are two of the most common ones, but really we went for the one that I think reflected the way we work the most, and also the easiest to administrate.
Abi: In general, do you think all companies should compensate engineers for on-call? Or do you think there are cases where that doesn’t make sense for the business for certain reasons, or due to the type of process that is implemented?
Brian: Certainly in an early-stage startup, I wouldn’t expect it. The nature of the beast is that you are like five people in a room. If the servers go down, then somebody will respond to it. At the other end of the scale as well, if you’re working for Amazon or whatever, and you’re doing 24/7 on-call, you’re going to expect that’s going to be well built-in or designed into the system. Designed into how the company works and operates and stuff. There’s a scale really from one place to the other. In Europe, I think it’s difficult legally not to compensate in some ways for on-call, or at least it needs to be very well designed into the employment contract. It certainly can’t be as easily rolled out as the process I described. I do think it’s real work. I think it can be recognized in other ways, in terms of performance and people’s impact in the organization. Those things can be as important as the actual monetary compensation.
One thing we were wary of, or a risk with our compensation strategy, was people getting too addicted to the on-call compensation. We wanted it to be meaningful that people would know that it exists, and they could go away for a nice meal or something like that. But what we didn’t want was people paying for their groceries, or it being a structural part of their income that they’re dependent on. You have to balance those kind of things, and you have to balance how important things are to the company, and the nature of the job being done. Intercom, it’s important that our service works, but it’s also not mission critical in a bunch of ways that other companies’ infrastructure is. We had a bit of latitude that maybe other places wouldn’t. I think these things influenced the compensation structure as much as how generous we were feeling when we designed it.
Abi: Well it’s clear that compensation for on-call is a complex issue, as you’ve described here. Bringing up problems like gamification or structural reliance on that income, those are all really important considerations that I think listeners will be able to take away. Taking a step back with this on-call process you’ve landed on, do you feel this is suitable for all companies? How does it compare to other on-call processes you’ve personally been a part of? What’s sort of your sales pitch or argument for listeners as to whether they should or shouldn’t adopt a process like this?
Brian: Sure. Back when I was young and naive, say 2018-2019, when we started telling the world about this amazing on-call setup that we put in place in Intercom, I was doing evangelizing. I wrote a blog post and gave a few talks, and started having conversations with people in different companies who were interested in rolling out on-call, or fixing on-call in their places. It suddenly struck me, once I started actually talking to people, that a bunch of the solutions I had in mind that I thought were great were actually only realizable, or were just artifacts of Intercom’s particular culture or socio-technical environment. That there were so many other factors involved that to design an on-call setup in any environment, it just needs to be cognizant or conscious of the requirements of the environment. The business requirements, but also the architectural layout and how people work, how work is celebrated. All that stuff has to be taken into account.
Now it doesn’t mean that change is impossible, that you can’t introduce new ideas and new concepts. I think I ended up not recommending specific things that you should copy necessarily, but more taking away some of the approaches or higher level things that we were trying to solve. I think the act of deciding that on-call isn’t something that rules you, or that your operational work is in control and taking control of that. That’s way more important than say even the specific compensation strategy. I think copying and pasting Intercom’s approach probably can work in many companies that have maybe a similar scale or culture or whatever, but chances are you need to figure it out from scratch, and do it with the particular company in mind. So I’m less encouraging for people to copy and paste, like I mentioned, but to have a similar zero tolerance or ambitious approach to be able to improve things. Because that’s ultimately what got us to have a successful setup in Intercom.
Abi: Well your advice to listeners to not blindly copy and paste, but to understand the principles behind what Intercom implemented, reminds me of this conversation I just had on this podcast with someone who had worked at Spotify. He came on to talk about how so many companies copied and pasted the Spotify squad model, and similarly ran into a lot of challenges with that. Hopefully, this conversation and that prior episode will sort of reemphasize to listeners that these approaches, as awesome as they may sound, shouldn’t just be blindly copied. Folks need to really figure out their own solutions, based on their own context and business requirements. I want to ask you, since evangelizing this and writing about it, and doing podcast guest appearances like this one, what’s changed more recently? I know one thing you brought up was that you’d recently spun up an on-call process for your security team. I’d love to hear more about that, and any other more recent changes you guys have made.
Brian: Yeah. We’ve been happy with the volunteer model. It’s gone down well, including in the incident commander area. When it came to designing a setup for security, basically it’s a good starting point. We knew that there was a good chance that people would get behind us. We mightn’t struggle to get people involved and stuff. The problem was quite different though in security. Security events happen, and there are occasions when you’ve got maybe a potential breach, or maybe a customer who’s got some problems, or there’s Log4j that needs updating everywhere on the internet. In those kind of events, they’re special. They’re more like incidents. You escalate wildly, and bring in whoever needs to be brought in. But the security case for us in Intercom, where we’re at at the moment is we certainly don’t have a 24/7 security NOC, with people in a room with loads of big screens and maps of the world and all that kind of stuff.
Rather, we needed eyes on a bunch of signals, a bunch of inboxes, a bunch of certain types of alarms, that mightn’t be pageable or mightn’t be so urgent that we need somebody to get out of bed at 3:00 in the morning. But you do want to have somebody at the weekend checking that everything’s okay. What we noticed was our head of security was regularly doing this. He had just taken it that his job was to check on everything all the time. That’s great and worked up to a certain point, but we started to feel sorry for him. Reckoned that he needs to take breaks from time-to-time, and that this is work that is good to be shared as well across the team. It’s good to standardize this kind of stuff. It means that we get in control of the signals, and understanding what types of security events need escalation and documenting these things and all. These are all good mature things to put in place as you get bigger.
So we put in place a volunteer on-call. Not quite on-call, but weekend checks, where you do a sweep of a bunch of inboxes, a bunch of signals, and do a bunch of follow-up. I was on call last weekend. I think I had to send a few emails to a few security researchers, look at a few alarms. It wasn’t like Intercom was hacked, it was more just looking at different signals, and making sure that there was nothing going by that needed more investigation. It’s no fun showing up on a Monday morning when something’s been broken. When there’s been something going off all weekend, or something that was flagged to you maybe on Saturday. So we’re just trying to avoid those. A different design, but still using volunteers, and still using some of the same qualities of the on-call setup that we do for paging alarms.
Abi: Well I forgot to mention this earlier in the episode. Personally, I’m a customer of Intercom, and have been for a long time. Hearing about that, what you just described, was very reassuring. I want to conclude with a topic that we actually started talking about a couple weeks ago, when we met. I asked you what are sort of the big hairy problems with on-call that you’re still figuring out, you see other companies struggling with? You brought up that one challenge is this balance between the customer impact, being responsive to customers and providing a great experience, while on the other hand not burning out your engineers, and having hundreds of people stuck to their pagers. I just wanted to ask you more about that, and that journey of finding the right tolerance level and balance. How have you been navigating that at Intercom?
Brian: Sure. When we introduced our shared on-call, we did introduce higher standards around the quality of alarms. We tried to introduce the concept as well of, paging alarms should not be on symptoms. They need to be on customer impact. Trying to make sure that we’re not just paging on things that might be bad, could be bad, or maybe are signs of something going bad, but are somewhat detached. I mentioned earlier like database CPU levels. These things typically are a signal that something’s gone wrong. But actually if high CPU isn’t associated with a high error rate for customers, or a degradation in the performance, or error rates or whatever, then it’s completely okay to tolerate. It’s not something that you should be paging on.
Increasingly, we’ve been moving more towards SLO style alarms. I guess it’s even a higher level than simple metric based alarms, where we’re looking for inputs or signals from our overall environments that the experience of customers has gone bad, and that then we page on. There is a balance though, or there’s a bit of back and forth on that. You sometimes want to know when things are going to get bad, or on the way to going bad. You don’t want it to be the case that a customer is experiencing really bad stuff for an hour before you get somebody to take a look. But also, you don’t necessarily want to get someone out of bed for maybe just one small customer is having a problem because they’re sending a malformed request to our APIs or something.
That’s something that’s going wrong, and they’re having a bad experience, but it’s not necessarily our fault. It can be hard to separate those signals. Really, we haven’t perfected it. It’s not one thing. You kind of have to do it on a case-by-case basis, examining individual alarms and things that go bad. Looking for the highest level approaches that we can, to recognize those things. Occasionally you have some alarms that you just trust as well, that you know when this thing goes off it’s actually a pretty good signal. Yeah. We try and minimize those, but a few of them still exist.
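To make the distinction between paging on customer impact and paging on resource metrics more concrete, here is a small Python sketch. It uses an error-budget burn rate, a common way of expressing SLO-style alarms, which is swapped in here for illustration; the conversation does not say this is how Intercom computes its alarms. The SLO target, the thresholds, and the `Signals` fields are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # Hypothetical inputs for illustration only.
    error_rate: float  # fraction of customer requests failing, 0.0 to 1.0
    db_cpu: float      # database CPU utilisation, 0.0 to 1.0


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def decide(signals: Signals, slo_target: float = 0.999,
           page_burn_rate: float = 10.0, cpu_warn: float = 0.9) -> str:
    """Page only on customer impact; high CPU alone becomes an office-hours ticket."""
    if burn_rate(signals.error_rate, slo_target) >= page_burn_rate:
        return "page"
    if signals.db_cpu >= cpu_warn:
        return "ticket"
    return "ok"
```

With these assumed thresholds, `decide(Signals(error_rate=0.0005, db_cpu=0.95))` returns `"ticket"`: the database is busy, but customers are not seeing enough errors to justify getting someone out of bed. As Brian notes, the harder part in practice is choosing the signals and thresholds case by case, which a sketch like this does not solve.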
Abi: You talked about SLOs, and sort of inspecting the process on a case-by-case basis. I’m curious, at a high level, are there a set of metrics or signals that you’re relying on to fine-tune this balance at all times? For example, are you tracking, I don’t know, support resolution satisfaction scores from customers, versus internal developer burnout sentiment? Are there any concrete things like that you track, or are you kind of feeling your way through it?
Brian: We review these things constantly in terms of our availability metrics. We survey our on-call engineers around their experiences as well. We have an availability program. A colleague of mine, Hannah, she runs a program with oversight of all aspects of Intercom’s availability. The on-call, our incident commanders, and internal and external reporting about how well we’re doing, in terms of whether our features are working, and our SLAs and things like that. That’s all centralized there, and we’ve got a full-time TPM who looks after this space. We survey our on-call engineers after every single shift, to see what their experience was like. We tag all of our issues, if there are any issues or problems with the quality of the alarms, or the quality of the runbooks and things like that.
That generates data, as well as closing the loop on making sure that the product teams get action items to do as a result of alarms going off. Between I guess our availability metrics and the data that we have in terms of the on-call experience, and even just the number of engagements we have or the number of incidents that we have, these are all numbers that we use to decide whether to invest more or to slow down development from time-to-time. It’s not as simple as like a single SLO that drives these kind of decisions. It’s a number of inputs, with the experience of our engineers being a large factor in there as well. Obviously, we do look at the customer experience and our endpoint SLOs, and stuff like that as well.
Abi: Well, that’s awesome. I didn’t expect to learn so much when I asked that question. That was really great to hear about how you’re systematically collecting feedback from developers about their on-call experience on an ongoing basis. Brian, this is such an insightful conversation. Really enjoyed the time. Thanks so much for coming on the show today.
Brian: It’s been great. I could probably talk about this for another few hours.
Abi: Thank you so much.