Jack Li explains how his production engineering team rolled out a new incident review process, how they’ve made the case for investing in reliability, and specific tools his team has built to improve reliability.
Abi: Jack, thanks so much for sitting down with me today. So excited to have you on the show, and excited to chat.
Jack: Yeah. Happy to be here. Thanks for having me.
Abi: Well, when we were talking earlier, you shared a little bit about your background and why you’re so interested in reliability, and I thought it was really thoughtful, so if you wouldn’t mind just beginning with a little bit of background on why you’re so focused on this area.
Jack: Yeah. Definitely. So my career started doing a lot of stuff in building products as an engineer. I think it’s a role that many engineers can relate to, where you’re going ahead and you’re building something and seeing people use it. And as I grew my career, I started to learn about scale and all of the things that need to be solved in order for products to actually be used by a lot of people.
And so I transitioned. At this point in my career, I was working at Shopify, and I went over to the … We call it the prodeng team, which is the production engineering team. And I joined a team that was very focused around enabling people to build products at scale. And a lot of this is around how do we not just help people build products at scale, but how do we do this fast and repeatable. And so we called ourselves a developer acceleration team, which is how do we make people productive and accelerate their work as fast as possible.
And so a lot of my work was in that realm. And then as the company grew even more, we began to think a lot about reliability and how do we actually take the work that we’re doing for people and not just make them work faster and be able to build stuff at scale, but how do we actually take those products at scale and keep them available.
One of the things I remember very clearly or that stuck with me was the number one feature of any product is availability. And so if the product’s not available, it doesn’t matter what work we’re building, because that work is not accessible. And so much of my work today has come from that and has made me really passionate about how do we take tools that we build and processes and people and make it possible for them and enable them to build reliable products.
Abi: Got it. Well, I love the journey you’ve gone on in this space and really excited to dig into the work you’re doing now. So before we get into it, you’re responsible for reliability of Instagram Reels. Can you share a little bit about what Instagram Reels is in terms of the organization, how it fits into the rest of the broader Instagram organization, and what’s the makeup of the team?
Jack: Definitely. So my team that I’m the tech lead for is the Instagram Reels production engineering team. We are situated in the middle of the organization, and we partner with different software engineering teams who focus on building the product itself, the machine-learning components that help recommend content, as well as the infrastructure teams who build all the supporting infrastructure underneath. And our role in our team is we try to be the glue that connects all of the different teams together and make it possible for us to ship a high quality, reliable product and figure out how do we actually keep this product going day to day and providing the best experience we can for all of the users.
Abi: Well, before we get into some of the specific initiatives you’ve worked on, I want to ask. SRE platform engineering seems like a quickly evolving discipline and field. So one question I have for you is, what was different, at least from the get-go, in terms of the way Instagram is approaching production engineering when compared to your role back at Shopify?
Jack: That’s a very good question. I think one of the big differences from my transition that I noticed is at Shopify production engineering was definitely a role that was very synonymous with being a software engineer there, where people switch back and forth, the expectations are more or less the same. I feel like at Meta and Instagram the role is a bit more defined and there’s a working model that people are used to.
What I found that this translates to in practice is that the production engineers at Instagram and Meta have a deeper focus in the areas that they’re responsible for. And so as I work on my team, my role is I’m not going ahead and building features. In fact, in the last six months I haven’t even written a ton of code. But a lot of the work that I am very focused on is how do we actually apply the mindset of being an SRE and thinking about the platform and processes and being able to turn that into things that are tangible at the end of the product.
I think that’s one thing that is very unique to the work that we’re doing on our team, which is also unique in the company itself, where we are a team that is production engineering slash SRE slash platform engineering, but we’re not really working on an infrastructure component. Instead, we’re taking the mindset that it takes to build a database or a queuing system and we’re trying to apply that to a product. And how do we actually take the same methodologies and build a product and have it operate the same way you would operate a database and stuff like that?
Abi: Got it. So one of the things you contrasted there was at Shopify it sounds like production engineering was a little bit more of a generalist role, whereas at Meta and Instagram it’s a much more focused role. Can you say more about how the working model … You referred to the working model earlier. How it’s different. Is it in the way you engage with the rest of the organization or the goals that your team has? I’d love for you to elaborate.
Jack: So when I was at Shopify, definitely, being a production engineer, your team was tied more towards a certain piece of ownership. Usually it’s a technical component. So we had teams that were responsible for our Redis, and in my case it was a deployment system. At Meta and Instagram we definitely have those teams as well, teams that are responsible for the developer experience, which has the release tooling and so on. But we also have this other thing, which is the kind of role that I’m in, which we call product PE or product production engineering, and that’s when we don’t really align ourselves to a technical system, but we align ourselves to the success of a product instead.
Abi: That’s really interesting. So your team is aligned to the overall success of the product itself rather than the reliability of a specific component or a specific technology that’s shared across the organization, it sounds like.
Jack: Exactly.
Abi: I’m curious, as you’ve looked across other companies and are seeing the SRE role, I’m sure, evolve and the rise of platform teams and DevX and who knows what about DevOps, how do you see these different roles intersecting or moving away from one another? Where do you see the trend of all this going? Is it really different labels for the same type of work? Or do you really see a distinct set of skills and areas of focus as the industry goes forward?
Jack: I think, over the last 10 years, DevOps has become this big buzzword and small buzzword, and then it’s come through everything. I feel like one of the Google videos put it really well that SRE, production engineering, and all these roles are just implementations of an idea that we call DevOps, which in essence is how do we solve operations problems with code, which is another way of saying how do we build tooling and processes that can scale the ideas behind how do we operate well as an engineering team.
And I think the one thing that all really effective DevOps, SRE, production engineering teams have is not what they build, but it’s why they build it and being intentional about the systems that you create in understanding the kind of behaviors you enable. That, I think, ultimately is the essence of this role. And all the different implementations and things and tooling that teams use are a means to an end, and the end is how do we build the best system. And I think it all comes down, at the end of the day, whether you’re building the product or you’re building the tooling or you’re building the processes … At the end of the day, what you’re creating is the best experience for your users, which translates to the best experience for your company as well.
Abi: Well, I like that, and I like the focus on the big picture, and there’s more than one set of terms and approaches to ultimately achieve that goal, but it’s all in that pursuit of the ultimate goal of delivering great experiences to customers. Before we get into talking about some of the ways you’re doing that within your organization, I want to ask you one more question about the different working model with your team. I’m curious. When compared to your experience at Shopify, I presume there weren’t teams like yours that were focused at the product level rather than the component or service level. What are the biggest benefits you see to having a team like yours that is more broadly focused on the overall success of the product?
Jack: Yeah. That’s a very good question. I think, to me, there are upsides and downsides. I think one of the big upsides of having these product-focused teams is that you’re able to take patterns that product teams have and really go in and solve those problems. A lot of those problems are going to be very different than problems that we see in an overarching infra layer.
One of those examples could be, at an infra layer, we think about the success rate of systems as one big, let’s say, success rate. How many 500 class exceptions do we get? That is how we think about reliability. But at a product layer we think about the product differently, where just because we don’t have an exception doesn’t mean that a product is working correctly. And so we define success differently at different layers. And at the same time, we can have a very reliable system, but a product that might not actually be giving users the experience that the company intends. Maybe it’s slow. Or if you have a low quality phone, it’s going to crash all the time.
And, I think, when we go ahead and align ourselves with the product as production engineers, we have an opportunity to take the lens in the same way that we address server problems, but apply it to how do we think about this as what the user sees. And I think that the trade-off here though is that it’s really hard to be able to do this for every single product you own. Especially, I think it’s definitely a role that only makes sense when you’re at a certain level of scale. If you’re trying to do this and you have an up-and-coming product or something with very few users, you’re probably not going to get a really high return on your investment.
So even for us at Instagram, we only do this with our very high-priority services. And so Instagram Reels is a big enough focus right now that we’ve justified the investment, but it’s something that teams had to think carefully about, because when you put people in these areas, it means you’re taking away resources from building a more stable foundation. In our case here we feel like we have the scale where that makes sense. But when I was at Shopify, we were still, I think, at a point where there were enough foundational problems that it probably actually might have been the wrong call to have started something like this at that point.
Abi: That’s really interesting. And you said something that I think is so important in that, which is that … In my view, organizations care about the speed of the engineers and the quality of the products they deliver, and it seems like these days reliability has become synonymous with quality. But like you said, there’s so much more to the actual customer experience and user experience than just how many exceptions you have or even availability. I’d love to ask you, how do you define quality at Instagram Reels?
Jack: There’s a few ways we track this. One of the really big examples we have is in the machine-learning recommendations. Without going into too much detail, machine-learning pipelines are incredibly complex. I’m sure they’re complex basically at every single company. And so when you have a complicated system, pieces of it are going to fail. It’s just what happens when you have a lot of moving pieces. What we try to do is we make sure that if a single individual piece of the system fails, we aren’t just failing the entire recommendation system and that we have ways of being resilient to certain classes of breakages.
This is one of those cases where we’re able to cover up some of the failures, but at the same time, we’re not giving you the best experience we could possibly deliver. And I think this is an interesting piece to it, where it’s like we’re not failing, but we’re not a hundred percent successful either. And so we end up in this gradient where it’s like we’re mostly good, but not perfect, and I think these in-between states are very, very, very interesting, because just because we’re not failing doesn’t mean we’re doing the best job. And so we have to break things down into down, degraded, and successful, and being able to not just monitor cases that are down, but cases that are degraded as well and understand how do we shift everything as much as we can to that successful state.
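The three-state breakdown Jack describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `StageResult` type, the `critical` flag, and the stage names are hypothetical and are not taken from Instagram’s systems. The idea is that a failed critical stage means the request is down, a failed non-critical stage means it is degraded, and only a fully clean run counts as successful.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    DOWN = "down"
    DEGRADED = "degraded"
    SUCCESSFUL = "successful"


@dataclass
class StageResult:
    name: str       # hypothetical pipeline stage name
    ok: bool        # did this stage complete without error?
    critical: bool  # assumption: some stages have no usable fallback


def classify(stages):
    """Map per-stage results for one request onto down / degraded / successful."""
    # A failed critical stage means nothing useful can be served.
    if any(not s.ok and s.critical for s in stages):
        return Health.DOWN
    # A failed non-critical stage was covered by a fallback: served, but not the best result.
    if any(not s.ok for s in stages):
        return Health.DEGRADED
    return Health.SUCCESSFUL


if __name__ == "__main__":
    request = [
        StageResult("candidate_retrieval", ok=True, critical=True),
        StageResult("personalized_ranking", ok=False, critical=False),
    ]
    print(classify(request).value)
```

Counting the degraded bucket separately from the down bucket is what makes the in-between states visible; a plain success rate would report the request above as a success.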
Abi: What are some things you track outside of the systems themselves? Do you track the actual user experience itself and the time it takes for users to complete certain things? I’m curious how you really think about performance and reliability from the standpoint of your customers.
Jack: We definitely have very heavy monitoring systems, and we think a lot about what kind of interactions users are doing on the app. And what we try to do is understand what are the general patterns. Are there certain buttons that people are clicking? Are there certain buttons people are not? And what we do is try to look at this in aggregate in something we call engagement. And so when certain actions drop, that gives us a signal that, hey, we might have messed something up here. It could be that maybe we had a machine-learning model that went bad, or maybe we broke part of the app and we didn’t have the right level of monitoring to track it.
So we are very proactive about monitoring this, but we also treat this as a last resort, where when we can be proactive about how we monitor our interactions, we should have very clear signal on when things break, what broke. But when we don’t have those signals, we have to rely on this overarching safety net to tell us that, hey, there’s an interaction dropping, and it could be indicative of a broken experience. And so my team specifically cares a lot about this part and making sure that if something does break, we’re being proactive about it and not relying on users going on Twitter and telling us that something’s broken.
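As a rough illustration of that safety net, the sketch below compares per-action counts for the current window against a baseline and flags any action whose volume fell by more than a chosen fraction. The function name, the action names, and the 20% threshold are all assumptions made for the example; the interview does not describe how the real engagement monitoring is implemented.

```python
def dropped_actions(baseline, current, max_drop=0.2):
    """Return (action, relative_drop) pairs where volume fell by more than max_drop.

    baseline and current map an action name to a count for comparable time windows.
    """
    flagged = []
    for action, expected in baseline.items():
        if expected <= 0:
            continue  # no meaningful baseline to compare against
        observed = current.get(action, 0)
        drop = (expected - observed) / expected
        if drop > max_drop:
            flagged.append((action, drop))
    # Largest drops first, so the most suspicious action is at the top.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    baseline = {"like": 1000, "share": 200, "comment": 500}
    current = {"like": 980, "share": 90, "comment": 490}
    for action, drop in dropped_actions(baseline, current):
        print(f"{action}: down {drop:.0%}")
```

A check like this says that something changed, not what broke, which matches its role as a last resort behind more specific monitoring.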
Abi: I think this will be really helpful for listeners, because I think the takeaway here is you can’t really measure everything across all the systems with the precision that we might need and in the ways that we need to look at them. And using actual user interaction, user engagement with our products, looking at user behavior can also be a signal into whether something is down or maybe just degraded, because if something’s harder to do, people will do less of it in your product.
I’d love to move on to talking about some of the projects you’ve been working on at Instagram. I know one thing you’re really excited about is this new incident review process that you’ve designed and rolled out there. You also mentioned to me that you’d revamped the incident management process. So I want to start by asking … That sounds like a lot. Can you share a little bit about what prompted this and in what order you did these things?
Jack: Yeah. Absolutely. So Instagram itself has a number of review processes that happen in different areas focused on specific products or teams. And we also have a main Instagram central review where we review the highest priority incidents across the whole Instagram organization. When we started in the Instagram Reels space, one of the things that was really important for us was to get a vantage point into what were the highest priority items that the organization was dealing with today.
And so immediately, one of the things that we were very involved in spinning up was the Instagram Reels review process. This was something that we did in partnership with the Reels organization, and what we did here was made sure we spun up a forum for us to be able to talk about some of the highest priority incidents that were happening and to dive into why these things happened, how do we prevent similar things from happening in the future, and are there things that we can do to both detect the problem faster as well as drive the mitigation faster.
And we wanted to make sure that when these things were happening, we were learning from them and identifying the long-term trends and where we needed to invest in as part of our partnership to make sure that we had the right systems and processes so that these things don’t happen again in the future.
Abi: It’s interesting that the incident management process came out of the incident review process. I think that’s indicative of the value of having good incident review processes, and I want to talk more about that later. But I’m curious. You came into the organization, and you saw this opportunity. You mentioned getting people engaged in this was probably a challenge. How did you make the case early on for why people should buy into even the idea of making a change in this area?
Jack: I would have to give credit here to the amazing leadership that we have. They were people who identified some of the risks in reliability very early on. And so as we were spinning up this team, there was already a lot of talk and buy-in from the leadership level into this being a priority. So that definitely helped pave the way a lot into going into here and getting that initial interest.
One of the things that was really important for us coming into this space as production engineers was to make sure that our partners were aligned with our views on reliability, in that it wasn’t a trade-off against moving the product forward. One of the ways we were able to talk about this was by taking some of the incidents that we were seeing during the review process. Some of the incidents we were seeing were taking people weeks of effort and had really big effects on the overall product. Very soon we were able to communicate that it was like a penny saved is a penny earned situation, where by focusing on reliability, we can actually accelerate the organization by saving time and not having to sink so many resources into dealing with these incidents.
Abi: I think that’s such a great approach for making the business case for investment in reliability. I think, of course, it’s an easy case to make when the system’s going completely down all the time and you’re an e-commerce site or really any internet company that is losing money when your system is down. But I feel like in the degraded states and/or less visible down states, it can be a little more difficult to make the case for these types of things. And I think that’s so spot on that there’s a huge cost to the organization in terms of taking people away from feature work that they’re trying to do to address these incidents. And as an engineer myself, I know it’s not just being taken away from feature work. It can also just be the distraction across teams of just an incident happening, even if it’s not on your team specifically. Just incidents are just such a stressful thing for everybody. How did you present that to the organization or to leadership?
Jack: Yeah. I think you touched on a great point here, and I think it’s very easy for everyone to kind of be working along, and then something big happens and it’s almost like a car crash where everyone driving by just has to turn and look. And I think for us going into the organization, we had a lot of these cases as well where we had, you know, sometimes big fires and everyone wanted to look and see what’s going on, and in a lot of positive cases, try to help out as well.
And on the other hand, as we moved the product forward, what we discovered is that we had fewer and fewer of these really big fires. Instead we had a lot of very small problems. And one of the things that became kind of hard was how do you understand the urgency around some of the small problems and how much effort do you put into trying to actually catch these?
At Meta, we have this severity rating that we assign to our incidents, which we call SEVs. Raising the alarm and starting a SEV of any of these severities is something that can be scary. Sometimes people are afraid of making the wrong call and creating a SEV for something that might not actually be a real problem. And even when there are real problems, there’s paperwork and stuff that has to be done afterwards in terms of reporting and ownership that some people might feel a little bit afraid to do.
One of the ways that we try to empower these on-calls is to create a safe environment where people don’t get punished for getting things wrong, and we want to make sure that people feel supported in the process, that they’re not the only person who has to own these incidents. One of the ways that we try to help with this is to make sure that along with all of our on-call rotations, we also have a separate on-call rotation, which we specifically call IMOC, which stands for incident manager on-call.
And this on-call rotation is meant for more senior engineers on the team or managers. And the sole job of this on-call is to make sure that incidents are treated with the right level of urgency, and that people who are responding to on-call issues are receiving the support that they need. And so members of that rotation will be very familiar with our SEV guidelines, and understand how to assess the impact of the different SEVs that are happening and where they fall on the severity scale. Their job is also to make sure that on-calls are not left alone, and that if there’s escalation required to a different team or someone with specific expertise, they’re the ones picking up the phone and helping make sure that the right people are in the room.
They’re also here to make sure that if on-calls do get burned out, we have the right people filling in and that we’re not just relying on a single person on a Saturday afternoon to do all of the work. And so this is one of the ways we’re able to help scale this process and make sure that we have a sustainable way of driving solutions to problems.
Abi: Is there also just a challenge in observability of systems? And I know we talked about that earlier already, but is the challenge there in observability to actually capture the actual incidents that are occurring?
Jack: There’s definitely a very big challenge in observability of large systems. Part of the challenge is not really about capturing the actual incidents, but more about taking the data you capture about them and turning that into valuable insights. I think one of the things that happens at a higher level of scale is that you have millions of users, or hundreds of millions of users, or even billions of users, and there’s going to be people who use your product in unexpected ways. They can live in countries that your engineering team is not as familiar with. And as the product gets more sophisticated and you have more of these users, it becomes impossible to know what everyone is doing and experiencing at a given point in time.
At these levels of scale, if 1% of your users are doing something differently, that 1% is no longer a single user, but millions of people. And I think one of the things that’s important to recognize as an organization is that it actually is impossible to monitor everything, and that’s okay. And what you have to invest in instead is to be strategic in what you do monitor and have mechanisms of being able to know about the cases you don’t. Part of that is the engagement metrics we talked about earlier. How do we actually track interactions and what people are actually doing? And other parts of it could be how do we have safety nets through user reporting.
One of the things we found interesting is that we can track proxy metrics, where maybe we don’t know, when people upload a photo, if it’s always going to be exactly what they want every single time. But if we look at the number of times people delete photos, that can actually give us a clue if something is going wrong, because people all of a sudden are deleting a bunch of photos. So I think organizations … You do have to be a little creative at times to understand what people are doing. And people are smart. They’ll figure things out, and they’ll try their best to do what they can as well. But yeah. It’s definitely not possible to monitor every single case, but it is important to make sure you monitor the cases that are important.
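The photo-deletion example can be sketched as a simple ratio check. The code below is a minimal illustration, not a description of Meta’s tooling: it assumes per-window upload and delete counts are available, and the `factor` and `history` parameters are arbitrary values chosen for the example. A window is flagged when its delete-to-upload ratio is well above the median of the windows just before it.

```python
from statistics import median


def spike_windows(uploads, deletes, factor=2.0, history=3):
    """Return indexes of windows whose delete/upload ratio spikes above recent history.

    uploads and deletes are equal-length lists of counts per time window.
    """
    ratios = [d / u if u else 0.0 for u, d in zip(uploads, deletes)]
    flagged = []
    for i in range(history, len(ratios)):
        recent = median(ratios[i - history:i])  # typical ratio just before window i
        if recent > 0 and ratios[i] > factor * recent:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    uploads = [1000, 1000, 1000, 1000, 1000]
    deletes = [50, 52, 48, 51, 150]
    print(spike_windows(uploads, deletes))
```

Using a ratio rather than a raw delete count keeps the check from firing just because overall traffic went up.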
Abi: I love the way you explained that, because although computer programs tend to be relatively deterministic, although, of course, not at the scale and distributed nature of what we’re discussing here, but we forget about the other side of the equation, which is the actual users and ways in which they’re interacting with the systems which can lead to unpredictable events and failures. I want to go back to the original thread around the incident review process. You talked about how you were able to make the case to the business for why incidents and reliability in general are something important to focus on. What is the incident review process that you designed, and what went into that process of coming up with it?
Jack: So the actual incident review process is fairly simple. We will basically meet on a weekly basis, and every single week we’ll have a person who’s responsible for organizing and moderating the review, and then they will pull in … Usually on average we’ll do three incidents per week. The actual logistics in the review are pretty simple as well. We’ll have the moderator call on the presenters, they will present their SEVs, and then we will ask a bunch of questions.
I think the piece that makes this successful is how we actually do the review and how we ask the questions. One of the big things that’s important for any organization is to make sure that the incident review process is blameless. So any time an incident comes up for review, it’s never the fault of an individual person that an incident happened. And so what we try to do is try to understand what actually happened, what kind of interactions led to this breakage, and how did we handle it. Were there things that could have gone better in escalation? Did the person who did respond have the right support? And so what we try to do is build a very empathetic review process where we’re all working together.
And another piece that I think makes this really successful is bringing in buy-in from leadership as well. I think one of the failure modes is if we have reviews that are only the people involved and either you’re presenting to a single person or a committee or something. But instead we try to keep it as an open forum, and we try to bring in leadership as well to be there to understand not just what happened, but also where the weaknesses are in the system right now and how do we invest long term.
I think the last part here is actually probably where we’re finding a lot of success, which is that we’re not just looking at individual incidents and prescribing a followup and saying, “Hey. Yeah. Do this one thing, and you’re good.” But instead what we try to do is we try to take themes week over week over week. And each week we only have three incidents, so it makes it easy to keep track of. But we’ll pull in … And by we, I mean a SEV review committee of a few people, and we’ll meet every now and then to discuss, “Hey. Here’s some trends that we’re seeing. We seem to be having a lot of problems in this area related to this system.”
And then we’ll try to dig in, try to figure out why is that. Is there a fragile part of the system? Is there an overarching problem that we need to solve and invest in? And then what we’ll do is we’ll actually turn those things into roadmap items. And because we have leadership in the room, people who also have context over some problems, it makes it easy for us to be able to sell it as, “Hey. Here is a good big opportunity for us to address a root cause behind all of these breakages.”
So I think that feedback loop that we’re building is really what’s making this successful, and it’s less so about the actual things we’re doing, but it’s a mindset of, hey, things are breaking and we’re going to take it seriously and treat this as a priority that we need to solve, not just as a one-off, but as roadmap items, as investments that might take months. But this is, I think, ultimately how we move towards a better product.
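The week-over-week theme tracking described here can be approximated with a simple tally. This sketch assumes each reviewed incident is recorded with a list of the systems it touched; the record shape, the system names, and the `min_count` threshold are hypothetical and only illustrate the idea of surfacing systems that keep showing up across reviews.

```python
from collections import Counter


def recurring_themes(incidents, min_count=3):
    """Return (system, incident_count) pairs for systems seen in at least min_count incidents.

    Each incident is a dict with a "systems" list; a system counts once per incident.
    """
    counts = Counter(
        system
        for incident in incidents
        for system in set(incident["systems"])
    )
    return [(system, n) for system, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    reviewed = [
        {"id": "week1-a", "systems": ["upload", "ranking"]},
        {"id": "week1-b", "systems": ["ranking"]},
        {"id": "week2-a", "systems": ["notifications"]},
        {"id": "week3-a", "systems": ["ranking", "upload"]},
    ]
    print(recurring_themes(reviewed))
```

The output is a starting point for the committee discussion, since a system that appears often may simply be the most heavily used one rather than the most fragile.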
Abi: I really like how you mentioned you bring in leadership to spectate or be involved in these sessions, because it brings that visibility we were talking about earlier into the behind the scenes or less visible problems and challenges that the engineering organization is facing. You talked about the importance of having a blameless culture and an empathetic process. I wanted to ask you just to clarify. You talked about having a committee, and there’s a moderator. Can you clarify the individuals who are actually in these different roles and where they sit in the organization, just to understand … Is the committee people with bigger job titles than the people presenting in these sessions? Who are the people involved in this process, if you could clarify a little bit more?
Jack: I can definitely do that. So the committee is something that we constantly rotate people in and out of. And the SEV review committee is a group of people who care a lot about the reliability of the product. There’s definitely no level requirement or tenure requirement to be on the committee. In fact, a lot of the times we’re taking people who are very active voices in reliability and saying, “Hey. You seem to care a lot about this. Do you want to get involved?” and just bring them in there.
The initial form that we took on with this was we’d just have a group of people who care a lot about the product and just show up and be there to serve as a backbone of asking questions and being able to have the context. And parallel to this, we also have a sub part of the committee, which is made up of more senior engineers, where we actually organize the different SEV reviews. This is just a higher commitment role that some people have decided to take on, but it’s definitely not a hard requirement that anybody has to be a certain level or anything like that to do this. And this stuff combined is the logistics in how we actually run it.
Abi: You’ve already touched on this a little bit in a few different areas here, but, I think, if you’re a leader, everyone understands the importance of incident management. Right? When there’s an incident, you want to get that thing fixed as quickly as possible. It’s extremely painful, stressful for everybody. What is the main benefit of the incident reviews? Because I’m guessing, as it was at Instagram, that that’s not an area of the process that organizations are as focused on. So what are the biggest benefits? What are the things organizations are missing out on if they’re not investing in that sort of a process?
Jack: So, I think, for organizations the big value of having leadership in the room is to set the tone. I think that is something that could be underestimated, where someone in another position might think, “Oh. Well, I don’t really know what’s happening at that level, so there’s no value for me to attend.” But the reality is just by being in the room, it helps set the tone to everyone else that, hey, this is really, really important, and I think that is one of the biggest things. And it’s a bonus that they do ask really good questions. But that is in itself a really, really high value add.
And I think the one piece that I think is really important for this is that … It kind of goes back to the point we were talking about earlier, which is it’s very easy for organizations to see working on reliability as a trade-off of, hey, it’s a time sink that we’re not spending on building the product. I think, by having leadership in the room, it helps set the tone that this is just as important as building products. And what we see overwhelmingly happen is as this happens, teams start to realize that, hey, we can improve the product through reliability as well.
We’ve had opportunities where we’ve taken product services that had a big noise floor of failures, and as we addressed those, we’ve noticed that, hey, the engagement on those services went up as well, because more and more people are being successful and the actions are completing and coming back and doing more of those actions. So I think we’re starting to see that reliability is not just how do people feel about the app, but it’s also a big business win when we’re able to take the work that we’re doing and amplify the impact of it.
Abi: Why is it so important that leaders be present in these meetings? Okay. So that’s the question that … So the original question was just, what’s the real value of just doing incident reviews in general? What’s the biggest benefit that’s come out of doing these outside of the leadership stuff that you just talked about?
Jack: I think the main benefit for sev review in general is to make sure that we’re not having the same incident happen twice. I think one of the ways that setting goals on incidents can fail is if the team hyper focuses on the number of incidents, because what this does, like we talked about earlier, is it encourages people to stop reporting on things that happen.
But the other flip side to that is if you have too many incidents, that in itself is not really a problem, but the problem is if the reason why you have too many incidents is because you’re repeating things that already happened. If you have a certain service that just keeps breaking, that’s probably something that the team should act on. And unless the incidents are actually being reviewed, it’s really hard to have awareness that this is actually happening. And so I think the value of incident reviews for us is it’s really good at helping us surface the things that break and making sure that we’re addressing the root causes before they’re able to repeat.
One of the metrics that we track as part of our incident reviews is the number of incidents we reviewed within 30 days, and this is a percentage that we’ll calculate on a rolling basis. And what we want to make sure is that we’re not just reviewing the incidents, but we’re doing so on a timely basis. In fact, what happens sometimes is we’ll have incidents that happened in the past and we just decide, “Hey. You know what? This happened too long ago. We’re just not going to look at it. Instead, we’re going to look at the things that happened recently.”
Now, ideally we review everything, but if we had to choose, we would go for recency, and the reason is because when things go unreviewed or unaddressed, it’s very likely that they’ll happen again. And so what we want to do is be recent, be relevant, and make sure that the cases we’re seeing today we are preventing and not repeating. So I think that is ultimately the goal is things happening once is actually good. Things happening twice, not so good.
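The "reviewed within 30 days" percentage Jack describes could be computed roughly as follows. This is a minimal sketch, not the team's actual tooling: the `Incident` record, the 90-day rolling window, and the choice to count unreviewed incidents as misses are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record: when it was opened and when it was reviewed."""
    opened: date
    reviewed: Optional[date] = None  # None means the incident has not been reviewed


def percent_reviewed_on_time(
    incidents: List[Incident],
    today: date,
    window_days: int = 90,  # assumed length of the rolling window
    sla_days: int = 30,     # the 30-day target mentioned in the interview
) -> Optional[float]:
    """Percent of incidents opened in the rolling window that were reviewed within sla_days.

    Unreviewed incidents count as misses. Returns None if the window is empty.
    """
    window_start = today - timedelta(days=window_days)
    in_window = [i for i in incidents if window_start <= i.opened <= today]
    if not in_window:
        return None
    on_time = [
        i for i in in_window
        if i.reviewed is not None and (i.reviewed - i.opened).days <= sla_days
    ]
    return 100.0 * len(on_time) / len(in_window)
```

Because the metric is a percentage over a moving window rather than a raw count, it rewards timely review without penalizing teams for reporting more incidents.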
Abi: As you’re describing, it’s so important to be thoughtful about the metrics we’re using to drive reliability improvements and steer the organization in the right ways. I love the way you thoughtfully explained some of the dangers of looking at number of incidents, which is, of course, such a common go-to reliability metric, and pointing out that that can incentivize the wrong behaviors, and I love the different approaches you shared there, so I think that’ll be really helpful to listeners.
So I want to shift the conversation a little bit and talk about tooling. I know you mentioned to me that one of the responsibilities of your team is to build tooling that you see as missing to help drive these reliability and performance goals. I want to ask about how did your team actually get into that business of building tools, when you mentioned earlier that’s not really your team’s primary focus.
Jack: So our team’s mandate is fairly high level, where our goal is to … We call it enable success of the organization, that is Reels, and ultimately ship the best product we can. To us, I think building tooling is one of the ways that we do that. I think, coming back to what we talked about earlier, our team is not a team that’s aligned to a specific technical piece. We don’t really have a tool that we own or a service that we run, but we’re aligned to the product. And so what that means for us is we can go ahead and partner with different teams or spin up initiatives on our own in different domains to enable that success. And one of the ways that we’ve done this is we go back to the teams that build the tools and we work with them to develop improvements that turn into ways of making the product more successful.
Abi: Can you share a little bit more about the types of tools that your team has worked on? And I know merge queues is one thing particularly we want to go deep on, but maybe give an overview of the different types of projects and tooling that your team focuses on.
Jack: Yeah. A lot of the work that we do or we have done recently is in the area of what we call release safety. I think this is something that is not specific to Instagram or Meta at all. A lot of companies probably see this, which is a lot of the breakages come from when things change, specifically code changes, configuration changes. These are really common ways for us to break. If we don’t go ahead and ship any code, we’ll probably have a lot better reliability, but then our product won’t actually progress. And I think what this creates is a balancing act where, do you go and heavily optimize towards velocity and then just ship a bunch of stuff but then probably also break a lot of stuff in the process, or do you go the other way and say, “We’re going to be super, super reliable and ship very, very slowly”?
I think the reality of this is that it’s a very, very delicate balancing act. And I think historically Meta, previously Facebook, had a mantra of move fast, break things. And I think what we noticed throughout the years was we have to balance this a bit better with how we approach reliability as well. And so a lot of the work that we’re doing on our team is figuring out what the right balance is. But the thing about this balance is that it’s not a straight trade-off where if you just do one, you lose on the other. And the reason why I say that is because I think improving tooling could be an enabler of both sides, where you can both ship things more reliably but also ship things faster as well.
One of the investments I can share that we’ve done a lot about recently was making sure that we’re enabling a way to run unit tests on both the code changes that we’re making as well as when we make configuration changes through other services. So some of those changes could be flipping what we call a knob in a UI somewhere that changes the way code paths work. So what we do actually is we’ll actually go ahead and take the unit test we have in the code base and simulate it against running that knob turned to the new value and seeing how that changes the behavior. So what we do with this is we’re not really compromising on the speed of release very much, but we’re adding a lot of safety in the process. And so a lot of what we’re doing is investing in tooling in this way so that we can enable this balancing act between velocity and reliability, and tooling is one of the best ways that we can scale these.
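The idea of replaying existing unit tests against a proposed knob value can be sketched like this. Everything here is hypothetical: the in-memory `KNOBS` store, the `feed_page_size` knob, and the helper names are invented for illustration and say nothing about how Meta's configuration system actually works.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a live configuration service.
KNOBS = {"feed_page_size": 10}


def get_knob(name):
    return KNOBS[name]


@contextmanager
def knob_override(name, value):
    """Temporarily simulate a knob at a new value, restoring it afterwards."""
    previous = KNOBS[name]
    KNOBS[name] = value
    try:
        yield
    finally:
        KNOBS[name] = previous


def paginate(items):
    """Product code whose behavior depends on the knob."""
    size = get_knob("feed_page_size")
    return [items[i:i + size] for i in range(0, len(items), size)]


def test_every_item_appears_once():
    """An existing unit test from the code base."""
    items = list(range(25))
    pages = paginate(items)
    assert [x for page in pages for x in page] == items


def check_knob_change(name, new_value, tests):
    """Run the given tests with the knob set to new_value; return names of failing tests."""
    failures = []
    with knob_override(name, new_value):
        for test in tests:
            try:
                test()
            except Exception:
                failures.append(test.__name__)
    return failures
```

In this sketch, `check_knob_change("feed_page_size", 5, [test_every_item_appears_once])` reports no failures, while proposing a value of `0` is flagged before it reaches production because `paginate` can no longer step through the list.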
Abi: It’s funny you brought up the move fast and break things mantra, because there’s perhaps not a more invigorating yet at the same time terrifying motto than move fast and break things. And it’s really awesome to hear the way you’re really trying to cater to both of those goals at the same time. The merge queue, which we’re about to talk about, is another way in which I know you’ve been able to do that. For listeners who aren’t familiar with the general concept of a merge queue, can you just quickly explain what is a merge queue and who needs to have one?
Jack: Yeah. So I’ll give background on merge queues a little bit. So the merge queue, at least the one I’m going to talk about, was something we built at Shopify to scale our release process there, and the way it works is instead of going ahead and merging to master before we do our release, we basically try to push the merge step as late as possible. And it sounds like a really weird thing, but what it basically lets us do is when developers try to make their changes, we go ahead and we queue up their changes in an imaginary queue. And then what we do is basically we will run a bunch of tests against each individual piece to figure out, okay, if we go ahead and release this, where is this going to break? And only when we’re confident that a revision is good do we take our master branch and point it to that point.
So what this gives us is it basically keeps our merges very, very malleable, where we can go ahead and remove any change that we think is bad at any given point in time without affecting production too much. And it also gives us an ability to keep master green, but to an extreme scale, where it’s not just tests passing, but we can do things like run canary systems as well. So this strategy in a way gives us the best of both worlds because it allows us to release things faster by queuing up changes and making sure we reduce idle time, but it also gives us a lot of safety because we can ensure that we have extra coverage so that when things do end up in master and ship to production, they’re in the best state that we can guarantee them to be at that point in time.
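The core loop of a merge queue, as described above, can be sketched in a few lines. This is a simplified, serial illustration under stated assumptions, not Shopify's implementation: changes are plain string ids, and `passes` is a hypothetical stand-in for whatever CI and canary checks run against a candidate revision.

```python
from typing import Callable, List, Tuple


def process_queue(
    master: List[str],
    queue: List[str],
    passes: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Advance master only to revisions that have been verified.

    master: change ids already landed, in order.
    queue:  pending change ids, in the order they were queued.
    passes: hypothetical check (tests, canaries) run on a candidate revision.
    Returns the new master and the list of ejected changes.
    """
    ejected = []
    for change in queue:
        candidate = master + [change]  # the revision master would point to
        if passes(candidate):
            master = candidate         # move the master pointer forward
        else:
            ejected.append(change)     # drop the change; master is untouched
    return master, ejected


# Example check: changes "B" and "D" each work alone but conflict together.
def no_b_d_conflict(revision: List[str]) -> bool:
    return not {"B", "D"} <= set(revision)


new_master, ejected = process_queue(["A"], ["B", "C", "D", "E"], no_b_d_conflict)
```

Here `D` is ejected without ever landing, so `C` and `E` still ship and master never points at a failing revision. A production queue would typically test several candidates in parallel rather than one at a time.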
Abi: For people listening to this, can you clarify a little bit more the difference between a merge queue and just regular old CI/CD and a feature branch workflow? Right? Because with a typical workflow with feature branching, you would, before merging, run your checks against your feature branch and then merge, whereas with the merge queue, your change doesn’t get merged. It actually goes into a queue and then later gets merged automatically. Can you explain why that queue part is necessary in this process?
Jack: Any time you have a system and you have a bunch of engineers working on the same system, there are basically two integration steps that have to run on every change. The first one is let’s say I’m working on GitHub. I put up a pull request with my code changes. I’m going to have continuous integration tests that run on my pull request, and at this point I’ll get a pass-fail signal of, “Hey. You passed the test. You’re good to merge.” The regular pattern is I press merge, it goes ahead and it gets merged into master, and everyone else does this as well, and then we end up with 10 changes on there.
Now, one of the things that can happen is that I make a change, and then you make a change, but we end up actually colliding. Some cases where this happens is let’s say we’re both adding to an array and we expect the array to be size 10, so we both start off size nine, both added something, and it should be size 10, but it’s actually now size 11. And so this is the case where things can fail, and depending on how critical this is, this could be a really, really bad breakage.
So what we can do is we can run some tests. We can go ahead and after we merged our changes to master, we can run the test again to make sure that we’re still passing. And if we’re all still passing, then great. Typically what happens in a continuous deployment framework is we will take the passing signal on master and we can deploy that revision and we’re good to go.
Now, one of the things that happens is the term breaking the trunk build, or you broke the trunk, you broke the build. And I’ve been at companies where, hey, if you broke the build, you got to bring donuts the next day. And the reason why this happens is because something like this happens, and I make a change, and you make the change, we’re trying to … We have a unit test that says, “Hey. Is the array still size 10?” But now it’s size 11 and it fails. And now what happens is we can’t release, because everything from that onwards is now failing.
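The array collision Jack describes is easy to reproduce. In this small sketch (the function names are invented for illustration), each change passes the size check on its own branch, but the trunk fails once both have merged.

```python
# Shared starting point: the array has nine entries on master.
BASE = ["item%d" % i for i in range(9)]


def my_change(items):
    """My pull request: add one entry."""
    return items + ["mine"]


def your_change(items):
    """Your pull request: add a different entry."""
    return items + ["yours"]


def size_check(items):
    """The unit test: is the array still size 10?"""
    return len(items) == 10


mine_alone = size_check(my_change(BASE))                    # passes on my branch
yours_alone = size_check(your_change(BASE))                 # passes on your branch
both_merged = size_check(your_change(my_change(BASE)))      # fails on master
```

Neither pull request touches the other's lines, so there is no textual merge conflict to warn anyone. The failure only appears when the tests run again on the combined result.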
And this is a really frustrating experience, because in order to fix this, you can’t just go ahead and take your change out, because once you’re on master, we can’t change it. It’s now permanently there. What you have to do instead is ship a revert of your change that goes on top of everyone else’s changes. So if you imagine you ship your change at like 9:00 AM, and then over three hours people have shipped another 30 changes, and then you notice your thing’s broken, you now have your change, three hours of changes, and then your revert, and you can’t deploy anything in the middle. You have to ship everything in the middle with your revert or else nothing is going to work.
So at this point, instead of having a continuous deployment cycle that is doing one or two changes at a time, you are now deploying 30 untested changes, which drastically increases the risk of your next deployment. And also if something does go wrong, you now have to triage through 30 different changes to figure out where the breakage is coming from.
So the merge queue kind of tries to alleviate this by making sure that we have all the tests passing before we do the deployment. So that way if you and I have a conflict, we can go ahead and just remove one of our changes and then have the other 30 changes sit on top of it instead of having to ship your revert on top of the 30 changes. So that is a very long explanation, but that is how the merge queue changes the model.
Abi: That was impressive. I have to say thank you for that excellent overview of what a merge queue is, and I think you, within that explanation, really covered a lot of the benefits and what it sets out to solve. When you deployed this at Shopify, was this really just focused on the … I’m assuming kind of like a monolithic code base that was heavily trafficked with lots of developers. Or was this something you guys just implemented across all code bases and repositories?
Jack: In the very beginning our target for this was the Shopify monolith. So there’s a lot of blog posts out there, but we had a very big monolith that probably had around a thousand engineers working concurrently on it at any point in time. So definitely we see a lot of changes, and it was very, very high impact. But when we built out the solution for it, we wanted it to be generalized so that we had the option of turning it on for any repository at Shopify. And the reason why we did this was because it was something that was definitely very valuable when you had a lot of changes, but the second part that’s really valuable is it makes sure that people don’t just break master on a small repo and never notice it.
We noticed this a few times on random repositories that someone will make a change, they would merge it and forget about it, but what ended up happening was that change was actually broken. And best-case scenario it never deploys and no one ever notices. But worst-case scenario is it deploys and now it’s out in the wild, it’s breaking things, when it’s something we could have prevented. So we tried to keep it open to both scenarios. I remember when I left the company about two years ago we had onboarded the main Shopify monolith, and we had also turned it on for what we called Shopify Web, which is a web front end for Shopify.
Abi: You mentioned that one of the benefits of the merge queue is alerting the developer when the change that they thought was good actually breaks the trunk build. How does that actually work? Is it a Slack bot? Or is it an email they get? What was your implementation of this at Shopify?
Jack: I’m trying to think back now. I think at the time we used a Slack bot for communication. At the time at Shopify almost all of our operations were done through Slack, and I think the piece that was really nice about it was we would tell the engineer, “Hey. We backed out your change,” and there wasn’t an immediate action the person had to take. In contrast, before we had the merge queue, if someone had broken the build, either there’s someone who’s confident enough to go ahead and revert their change, or we had to message the person and say, “Hey. Did you break it? Can you please revert it?” and it adds a layer of having someone respond. And especially if that person wasn’t on call and they’ve gone home for the day, there’s a chance they might not respond and you’re taking a risk by reverting their change. And without context, it’s really hard to know how safe it is and what kind of risk you’re introducing by doing that.
Abi: For people listening to this, when’s the right time to implement a merge queue? Is it something that should just be the standard workflow for any feature branch workflow? Or would you advise waiting until you start to actually run into these types of collisions and developers tripping over each other’s changes?
Jack: I would say the landscape today is that you probably don’t need to implement your own, which is a great thing. When we were setting out to do this, one of our close development partners was actually GitHub, and they were very interested to see how we were doing this. And I know that in the last year or two they actually released their own version of a merge queue that GitHub users can turn on for their repos. There are also other companies who have done similar things, similar integrations, that make it easy to have a lot of this functionality built for you.
So I would say that I don’t think there’s a point where anyone has to turn this on, but, I think, once you need it, you’ll know, because you’ll be hurting from a lot of the problems. You’ll be suffering from the problems of a broken build. And the reality is every single organization will have a different effect when this happens. Some organizations, this is something that cannot happen, because let’s say you’re doing a compliance backend. You might always need a way to be able to release something quickly, because if a requirement changes, something breaks, you need to have a very clear path to recovery, and so something like this might be very useful, versus if you have a repo that has different requirements, maybe a broken build is actually okay. Maybe you’re building a game where you have weekly releases. Maybe this is something that is less important.
So I think every single team has to make their own assessment of whether it’s something that’s necessary, because it’s not free either. There is a very high cost to taking something like this on. The way you operate is … It’s going to be very different. And there are cases that make it actually worse. Let’s say something breaks. Every single time you have a merge queue and one of the items gets ejected, everything else has to now restart the testing process, and so that’s a big drain on resources. But also if your testing infrastructure is too flaky, you might end up ejecting a bunch of changes that you didn’t intend to, and now your developers are losing productivity because their changes can never get in. So I think it’s definitely something that can be very valuable if used right, but it’s not a silver bullet.
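The restart cost Jack mentions can be made concrete with a rough model. This sketch assumes a speculative queue where every pending change is tested on top of the changes ahead of it, and where a failing change fails on its own account (in a flaky system, `bad` would also include changes that did nothing wrong). It is an illustration of the cost, not a description of any particular merge queue product.

```python
from typing import List, Set, Tuple


def count_test_runs(queue: List[int], bad: Set[int]) -> Tuple[int, List[int]]:
    """Count test runs needed to drain the queue, and list what lands.

    queue: change ids in queue order.
    bad:   ids whose test run fails (real breakage or a flaky failure).
    When a change is ejected, everything behind it must be retested,
    because those runs were built on top of the ejected change.
    """
    runs = 0
    landed: List[int] = []
    pending = list(queue)
    while pending:
        runs += len(pending)  # one speculative run per pending change
        first_bad = next((i for i, c in enumerate(pending) if c in bad), None)
        if first_bad is None:
            landed.extend(pending)
            pending = []
        else:
            landed.extend(pending[:first_bad])  # changes ahead of the failure land
            pending = pending[first_bad + 1:]   # the rest restart testing
    return runs, landed


clean_runs, _ = count_test_runs(list(range(10)), bad=set())
one_eject_runs, _ = count_test_runs(list(range(10)), bad={2})
two_eject_runs, _ = count_test_runs(list(range(10)), bad={2, 5})
```

In this model a clean queue of ten changes costs ten runs, one ejection near the front raises that to seventeen, and a second ejection raises it to twenty-one, which is why flaky tests make a merge queue expensive.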
Abi: It’s funny you brought up GitHub, because I was actually working at GitHub during the time where we were working on our monolith and its merge queue, and you’re right that we ended up dog-fooding our own product and eventually turning it into a product. I haven’t tried that merge queue feature on GitHub, but I’m sure that’s helpful for listeners to know that that’s something they can go and check out.
Jack, it’s been awesome talking to you today and diving super deep into reliability and the thoughtful ways that you’re approaching it both at Meta and previously at Shopify. I really appreciate you coming on the show. Thanks so much.
Jack: Thank you again for having me.