Podcast

How and why GitHub’s developer experience team has evolved

Liz Saling, Director of Engineering at GitHub, shares the story of how the Developer Experience group was founded and why GitHub paused feature work for a quarter to focus on developer experience improvements.

Transcript

Tell us a little about your current role and the team you lead. 

I’m a Director of Engineering at GitHub, and am part of the Developer Experience organization. My group focuses on validation environments and testing procedures — basically the entire inner loop from builds, artifact storage, and testing, to getting things ready to ship to all of our environments. 

I also lead the Ruby architecture team, so we do a lot of work specifically focused on performance and best practices. 

As I understand, your team is part of a larger group at GitHub. Could you just share a bit more about where your team sits in the organization? 

We’re part of the Platform and Enterprise organization, which is largely responsible for the infrastructure that GitHub.com runs on. It’s responsible for the platform that we ship on-premises to many customers, which is our Enterprise Server, and also our new hosted solution, which is called GHAE. 

The Developer Experience organization is the group that’s in charge of the entire development workflow: from the frameworks we use to code and the environments that we code on, all the way through the building, testing, and deploying to all of those environments. 

We’re also interacting with our security organization to make sure our processes are secure. We’re basically responsible for what all of our GitHubbers do to build and deploy our platform to all our customers.

What is the mission of your team?

We want to show the world how to build software using GitHub. And we want to be the best example of how to use our platform. 

What I heard there was, you’re not only an example within GitHub of how to use GitHub and ship software effectively, but also potentially an example for customers outside of GitHub.

Exactly. I often say to my teams: imagine our next blog post that goes out to the world. What do we want to tell everyone? We want to show the patterns others should follow. We’ve proven them; this is what we do. 

But it’s been a long and interesting journey to get to this point.

Let’s get into that next. How was your team founded and what did that origin story look like? 

I’ll back up to about a year ago. It’s been quite a transformation. We were a smaller organization then than we are today — in fact, we were only one set of three teams at the time. (Now we’re three massive organizations that have tripled in size, if not quadrupled.) We were largely focused on the GitHub.com platform and had other teams that covered our enterprise solutions. 

So we started collaborating much more closely with the teams that support our entire platform. We were talking more with the folks that build our client applications with our APIs and our mobile platform. That was all in an effort to scale tremendously, and to support the efficiency, effectiveness, and fun that we have as GitHubbers developing our platform. 

What was the driving impetus around even creating this team at all?

We did a project in the summer of 2020. We were coming out of some reviews of what our reliability and our availability looked like, and noticed that we had to do better for ourselves and for our customers. 

Part of that evaluation was taking a close look at the DORA metrics. These are the basic metrics that you track to understand the efficiency of your engineering organization at large. And they’re used as a planning tool to evaluate where we want to invest. 
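
Note: DORA here refers to the four key metrics from the DevOps Research and Assessment program: deployment frequency, lead time for changes, change failure rate, and time to restore service. As a rough illustration only (not GitHub’s actual tooling), here is a minimal Python sketch that derives two of them from hypothetical deployment records:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: (commit_time, deploy_time, caused_failure)
deploys = [
    (datetime(2021, 6, 1, 9, 0), datetime(2021, 6, 1, 11, 30), False),
    (datetime(2021, 6, 1, 14, 0), datetime(2021, 6, 2, 10, 0), True),
    (datetime(2021, 6, 3, 8, 0), datetime(2021, 6, 3, 9, 15), False),
]

# Lead time for changes: how long a commit takes to reach production.
lead_times = [deploy - commit for commit, deploy, _ in deploys]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change failure rate: share of deployments that caused a failure.
failure_rate = sum(failed for _, _, failed in deploys) / len(deploys)

print(f"average lead time: {avg_lead_time}, change failure rate: {failure_rate:.0%}")
```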

So we were working with Nicole Forsgren, and between the project that we spun up to address some of the availability and reliability issues and a deeper investigation on the efficiency of our engineering organization, we decided to take an entire summer to pause major parts of our roadmap to focus on these things. Almost our entire feature development stopped during that time. 

Today we’re focused on investing in what we call our “fundamental health measures” and factoring those back into planning. But that initial investment was the beginning of this journey. 

How was it decided that it was in GitHub’s best interest to stop shipping features for an entire quarter?  

There were definitely a lot of alignment conversations, especially between Product and Engineering, but the decision came down to recognizing the business impact those efficiencies would have for customers in the long run. And honestly, for the retention and satisfaction of employees too. 

It was really a big partnership between Product and Engineering to not only come together to identify that this investment of time would be valuable, but also in deciding what work we were going to prioritize and how. The teams also had to decide what exceptions had to be made to keep things running and how to communicate to our customers the commitments we were making. 

Without a tight partnership between Product and Engineering, it never would have been possible to even launch that initiative. 

You mentioned looking at DORA metrics and doing a reliability track, what were some of the things that came out as priorities to focus on? 

One of the big things that we focused on, interestingly, was documentation. Having good information that was easy to find. 

And I can’t say we’re doing it perfectly yet, and we continue to iterate on it, but we made massive improvements to the source of engineering information we follow. 

So that was super interesting, that information storage and maintenance needed to be invested in, and that insight also came from doing a listening tour with surveys and internal voting exercises. 

Another area we focused on was the length of time it took to run tests. We wanted to shave down the build and CI times. So a couple of the workstreams we kicked off were related to the enterprise tests that sometimes took hours to run: instead of running those on every build, let’s pull them out and run them nightly, with a mechanism where, if failures don’t get picked up and solved, we stop the pipeline as a forcing function. 
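
Note: the workstream described above amounts to splitting the test suite by trigger. A minimal Python sketch of that idea follows; the suite names and the NIGHTLY flag are hypothetical, not GitHub’s actual CI configuration:

```python
import os
import sys

# Hypothetical split: fast suites run on every build; the slow
# enterprise suites only run on the nightly scheduled build.
FAST_SUITES = ["unit", "integration"]
NIGHTLY_SUITES = ["enterprise-upgrade", "enterprise-migration"]

def suites_for_run(nightly: bool) -> list[str]:
    return FAST_SUITES + NIGHTLY_SUITES if nightly else FAST_SUITES

def run_suite(name: str) -> bool:
    # Placeholder for the real test runner.
    print(f"running {name} tests")
    return True

def main() -> None:
    # Assumes the CI system sets NIGHTLY=1 on the scheduled run.
    nightly = os.environ.get("NIGHTLY") == "1"
    for suite in suites_for_run(nightly):
        if not run_suite(suite):
            # A failing suite acts as a forcing function: the pipeline
            # stops until someone picks the failure up and solves it.
            sys.exit(1)

if __name__ == "__main__":
    main()
```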

We also looked at other job times and how they ran: whether they’d be restartable, resumable, and idempotent.
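
Note: “idempotent” here means a job can be re-run safely without duplicating work. A minimal Python sketch of a restartable, idempotent step might look like the following; the artifact-store layout is made up for the example:

```python
from pathlib import Path

def upload_artifact(build_id: str, artifact: Path, store: Path) -> None:
    """Idempotent upload: re-running the job after a crash or restart
    skips artifacts that were already stored for this build."""
    destination = store / build_id / artifact.name
    if destination.exists():
        print(f"skipping {artifact.name}: already uploaded")
        return
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(artifact.read_bytes())
    print(f"uploaded {artifact.name}")
```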

Ultimately we narrowed it down to 20 workstreams we could focus on (and I think our original list had 40-50 options), and went through those to identify the ones that would potentially have the highest impact on our availability, reliability, and internal developer experience. 

Note: The GitHub team has written about a few of the different workstreams they’ve focused on in their Building GitHub series: Introduction, Docs, CI Workflow, On-call.                  

One part of the process I loved was, once we narrowed the list down to 20, we went across engineering teams and found people to lead each one and decide how we were going to work on them. Also worth mentioning, we told these people that if we started investing in an area and it wasn’t having the impact we expected, we’d want to hear that feedback so we can make sure to focus on something more impactful. 

An example of how a priority shifted: we had been deploying in a train system for a long time, and one of the workstreams we chose was to look at the next version of how we could do deployments. The folks that worked on that were like, “Yeah, we hear you. You want a plan, and you want us to try some things and do a proof of concept, but we really think we could take it further.”

And what that ended up creating was a merge queue that we’re now running today, and looking to launch. So that’s been tremendously exciting and that all started from that summer.

I’d love to come back to the merge queue but first, you mentioned documentation. Somewhat surprisingly, that’s been a theme we’re hearing from a lot of other companies we’ve been speaking with. What did you learn when you gathered feedback about documentation, and what did you do tactically to make improvements? 

It fell into a few different themes, but the general overarching theme was that we hadn’t established a single source of engineering information.

Some teams were keeping information in their team’s repositories… Even within the GitHub platform, we have MD files right within the code paths, we have Wiki pages, we have posts… We have so many different options but we didn’t have one central location for all of this. So what we did was establish a solid presence on our intranet for engineering documentation and established that as the source of truth. We started pulling things out of individual repositories, Google drives, and everywhere else so that we had one common searchable location. 

How do you actually manage the process of creating and maintaining documentation? Is that something that’s centralized or delegated, or is it a combination of both? I’m curious who actually oversees that. 

We have teams that handle our outward-facing documentation. But for our internal use, it is up to every team to create and manage their documentation. Our standard best practices include that, with every feature you develop or piece of infrastructure you support, you provide that documentation in a central, easily discoverable location.

In fact, one of the hallmarks of working at GitHub, I would say, is the way that we handle communication and information. And, again, I’m not going to say that we do it perfectly. But we do everything as transparently as possible, where people can discover that information as the conversations are happening… We try to make sure we’re really good about capturing summaries and recording them in issues where people can find them. 

So we don’t leave people hanging, let’s hear about the merge queue. 

Yes. So the mechanism, again, that we were using was a train where we would lump pull requests together into a special pull request, which served as the bundle that would then get picked up and moved through our deployment mechanism as a batch.

It was very tied with the system that did the deployments, and required somebody to watch over it very carefully. Somebody had to be the “train conductor” if you will. And you know, that was always nerve-racking. 

But all of the final building that was happening in that train is now, with the merge queue, completely decoupled. The grouping of pull requests, the reviews, and such can now happen outside of that main deployment mechanism. So yeah, this has been a big customer feature request for a long time. 

This is being shipped to customers but I believe it’s in limited beta now. Again, our practice is to make sure that we’re using things and proving them first, and then shipping to customers. So the merge queue was an excellent example of it coming together for us. 
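
Note: conceptually, a merge queue validates each pull request against the combined state of everything queued ahead of it before anything merges, which is what decouples the grouping from the deployment mechanism. The following is a simplified Python sketch of that idea, not the implementation GitHub ships; the PR ids and the ci_passes callback are made up:

```python
from dataclasses import dataclass, field

@dataclass
class MergeQueue:
    main: list[str] = field(default_factory=list)   # merged PR ids
    queue: list[str] = field(default_factory=list)  # PRs waiting to merge

    def enqueue(self, pr_id: str) -> None:
        self.queue.append(pr_id)

    def process(self, ci_passes) -> None:
        # Each candidate is tested against main plus everything already
        # merged ahead of it, so a green run means the merge itself is safe.
        while self.queue:
            pr_id = self.queue.pop(0)
            candidate = self.main + [pr_id]
            if ci_passes(candidate):
                self.main.append(pr_id)
            else:
                print(f"{pr_id} failed against the queued state; removed from the queue")

queue = MergeQueue()
for pr in ["pr-101", "pr-102", "pr-103"]:
    queue.enqueue(pr)
# Pretend CI fails whenever pr-102 is part of the candidate state.
queue.process(ci_passes=lambda candidate: "pr-102" not in candidate)
print(queue.main)  # ['pr-101', 'pr-103']
```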

We’re still on this incredible story about GitHub stopping engineering feature work for a quarter. So after that quarter was over, fast forwarding to the current day, how are you staying on top of your developer experience now? 

Right, and that’s so important. Coming out of that quarter and the investment that we made, we don’t want to backslide.

We put in what we call anti-slip measures. Where we landed coming out of that quarter became the new baseline, and we wanted to make sure we could only grow from there. So the anti-slip measures are key measurements we watch carefully, even more than the core DORA metrics. 

Basically we focus on these three things:

  • Elapsed time. Whether it’s from commit to merge, or commit to deploy, or just the build times, or the time to go through that queue: what is the elapsed time it takes for developers to ship their feature? Not just to one platform, but to all the platforms. And at what point does the automation kick in and take over, so they can go back to their day?

  • Success rate (or, inversely, error rate). If our test platform is failing developers and they have to retry, that’s obviously something we want to minimize. But there’s also the concept of flaky tests: the same test on the same codebase in the same environment sometimes passes and sometimes doesn’t. How do we handle and minimize those? Because they’re another friction point. That’s something we keep an eye on and make sure remains in a healthy state.

  • The number of manual steps. This does factor into the elapsed time, but in and of itself, it’s important to understand how many click points and chat bot messages and whatever it is that you have to do to get your job done to get that code out onto your environment. If it takes you a hundred steps and they’re error prone, difficult to follow, or hard to remember, then that’s a separate signal. Everybody wants an ideal one-click solution. 

So those are the three big things that I look for across the board. And then we pair those with individual slices that are more actionable per department or team. 
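
Note: as a rough illustration of the three measures described above, here is a minimal Python sketch that derives them from hypothetical pipeline-run records; the field names are made up for the example:

```python
from datetime import datetime
from statistics import mean

# Hypothetical pipeline runs: commit time, deploy time, whether the run
# had to be retried, and how many manual steps the developer performed.
runs = [
    {"commit": datetime(2021, 9, 1, 9, 0), "deploy": datetime(2021, 9, 1, 10, 5),
     "retried": False, "manual_steps": 2},
    {"commit": datetime(2021, 9, 1, 13, 0), "deploy": datetime(2021, 9, 1, 15, 40),
     "retried": True, "manual_steps": 4},
]

# 1. Elapsed time from commit to deploy.
elapsed = [(r["deploy"] - r["commit"]).total_seconds() / 60 for r in runs]
print(f"mean commit-to-deploy: {mean(elapsed):.0f} minutes")

# 2. Success rate (runs that did not need a retry).
success_rate = sum(not r["retried"] for r in runs) / len(runs)
print(f"first-attempt success rate: {success_rate:.0%}")

# 3. Manual steps per run.
print(f"mean manual steps: {mean(r['manual_steps'] for r in runs):.1f}")
```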

We have a monthly meeting where all these things are measured and reviewed. We notice whether things are trending in a good direction or if they’re backsliding. We watch these monthly and then factor what we see back into our larger planning. Because if things are not staying healthy, that’s going to have an impact down the road.

So the faster and easier it is to stay on top of that and make space for that activity to happen, that’s how you stay healthy. That’s how you can see where you need to direct investments.  

Well, I really appreciate the piece about how your groups meet monthly to review the measurements, because one thing we hear from so many leaders is that they have some measurements, but they’re not sure how to operationalize them. I’m curious to dig a little bit deeper into that. Without sharing numbers, where are these measurements captured and displayed? What are your groups actually reviewing?

So we have a framework for everybody to feed their information into so that we can aggregate it and report on it. But we do not dictate that it has to come from specific tools. We leave that a little more loose, as long as it fits into the framework so that it can be reported and aggregated.

And I think that that’s a good pattern. It always comes down to how much you standardize to make it easy for everybody to participate. It’s a balance and it’s going to vary for every organization and situation. The important thing is that you do have something that’s coming together, and you’ve got a way to bring that information together, review it, and factor it into planning. But yeah, for us, we have probably a dozen different tools that feed into the framework. 
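
Note: the key to that pattern is a common record shape that any tool can emit, so measurements can be aggregated regardless of where they come from. A minimal sketch of what such a framework contract could look like, with hypothetical field names, follows:

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class HealthMeasurement:
    # Minimal common shape every reporting tool maps its data into.
    team: str
    metric: str        # e.g. "ci_minutes", "deploy_success_rate"
    value: float
    source_tool: str   # which of the ~dozen tools produced it

def aggregate(measurements: list[HealthMeasurement]) -> dict[tuple[str, str], float]:
    """Roll measurements up per (team, metric) for the monthly review."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for m in measurements:
        buckets[(m.team, m.metric)].append(m.value)
    return {key: mean(values) for key, values in buckets.items()}
```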

Who is looking at these metrics? Is it primarily executive leadership and your developer experience group, or are you involving product and engineering teams into this process as well?

One thing that you touch on there really highlights the importance of the Engineering and Product partnership. We actually call it engineering, product, and design. We make sure that we are operating together as a close-knit unit. 

So yes, the entire process that I’m describing here is driven by engineering, but we’re just the ones driving it; Product is involved every step of the way, reviewing it with us.  

And then as far as levels, our Senior Vice President of Engineering is the one who is accountable for all of it. He sits in on those meetings and oversees it. And every vice president gets their pieces of it and makes sure that they drive that health through each of their organizations. 

At the same time, I as a director have my pieces of it and understand what’s happening within my group, and every engineering manager in our group also watches their piece of it and makes sure that they’re staying healthy long before it makes its way up to the SVP. The idea is that once something gets up to our SVP, he just looks at it and says, thanks everyone, good job. Or, because we’re watching this all along, we can give early signals when something is going in a bad direction. We can give a heads-up at the next meeting: “We need to pause on this and make some investments here.” We can come in with recommendations already. 

How do you, in your role, understand which pieces of the puzzle your group is responsible for driving across GitHub, versus which pieces the individual teams are responsible for driving for themselves locally? 

We spend a lot of time making sure that we have a clear area of responsibility. Not to say “hey that’s not my job,” but to make sure someone has eyes on these things. 

We also made a large effort to look at the entire inventory of services that we operate and made sure that there is a name and a team point of contact on every single one. The only services that are, I would say, “unknowns” (but not un-owned, because we send them to our SVP) are the culture-building chat ops. The fun things that we collectively own as a group. But beyond that, every service that we have has an owner and a champion. And then they state how much of it is funded and what the current state of it is: is it production-ready? Is it experimental? Within that ownership you also get all the metadata that comes with it. So if something falls into the area of responsibility of one of my teams, I as director of that team become the escalation point for it.

So we have a clear, responsible team for every service. And that took a long time to get sorted through, because there were a lot of things that had been spun up in the past and nobody knew who owned them. 
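
Note: the service inventory described above boils down to attaching ownership metadata to every service. A minimal sketch of what a catalog entry could carry, with hypothetical fields and names, looks like this:

```python
from dataclasses import dataclass
from enum import Enum

class Maturity(Enum):
    EXPERIMENTAL = "experimental"
    PRODUCTION_READY = "production-ready"

@dataclass
class ServiceEntry:
    name: str
    owning_team: str         # every service has a named owner
    escalation_contact: str  # e.g. the director responsible for that team
    funded: bool             # whether ongoing work on it is staffed
    maturity: Maturity

catalog = [
    ServiceEntry("example-artifact-store", "dev-infra", "director-a",
                 funded=True, maturity=Maturity.PRODUCTION_READY),
]
unowned = [s for s in catalog if not s.owning_team]
assert not unowned, "every service needs an owner"
```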

Something you touched on when you were mentioning the metrics you looked at before kicking off that quarterly initiative was surveys. How do surveys play into the picture of the metrics GitHub looks at? 

Every quarter we also do what we call a development satisfaction survey. It’s a way to check in with your folks about how effective they feel they’re being, and how they would rate their experience with various parts of the life cycle that they go through. 

We try to quantify on a regular basis to see how we’re trending. And that also helps direct investments into the area and gives really good feedback into those processes, so that we know where and how to invest our funds for the teams we have.

Those surveys are absolutely key for us, which is funny, because when we talk about surveys, people also talk about survey fatigue. And it’s a real thing because there’s a survey for everything, but we get really good participation because it really directly influences people’s day-to-day. 

So it’s really rewarding to see some of these comments, input, and feedback come in. 

The other thing I want to point out is we don’t rely just on the surveys. One of the things that we’re continuing to expand here, is figuring out how we can essentially productize our internal experience and have product engaged with that. We want to drive that just as we would any product. If we want to show the world how to do this, then it is a product, right? 

When your employees are engaged and taken care of, they feel heard, and these things are being acted upon, it pays dividends all the way. 

You brought up survey fatigue. How does GitHub address that? What are things you do to try to make sure developers don’t feel like they’re providing feedback into a void?

Right! Because the question from them is, “did anything really materially change in the last three months? Why are you asking me all these questions again?” 

People need to feel like they have a clear avenue to provide feedback, but the key thing is whether that feedback leads to change. If you’re just surveying, and then nothing happens, you don’t ever talk about it, and nothing ever changes, that’s rough. That’s where that fatigue comes in. 

But if we do turn that into action and openly acknowledge the things we learned, here’s what surprised us or didn’t surprise us, and then report back on what changes were made, that’s when people engage. 

So as an example, we weren’t feeling great about our CI speed. So, every week let’s report on it: here’s what our CI speeds were. Here’s what we’re doing about it. And as we saw CI speeds coming down, we celebrated it. 

So communicating and keeping that feedback loop going has helped make sure people know that we’re listening to the surveys. They know we’re taking action based on what they’re saying in the surveys. That’s when you get the engagement because now they feel involved. 

And who at GitHub is doing that work of communicating and following up?

In this case, when it comes to the action being taken on the surveys, that does fall to our team. We partnered with the internal communications team to help, but yes, ultimately it’s our responsibility. The responsibility lies with us to make sure that this happens. And then it becomes implementation details for who we partner with to get it done. 

Your group obviously supports a lot of engineers across GitHub. What are some ways you handle communication and keep that feedback loop going, with so many developers across the world that you support?

One of the fun things we do is “manufactured serendipity”. We’re not in an office, we’re remote, and it’s not like we’re going to be having lunch and randomly strike up a conversation. Those things don’t happen. So how do you create that opportunity? We use a chat bot that we wrote that will randomly pair people together in Slack channels. The bot will say, hey, meet with so-and-so. So once every two weeks, you’ll get a new introduction and a prompt to set some time aside to meet them. And that honestly has been an amazing way not only to network and connect with folks around engineering; the amount of feedback that I get from those meetings is also incredible. 
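
Note: the pairing logic behind a bot like that can be very small. Here is a minimal Python sketch of random pairing on a repeatable cadence; it leaves out the actual chat integration and uses made-up names:

```python
import random

def pair_up(members: list[str], seed: int) -> list[tuple[str, str]]:
    """Randomly pair channel members; the seed could be the ISO week
    number so each two-week cycle produces a fresh, repeatable shuffle."""
    shuffled = members[:]
    random.Random(seed).shuffle(shuffled)
    pairs = list(zip(shuffled[0::2], shuffled[1::2]))
    if len(shuffled) % 2:  # odd one out joins the last pair as a trio
        print(f"{shuffled[-1]} joins {pairs[-1]} as a trio")
    return pairs

print(pair_up(["ada", "grace", "linus", "margaret", "alan"], seed=2021))
```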

Another idea is to set up an internal customer board, which is something that we’ve practiced in the past. Take a key individual from every part of the organization and establish them as a point of contact for the areas that they represent, so that they can go and solicit feedback and bring it to your team. And then, when we’re taking action, they bring that information back to their groups. That is a very intentional, scheduled activity with expectations of the participants, and that has been really helpful for us in the past. 

Another thing that we’ve done is to treat this like a product engagement. We treat this like a product, because developing on a collaboration platform is what we want the world to do. 

But whatever it is that you’re trying to do, make that a part of your business and treat it like a product.

So between manufactured serendipity and then very crafted intentional communication patterns to stay in touch with folks, those to me are the two key things. 

And I’d also say, just make sure the conversation doesn’t stop. There’s always going to be points of friction, right? There are always things that we want to see improve. I want our people to be the ones who are inviting and continuing that conversation. 

Resources