
Staffing infrastructure teams with Will Larson

Will Larson, CTO at Calm, covers a wide range of topics including whether Infrastructure Engineering is chronically understaffed, the role of Eng Ops, how his opinion on the “build vs buy” question has changed, his thoughts on metrics, and more.

Transcript

Abi: Can you start off by introducing yourself and giving a quick overview of your background?

Will: I lead the engineering and data science teams at Calm. Before that I was at Stripe, Uber, and a handful of other places. Also wrote a couple of books, wrote An Elegant Puzzle back in 2019 and last year finished Staff Engineer.

One of the things we look at on this podcast are all the different types of teams out there that are trying to enable and empower engineers. One of those types of teams are infra teams. I know you’re writing a book right now about infra teams. Can you share why now, and what the impetus was for writing that book?

A lot of my early career I was working in startups, and startups are a bit of everything. But as I got deeper into my career, I mostly did infrastructure work at Uber and then at Stripe. One of the things, when I came to Calm, was the excitement not to be doing only infrastructure work, although I do love it.

I’ve just been thinking about how much I’m forgetting about infrastructure with each passing week. I think there’s something really powerful about working in infrastructure. It has a certain worldview. But as you work in other roles, like larger roles, you start forgetting that worldview a little bit. So I wanted to start writing it before I forgot it entirely.

Before you were leading infra teams and now you’re a CTO. I’m curious how the world looks differently now that you’re on the other side. And specifically, I know you’ve written a little bit about this, but when we talk to teams they’re sometimes struggling with knowing how to advocate for their own existence, and specifically like the business case for infra or enablement teams. How do you approach that as a leader?

As I’m doing some interviews for this infrastructure book, this is one of the themes I would really focus on, because it’s such an interesting theme. I talk about the invisibility of successful infrastructure orgs. And so, if you’re a bad infrastructure org, people know about it because people are complaining to your CTO, to the head of engineering, to each other. Can’t get anything done.

The builds are too slow. The tests fail constantly. “Have a three year old MacBook, so I can’t use a new M1 build,” or whatever. But really good infrastructure teams are invisible. I think when you’re working infrastructure, this is pretty demoralizing, because you’re like, “Hey, I’m doing amazing work and no one notices.”

But the reality is almost every company you talk to… And there are exceptions out there. There are companies whose work, their business is infrastructure, if you’re working at Render, if you’re working at AWS, you’re selling infrastructure. Most companies aren’t selling infrastructure. You can think about Stripe, for example, the payments infrastructure, that’s what Stripe sells. But they don’t really sell like compute infrastructure or something.

And so, it actually is a sign of business success when you’re invisible, but you have to figure out, how do you actually advocate for your teams and for your staffing and make sure your infrastructure engineers have a career despite the fact that they really should be invisible if everything is going really well?

That’s really interesting and makes sense about success meaning that there is no friction and the team’s work is invisible. When you reflect back on your career, do you think these infra teams were chronically understaffed or appropriately staffed?

Different growth rates of companies really change what that means. And so, I think when I was at Uber, we doubled headcount every six months. And so, I joined at about 200 engineers. Six months later, 400; six months later, 800; six months later, 1,600. And it just kept going. And so the pain of infrastructure in that moment is really brutally intense, because the scale of traffic was also doubling or more every six months.

And so, I think in those cases, you’re always going to feel like you’re understaffed. The challenge though is, as you look at Uber’s growth rate over the subsequent years, is that you do peak at some point and then stabilize into a more mature business that grows in a different way.

And so, the challenge from the business perspective, I’m ignoring the pain and the plight of the actual infrastructure engineers, is that if you invest to make it work during the rapid growth era, you’re radically over-invested for the later eras. And Uber went through a number of layoffs. And so, it’s trying to figure out, what is it going to look like now, this painful growth state? What’s it going to look like when the business stabilizes, and how far out are you from the actual stabilization?

Stripe, conversely, really fast growing company, but a company that grew about 30% headcount year over year. There are lots of reasons why those businesses are not the same business. They’re very different in lots of different ways. But 30% year over year growth, a little bit milder. People can actually learn to get to know each other. Sometimes we grew closer to 60%, but 30 to 60% headcount growth.

You can actually pause at that point, if you overhire a little bit. If you just slow down a bit, you’ll grow into the team, even in a mature business. And so, it depends a little bit on the actual business you’re supporting. But I do want to say, and I think this is the most important lesson, that for infrastructure teams there is often this perspective that you’re being ignorantly defunded, that there’s an ignorant perspective happening at the senior leadership level.

But actually, I think it’s a reasonable perspective. It’s just a really uncomfortable perspective to be stuck operating in if you’re in the infrastructure team itself.
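
As a rough illustration of the two growth rates described above, the following sketch compounds headcount under each. The function name `project_headcount`, the 18-month horizon, and the use of the same 200-engineer starting point for both cases are assumptions made for comparison only; the conversation supplies just the six-month doubling from about 200 engineers and the 30 to 60% yearly range.

```python
def project_headcount(start: int, annual_growth: float, years: float) -> float:
    """Compound a starting headcount at a fixed annual growth rate."""
    return start * (1 + annual_growth) ** years


# Doubling every six months is the same as quadrupling each year (+300%).
doubling = project_headcount(200, 3.0, 1.5)

# Hypothetical comparison: the same starting size at 30% and 60% per year.
steady_low = project_headcount(200, 0.30, 1.5)
steady_high = project_headcount(200, 0.60, 1.5)

print(f"doubling every six months: {doubling:.0f}")
print(f"30% per year:              {steady_low:.0f}")
print(f"60% per year:              {steady_high:.0f}")
```

Under these assumptions the doubling case reaches 1,600 engineers in 18 months, while the steadier cases stay in the low hundreds, which is the gap that makes staffing for the peak so costly later.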

You touched on something earlier when you were talking about the invisible work of infrastructure. You mentioned that for a lot of businesses, they’re not in the business of infrastructure. I know earlier we were talking a little bit about build versus buy. Does this play into that? I know you mentioned maybe your opinions on this or your approach to this has changed a little bit over the years.

Build versus buy I think is a really interesting question. There’s certain companies that have strategies, very fundamental strategies. Again, when I was at Uber, Travis Kalanick, our CEO at the time when I was there, believed that Google would steal our software if we ran on the Google Cloud. He believed Amazon would steal our code if we ran on AWS.

And so, at that era, there was no cloud usage at the company. I would argue that was a little bit further on the paranoid scale than the reasonable scale. It would be really bad for Amazon if it was known they routinely stole competitors’ software or something, also a violation of their contracts, et cetera.

But companies do really shady things at various points in their time. So it’s not an idea that’s totally unreasonable. I understand how we got there. But that meant we built everything in-house. At one point, Uber built a Greenhouse clone, just everything in-house. There’s something to be said for this. Everything was hyper-specialized and it worked really well for the Uber context.

But if you ever have interviewed a senior Google engineer, a really talented engineer from Google, they come to your programming interview and they actually can’t program because they’re so used to all the Google tools, the Google libraries like, “Oh, this library just doesn’t exist.” You’re like, “Yeah, it just doesn’t exist anywhere else.”

And they’re great programmers. They just don’t know how to work in any of the other tools that they use in a more general startup, an earlier stage startup. And you specialize your team in an awkward way when you go this direction. What I’ve really thought about for smaller scale companies, and by smaller scale, I mean, those without a thousand engineers, those without like 2,000 engineers, we really shouldn’t be building bespoke infrastructure unless there’s a really compelling reason why.

And so, I think about, if you’re doing an observability startup, observability, like storing traces, spans, metrics, it’s actually an important infrastructure problem for your business. With the cost structure of an observability business, if you don’t have a specialized data store, it’s actually really hard to do something competitive.

Conversely, if you’re doing something like an issue tracker startup, your database isn’t that specialized, and there’s really just no reason to do it. And you’re going to lose a tremendous amount of time on it. And so, I think companies often get a little bit caught up on what engineers are excited about, but you have to really figure out your technology strategy from the business strategy, not from the excitement strategy.

If you design your vendor strategy around what’s exciting for engineers, you will excite people today, but you’ll make a business that’s very hard to operate and not exciting because you’re dealing with technical debt a year later.

You gave some examples there, for example, like Greenhouse. And that made me think there’s such a wide scope of what you could call infrastructure for a company. From a standpoint of leading an infrastructure group, how do you narrow down what you should actually be focusing on?

This is a problem that infrastructure teams think about a lot because there’s a throw it over the wall idea, where you’re… A concrete example is with Stripe. Stripe was a very different company when I joined. There were very few managers at that point. I came in, I was managing 30 people directly. One of my peers was managing over 50 people directly. It was like managers are relatively useless, and maybe we are. But it was a very different company.

The first thing that happens as a new infrastructure leader at a company is that people start bringing you stuff, and they’re like, “Hey, infrastructure should probably own this thing.” And so, navigating that is really challenging, particularly when you’re just getting to know the team. Almost by default, infrastructure teams at fast-growing companies feel really under-resourced, and people are coming in like, “Shouldn’t infrastructure own the mail service?”

There was a PDF generator, “Shouldn’t they own the PDF generation?” It’s this platform all the way up to the PCI environment where the credit cards were tokenized, like, “Hey, shouldn’t you own this because it’s infrastructure?” Some of those things were, actually, yes. And some of those things, actually, no.

And so, how did we think about that? First it was just like, “What can we actually take on today in terms of like staffing, resourcing, skillsets?” Second, there was I think a different mentality for certain things, that you needed a slow, thoughtful safety first perspective. I think our PCI environment is a great example of something that’s safety first.

We’d rather not implement a feature this month. We’d rather slip by a month or two months than introduce a security flaw. And so, we always took a safety first perspective there, even if we missed deadlines. It just had to be that way. Whereas there’s a lot of other things, I think like Stripe’s Atlas product, like helping people incorporate new businesses.

That doesn’t need a safety first perspective on a lot of it. Certain parts certainly do. But for the most part, it’s much more product driven… How do we iterate quickly? How do we add as much value quickly as we can? And velocity is a little bit more important than structured, defensive thinking. And so, to me, defensive thinking is one of the things I think infrastructure can do really well.

It’s a pullback, it’s a counterbalance to the product velocity or the growth velocity that you get for things that need to move quickly. The other thing though is the reproducibility of the use cases. And so, if you only have one user, it’s not a good infrastructure choice because you’re really just part of that team that happens to be positioned in infrastructure for some sort of logistical political reason.

But you work for that team because they’re the only user. If you have many different teams using the platform, then it’s obviously a great candidate for infrastructure. The challenge though is, like a lot of things, people have them and they will sell you on the road show as like, “Hey, we’re the only team using Elixir for our services. But down the road, everyone should be writing Elixir. They just don’t know it yet.”

And so, I think one of the key things, and this is not a hypothetical example. At Uber, one team started trying to run Elixir and they were like, “Oh, man, this is terrible.” At my first job at Yahoo I did a lot of Erlang programming, and if you’ve ever seen those stack traces, they’re just the worst stack traces I’ve ever seen in any programming language. They’re not very helpful.

So someone’s like, “Hey, how do you debug this? I literally can’t read this.” I’ve done Erlang for a while. I don’t know. But figuring out where the right interfaces are in infrastructure, such that you simply don’t take on things beyond those interfaces but can actually scale across the different users within them, is really valuable.

So when we rolled out services at Uber, the Docker container was the primary thing that we did initially. That was a little bit too broad, for what it’s worth. So we moved a little bit more towards, “Hey, we have some template scaffolding. We’ll scaffold services in one of six different categories. And if you don’t do that, just figure it out yourself. We’re not going to help you.”

But finding that interface, anything beyond running the container initially, we just didn’t help with, and that worked out pretty well for us.

I will admit, as a crafty old Ruby developer, Elixir has caught my eye, though-

It’s a great language. I think a lot of people love Elixir. It’s not that Elixir is wrong. It’s more that in infrastructure, again, your position forces you to be a little bit crafty, a little bit thinking about the broader good, even in cases not supporting the specific good for any individual team.

I saw you recently did an interview that really focused on developer experience. And I saw a little chapter pop up in your book preview around that. So from the standpoint of an infrastructure team, where does this attention on developer experience come into play?

There’s a couple of different ways to think about DX. The most common way that people talk about DX is the external facing version of it. And so, developer evangelists, et cetera. By a quirk of nature, the developer relations team at Stripe was within this infrastructure group, but it just happened to work out that way. It wasn’t a grand design, but really phenomenal people.

Really loved getting to learn from them and work with them a little bit more closely. But I think it was relatively understood what this external facing DX concept looks like. What I think is interesting though is that when you look at infrastructure, when you look at migrations, when you look at large roll-outs of new systems, the major way these things fail is a lack of attention to how users will actually use them.

And so, the number of cases of very large migrations, new systems getting rolled out internally that fail, again, I would say almost every time they fail because of the lack of a basic, who are the users? What are the different cohorts of users? What are the real needs within those cohorts? What is the risk appetite or lack thereof within the cohorts for this new thing we’re rolling out?

And so, thinking of a concrete example of this, Uber had a really terrible request routing infrastructure that my team was responsible for. You can call it request routing infrastructure, but basically it was like HAProxy running as a sidecar on every single server… And then you’d go to localhost on a port that was statically allocated to a given service, and that would route across whatever your correct routing protocol was, maybe within the same region, whatever. Really terrible.

But it worked really, really, really well. And so, this is one of the challenges with production infrastructure. Statically allocated global ports are a terrible thing to do. They were initially recorded in a wiki, where you’d have to go look them up. And if you didn’t write yours down in the wiki, then someone else would claim it and then you’d have a production incident.
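
To make that failure mode concrete, here is a minimal sketch of a registry that refuses a second claim on the same port, which is the check a wiki page cannot enforce. Everything in it (`PortRegistry`, `claim`, the service names, the port number) is hypothetical and is not a description of Uber’s actual tooling.

```python
from typing import Dict, Optional


class PortAlreadyClaimed(Exception):
    """Raised when a second service asks for a port that is already owned."""


class PortRegistry:
    """Hypothetical in-memory registry of statically allocated ports."""

    def __init__(self) -> None:
        self._owners: Dict[int, str] = {}

    def claim(self, port: int, service: str) -> None:
        owner = self._owners.get(port)
        if owner is not None and owner != service:
            # On a wiki nothing stops this double claim; here it fails loudly.
            raise PortAlreadyClaimed(f"port {port} already belongs to {owner}")
        self._owners[port] = service

    def owner_of(self, port: int) -> Optional[str]:
        return self._owners.get(port)


registry = PortRegistry()
registry.claim(9001, "service-a")  # made-up service name and port
try:
    registry.claim(9001, "service-b")
except PortAlreadyClaimed as err:
    print(err)
```

The point of the sketch is only that the conflict surfaces at claim time rather than as a production incident.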

But it works. And so, there was a new system that was developed, partially rolled out, then ultimately scrapped. I think the challenge of that is it worked really, really well for a specific set of users, like one cohort. But it didn’t work particularly well for many other cohorts. I think when infrastructure engineers find an interesting problem, that there’s an easy bias to like, “Hey, if this is useful for someone, it’ll be useful for enough people to maintain it.”

But actually, as a company maintaining multiple different request routing tech strategies, it’s really hard to reason about. You’re looking at like a set of traces and they jump across different routing technologies where it’s easy to lose spans. It’s just hard to reason about what’s actually happening here. And so, this is, again…

I think if there’s one thing that infrastructure engineers need to get better at, and this is I think a core theme I want to keep delving into as I try to write this book, it’s actually thinking about the cohorts, product management, essentially, of like, what are you actually trying to build, for whom and why, or when will they adopt it?

Again, a lot of the service migrations that you see people… Microservices, really, really big a decade ago. I think the thrill of microservices has ramped down a little bit. I think Kelsey’s services or the distributed monolith tweet or blog post or whatever from a few years ago captured this idea. But if we just thought a little bit about how things actually work, what people actually need, the company’s actually tolerant of, I think a lot of these problems are very obvious from the beginning.

We just have to do a little bit more work to actually do product management and not think of ourselves as pure technologists, or to not think of technology as technology in the abstract, but actually technology in the service of a real concrete problem we have.

It’s interesting. You said you’re focusing on this intersection of product management and infrastructure engineering. In your experience, are there typically product managers as part of these teams? Is that something you advocate for now?

I’ve not seen much success for that. At Stripe, but also at Uber, I spent a lot of time talking to other folks leading infrastructure groups and was like, “Hey, how have you solved product management?” So some of what I’ve learned is that a classic problem is that there’s the product manager’s career path, what’s good for them, and then there’s the infrastructure requirements, and the intersection is usually not great.

And so, for example, I spoke with folks at one company, they’re like, “Hey, we can only get very junior product managers to work in infrastructure. After they mature a little bit, they want to bounce out to work on the consumer side of the business because the scale’s larger, the opportunity for business impact.”

All the problems that make infrastructure hard as infrastructure leaders, like we’re invisible, also impact the product managers in the exact same way. And so, I think it’s hard to find people who are really excited to work in an infrastructure team within a company.

This isn’t true, for what it’s worth, if you’re an infrastructure business that is selling a piece of infrastructure. So I think it’s much easier to find a product manager who wants to work, if you’re like Datadog or something, selling the actual product to end users. It’s really the internal infrastructure teams that suffer this problem the most.

But in my experience it’s very hard to find these folks. It’s very hard to find them, for internal teams in particular. So what we did at Stripe is we relied on the staff plus engineers and the eng managers together to do this. It does mean that those jobs get broader, more expansive than at some companies.

But one of my personal hobby beliefs is that we’ve gotten a little bit too specific with career ladders and that we’re forcing people in these specific holes. We talk about, we don’t want people to be fungible, replaceable blocks, but we also want perfectly clear articulate career ladders. And we can’t have both. We have to flex on one dimension or the other.

Yeah, that makes sense how being a PM that’s internally facing, you don’t have that opportunity to ship the next big product for the company or launch the next big feature. This ties into the next topic I wanted to go into. I noticed you’ve been asking, in your interviews, a lot of leaders about what metrics they track. I’m curious why that’s of particular interest to you?

There’s a couple of different things I’m interested in when I dig into the metrics question. The first is I think there’s a question of how metrics oriented are these leaders at different stages? I think, at some levels, you could actually run an infrastructure organization of a certain size without looking at the metrics at all, as long as you have some sort of safety mechanisms, like monthly business reviews or something like that, where you’re not looking at a dashboard, but you’re holding people accountable to goals they’ve committed to, and you’re looking at those goals on a monthly basis with them, but the teams themselves are looking at them on like a weekly or daily basis or whatever.

Also an interesting question of like, how much are people looking at metrics around goals and how much are people looking at metrics around reality? And so, there’s a Datadog dashboard out there that has the number of requests that are coming in. How many of them are failing? How slow are they? Et cetera. That’s how the thing actually works.

And then there’s a dashboard, which is like, oh, 99.99% of requests completing within 500 milliseconds. And that’s like, yes or no. That’s more of like an SLA that you might set to commit to a user around or something like that. I’m just trying to get, what are people actually looking at, at different levels of seniority? Are they looking at real metrics or these goal metrics that people are going to hold them accountable for and why?
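
The distinction between a “reality” metric and a “goal” metric can be sketched in a few lines. The sample latencies and the function names below are made up for illustration; only the target of 99.99% of requests within 500 milliseconds comes from the conversation.

```python
def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Reality metric: share of requests that finished within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


def meets_goal(latencies_ms: list[float], threshold_ms: float, target: float) -> bool:
    """Goal metric: a yes/no answer against the committed target."""
    return fraction_within(latencies_ms, threshold_ms) >= target


# Made-up sample: 9,998 fast requests and 2 slow ones.
sample = [120.0] * 9_998 + [900.0, 1_500.0]
print(fraction_within(sample, 500.0))      # how the system actually behaves
print(meets_goal(sample, 500.0, 0.9999))   # whether the commitment is met
```

The first number describes how the system behaves; the second collapses it into the yes or no that someone is held accountable for.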

So far, no super clear trend that… There’s obviously a lot to say on the topic of engineering metrics. I think it’s an interesting one. The biggest thing, I would say, is that for any metric, the more you decompose it across cohorts, categories, or whatever, as opposed to looking at one top-level metric, the more interesting it gets.

For example, looking at your latency by region is a lot more interesting than looking at your global latency. Looking at your CPU utilization of your fleet across data center or across team that’s allocated to you, like data engineering versus machine learning, versus like production, those are way more interesting.

And so, the more that people find segments to look at, the more confident I am that they’re looking at something meaningful other than just an anxiety-reduction dashboard, that nothing’s completely wrong. Sure, but that’s not that interesting.
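
A small sketch of why segments are more revealing than a single top-level number: the global median below looks healthy while one region is clearly struggling. The region names and latency values are invented for illustration.

```python
from collections import defaultdict
from statistics import median

# Invented (region, latency in ms) samples.
requests = [
    ("us-east", 100), ("us-east", 110), ("us-east", 105), ("us-east", 95),
    ("eu-west", 120), ("eu-west", 115), ("eu-west", 125),
    ("ap-south", 900), ("ap-south", 950),
]

# Group latencies by region so each segment can be summarized on its own.
by_region = defaultdict(list)
for region, latency_ms in requests:
    by_region[region].append(latency_ms)

global_median = median(latency_ms for _, latency_ms in requests)
regional_median = {region: median(values) for region, values in by_region.items()}

print("global:", global_median)
for region, value in sorted(regional_median.items()):
    print(f"{region}: {value}")
```

With this made-up data the global median sits close to the two healthy regions, and only the per-region view shows that the third region is several times slower.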

You touched on developer or just engineering metrics in general. So I do want to poke at that a little bit more. In one of your blog posts you wrote, and I copied this quote, you said, “At pretty much every company I know, the question of how to measure developer productivity comes up, becomes a taskforce and produces something unsatisfying.” So I’m curious as to what types of experiences you’re referring to.

Most companies, this first comes up from this topic of, hey, how do we figure out which of our costs on engineering are actually innovation versus maintenance costs, so we can treat them differently in our books? And then you’re like, “Hey, which costs really are infrastructure, or maintenance, or operational costs? Which are innovation costs?” And you go down this rabbit hole. I’ve never seen a company actually be very happy with this ever. So that’s one category, one entry point to this thought.

The other one is when the CEO, the CFO, someone’s like, “Hey, what is the intellectually pure way we should be sizing our engineering team?” It’s really hard to answer that question. And so, I think the challenge is, heads of engineering, VPs, et cetera, are in this room and they’re being told to justify exactly how big their team is and how big it needs to be to accomplish certain goals.

They’re in that room with folks who can actually do this at a higher degree of confidence before you dig in too deeply. And so, for example, if you’re in the room with someone on growth, who actually is spending a large user acquisition budget, they can talk to how much budget each person is allocating and the results of that budget they’re allocating, like, “Oh, that really makes sense.”

Or if you’re talking to sales, they can do the exact same thing, like, “Oh, each additional hire I bring on is going to drive an extra 1.2 million of ARR this year.” Like, “Wow.” How much ARR does your next engineering hire drive? No one knows. I don’t even think that’s a real idea. But ultimately, to pivot that, I think this idea that we’re going to find metrics that help answer this question is a little bit misguided.

I think when we have metrics, what those metrics really tell people is that you’re paying attention to something. And so that you’re in the details, and so they’re going to trust you. These metrics build trust. The actual metrics themselves aren’t that helpful for what they’re being used for in many cases, which is like headcount, team sizing, et cetera. They’re really good at debugging teams and debugging productivity. But they’re terrible at actually understanding how big your team should be.

Before this call, we had just mentioned this new EngOps thing that is a trend. We’ve met with a few EngOps leaders and they’ve sort of differentiated themselves from traditional DevProd or infra teams, being a little closer to, for example, HR, a little more holistically focused on culture and practices, as opposed to just tools. I’m curious on your take on EngOps as a function.

I think EngOps is a pretty exciting idea. I think you always have to be a little bit skeptical about any function that spins up to take work off another function. It’s because the function that’s having work taken off it is incentivized to do that. And so, I think, in a lot of companies, people really want to hire more managers, because engineers are like, “Oh, the managers are going to do the work I don’t want to do.”

Or in a lot of companies, people are like, “Hey, we should hire more QA because I don’t want to do testing, and we’re going to push this off onto testing.” If you look at a lot of the trends in the industry, DevOps in particular, we’ve pushed a lot of work back into the software engineers’ wheelhouse, where it used to be, “Hey, I’m just writing the code. Then someone else figures out how to build, deploy, test, and operate it. I just write the code, man.”

So we’ve pushed that back and the engineer role has gotten really complicated. I think the same thing has happened for engineering managers where… When I entered the industry, I had like two one-on-ones in a year with my manager. They weren’t super insightful one-on-ones. It was like, “Hey, we’re going to give you a raise,” or like, “Hey, you’re doing fine.” That was it. It was like one every six months.

They’d just show up on the calendar every six months. And I was like, “Great, guess my manager remembers my name.” That was really it. The level of management we expect from managers today is radically different. I spend an hour a week with many of the folks that I manage directly, and that’s not totally abnormal. Many people might be spending 30 minutes a week. It depends a little bit on the roles, the velocity, how long you’ve been in role, et cetera.

EngOps is, as I see it, really trying to figure out how to create more bandwidth for senior engineering management to operate a lot of the organization. You think about calibrations, for example. HR teams do that for many companies. For some companies, the managers just do it themselves. I think EngOps is similar. Or take engineering onboarding.

Many times, the managers just do it themselves. But EngOps can help with that. I think the advantage of EngOps is that at a certain scale, doing things well and consistently saves a tremendous amount of time. And so, to me, it’s really just a question of like, when does it actually make sense to start bringing these folks on? And how do you make sure that they’re actually doing high quality work?

I think onboarding’s a great example of something where it’s very unclear whether onboarding is actually good or not, and figuring out the right metrics. And so, for example, at Uber, one of the things we did in onboarding was that, can every single engineer coming in spin up a new service?

This created a lot of problems related to the statically allocated port thing I talked about, where we had started giving out like a thousand ports every week for eng onboarding, which was not going well for us. Fixed that pretty quickly. But figuring out, what are a few outcomes that actually matter?

Otherwise, it’s easy to just say spending time on something makes it better. But that’s not actually true. Spending time on things does not necessarily make it better.

That makes sense. When we talk to engineering managers at the intersection of EngOps functions, it seems like part of the impetus for these EngOps orgs is to, like you said, fill in the gaps of what maybe engineering managers don’t really have the capacity to do, some things like focus on onboarding.

Do you think managers are under, for example, too much pressure just on delivery items that they’re not able to do the work of building good practices? Do you think that’s part of the reason these EngOps functions are popping up?

I think that is definitely one of the things that’s happening. I will say, so I love systems thinking. Love systems thinking. But as you go to any company, particularly any growth company, there’s this tension between, how do we do better meta work? Like how do we onboard more effectively and how do we do this really specific, important, critical task?

“How do we finish our SOC 2 Type 2 audit this week?” Or something like that. And so, I do think that managers are often caught between these two different tensions. I think one of the challenges is that, similar to how we’ve made the software engineer role relatively complicated by moving things up from the operations side into the engineering role, we’ve pushed a lot into the manager role in terms of the increased quality of management we expect from folks, on top of the other work that they were doing before.

It’s not like managers weren’t doing anything before, they were doing a lot of work. Now that work’s getting squeezed into an increasingly small segment of their time. That work has to go somewhere or it just falls to the floor. And so, I do think that EngOps has helped in terms of capturing that work that’s just falling. It’s gotten squeezed out by managers actually managing their people in more effective ways.

But the strategy work, this operational work, like running a business as opposed to running a team, I think that is getting squeezed out. How do we find more space for that? To me, EngOps is one solution. I’m really excited to see the industry experimenting, trying that. I do hope there are more approaches we experiment with. I’ve never seen one idea work everywhere.

TPMs are similar to EngOps in this specific way. Many companies have rolled out TPMs. It works really well for many companies. But there’s no consistency across what TPM means. If you talk to four companies, you’re like, “This is a totally different job. It just has the same name.” I think EngOps is in that phase right now. I’m curious to see how it falls out over the next decade.

This balance between running the business and running a team, how have you approached that personally with your reports and managers at different companies?

An understated reality that I think is really important is that when you work with one team consistently for like four plus years, the amount of time you spend running that team really goes down. You just know them really well. You know what they’re going to be excited about. You know what they’re going to be pissed off about.

You know their partners, you know their kids. You just have this depth of relationship. In really fast-growing companies, you never get there because the team that you’re working with keeps changing so quickly. And so, I do think that these fast-growing companies are exceptionally challenging in terms of getting space to actually operate the business.

That’s difficult because these businesses are changing really, really, really quickly as well. And so, I do think that, one, if you just stay with the same team for long enough, this gets a lot easier. That being said, I do think, for me, in a good week, I might have like eight hours to actually operate the business as opposed to operating the team. And that’s not really enough.

And so, I think we’re in a very strange year, 2022. We’re going to see a lot of businesses that are not going to come out of 2022 in the same form they’re going into it, as you look at the funding changes, as you look at the macroeconomic shift happening. A lot of that’s, in my mind, because we’re spending too much time locally, like optimizing the team and not looking at the business more widely.

That being said, I think this year is going to be a year when people are forced to look at the business more widely simply by the rate of change around this. I think that’s going to be, one, very painful, but, two, I think good in terms of helping us reset a little bit away from only focusing on team management, and getting a little bit of a broader, more equal perspective on it.

Thanks for sharing those thoughts. I want to switch topics to something a little bit tactical, because it’s something people have been asking us about. You have a chapter around surveys, developer productivity surveys, and you wrote something funny in that chapter, something about how in the best case, you get all these great insights about your organization. In the worst case, you get this data, you do nothing and everyone is bitter about it. So what’s been your experience with running these types of surveys?

Surveys are really interesting, and there are lots of surveys out there. I think one example I can think of from running a survey was we had a specific infrastructure engineering team at Stripe who had some users who were complaining about them a lot, and we were like, “Hey, we’re going to solve this by surveying the users frequently.”

We got monthly feedback and we built like an NPS model. We got like, “What are the concerns, what’s going well,” blah, blah, blah. We showed really significant improvement to NPS over the three or four months, and significant commitment, excitement about the platform. But the people who were complaining never stopped complaining. And they never actually got more specific about their complaints.

It was more of a, “We don’t like it. We want to do it a different way.” This was demoralizing, because you think you’re going to roll out a survey and you’re going to fix this problem of like lack of alignment with a certain cohort of your users. But you can’t, necessarily. In this case, it wasn’t a situation where there’s a clear complaint and we weren’t hearing it. This was more of a broken trust or broken relationship between two different groups.

Surveys can’t solve everything, and I think going into that, we actually knew that there was a bad relationship there and hoped this would help create some visibility into that. But it turns out, it didn’t actually resolve anything for us. So that was an example of a case where the survey, which I think was an intellectually pure way to solve the problem we had, didn’t actually have much of a successful outcome.

Conversely, we ran a broader developer productivity survey every six months or so. And this was really, really helpful for us. We got to see which different platforms were degrading at different rates. For us, the Ruby and the Scala infrastructure were two different major components.

The lived experience in those two different ecosystems was very, very, very different. And having that visibility was super helpful for us, because it helped us pick, for the next projects, where we wanted to go, but also helped us pick what not to invest in. And so, for example, there were certain improvements on the Scala side that we chose not to make.

The ecosystem there, just the number of people working in it, was too small to make a large investment, even though they were very unhappy with it. And that data is really helpful. But where we got a lot of success was not tracking net promoter score or satisfaction. It was more understanding how things were moving in relation to what we were doing.

For example, we’ve made a tremendous number of build improvements over the years, and we looked at whether people got happier, by which I mean they started complaining about something else. Not literally happier, but a new number one complaint. That’s how we were able to understand, did this actually solve the problem or not?

Cohorting is really important as well. One person I spoke to for my interviews talked about looking at the complaints from engineers coming from long-term companies. For example, if someone was just at Facebook for seven years, coming in, that complaint is going to be meaningful in a different way than someone who has only worked at very small startups, just coming in.

And so, it helps to understand how people’s background impacts their perspective. Whether things are good, or the best they’ve ever seen, or the worst they’ve ever seen depends a lot on where they’re coming from. And pulling the HRIS data was really helpful for that.
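As a rough illustration of the two ideas above (watching whether the number one complaint changes between survey rounds, and cohorting responses by background), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Response` fields, the cohort labels, the round names, and the sample data are all made up, and none of it reflects the actual survey tooling or results discussed in this conversation.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    survey_round: str   # hypothetical label for a survey run, e.g. "2021-H1"
    cohort: str         # hypothetical background label joined in from HRIS data
    top_complaint: str  # respondent's biggest pain point, already categorized


def top_complaint_by_round(responses):
    """Return {survey_round: most frequent complaint} across all respondents."""
    counts = defaultdict(Counter)
    for r in responses:
        counts[r.survey_round][r.top_complaint] += 1
    return {rnd: c.most_common(1)[0][0] for rnd, c in counts.items()}


def top_complaint_by_cohort(responses, survey_round):
    """Return {cohort: most frequent complaint} within a single survey round."""
    counts = defaultdict(Counter)
    for r in responses:
        if r.survey_round == survey_round:
            counts[r.cohort][r.top_complaint] += 1
    return {cohort: c.most_common(1)[0][0] for cohort, c in counts.items()}


# Made-up sample data, purely to show the shape of the inputs.
sample = [
    Response("2021-H1", "large-company", "slow builds"),
    Response("2021-H1", "startup", "slow builds"),
    Response("2021-H1", "startup", "flaky tests"),
    Response("2021-H2", "large-company", "flaky tests"),
    Response("2021-H2", "large-company", "flaky tests"),
    Response("2021-H2", "startup", "docs"),
]

by_round = top_complaint_by_round(sample)
by_cohort = top_complaint_by_cohort(sample, "2021-H2")
```

If the number one complaint in `by_round` moves from one round to the next after a project ships, that is the "new number one complaint" signal described above. `by_cohort` shows whether people with different backgrounds are complaining about different things within the same round.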

I’m curious, as you ran both examples, the monthly and then the every-six-months one, did you feel… Because you’d touched on this in your book or post. Did you feel like there was enough follow-up happening that made the survey a long-term sustainable thing or, in some cases, have you seen them not be sustainable?

Yeah. I’ve definitely seen both cases. I think for the six-month one, the organization on developer experience was, man, I want to say it was maybe 50 folks or so when I left. And so, that’s a lot of folks. We were spinning up new teams, we were shifting priorities, et cetera.

So every six months, if we couldn’t do like three or four major projects for them, something was going wrong and then that was on us. So I thought it was really valuable there. Particularly seeing places where people had complaints and then going and talking to them directly was where we got the best insight.

So figuring out where are the hotspots and then going and talking to them directly, not relying on the survey. I think as you’ve done a lot of surveys, you know that if you don’t actually go do follow-up, people have pretty abstract complaints, like, “This is terrible.” Yes, tell me more. Or like, “This is amazing.” But it was also like, okay, what does that actually mean?

Because sometimes “This is amazing” means like, “My friend did it,” and sometimes “This is amazing” means like, “My builds are really fast,” and you have to dig in to figure out which of these actually means something.

When you would run these surveys, was there a percentage of feedback or insight you would generate that was actually not actionable to the developer experience group, but rather were complaints about things just happening on the local teams that they needed to do? And how did you respond to that?

Oh, 100%. And so, I think, for example, a team with really weak onboarding practices is going to come across as a team that really doesn’t like a lot of your tooling. And so, how do you figure out, is this just because they’re joining a team that doesn’t do any onboarding, versus how do you figure out if your tools are actually too hard to use?

There’s some definite judgment in there, and that’s where I think having people with great relationships… This is one of the reasons I love embedding, for what it’s worth. So something I’ve always tried to do with infrastructure teams is have folks on them, once a year or once every two years, go spend three months on a partner team working as an engineer on whatever they do, because, one, they get to bring some of the context of, how do the tools actually work?

Two, they get to bring back like, okay, here’s where the tools actually don’t meet our user needs very well. But three, they just have this relationship, so when you get that complaint next time, you can say, “Hey, could you go talk to this team you embedded with? What’s really going on there?” And get like a little bit more detail there. Because I think attribution is always a challenge for errors.

And so, I think you do just have to dig into it to try to have a clear point of view. But the reality is some of your teams that you work with just won’t be super put together at certain points in time, and that’s like unavoidable. But you do have to dig in just to know like, hey, am I going to do something about this or not?

Sometimes what you do is literally nothing. You’re just not going to effect change there. Sometimes teams are on like a death march to ship a product and nothing you do will make them happy. They’re just in a really bad spot, and that’s okay. But sometimes it’s like, hey, talk to the EngOps team: “Could you help this team with onboarding? They need a little bit of help to bring together a care team to help the team get over a bit of a hump that they’re in.”

Yeah. I like that analogy of EngOps as a care team. I think that is similar to the ways that they sometimes view themselves. Well, thank you so much for coming on the show. I really enjoyed this conversation.

Likewise. Thank you so much for having me.