Podcast

Behind the scenes of LinkedIn's developer productivity metrics

In this episode, Abi chats with Grant Jenks, Senior Staff SWE, Engineering Insights @ LinkedIn. They dive into LinkedIn's developer insights platform, iHub, and its backstory. The conversation covers qualitative versus quantitative metrics, including concerns about the terminology and how the two relate. The episode wraps up with technical topics like winsorized means, thoughts on composite scores, and ways AI can benefit developer productivity teams.

Timestamps

  • (01:10) Insights in the productivity space
  • (07:13) LinkedIn’s metrics platform, iHub
  • (12:52) Making metrics actionable
  • (15:35) Choosing the right and wrong metrics
  • (19:39) Why some questions are difficult to answer and track
  • (26:23) Top-down vs. bottom-up approach to metrics
  • (32:12) Winsorized mean and selecting measurements
  • (39:25) Using composite metrics
  • (46:57) Using AI in developer productivity

Listen to this episode on Spotify, Apple Podcasts, Pocket Casts, Overcast, or wherever you listen to podcasts.

Transcript

Abi Noda: Grant, so excited to have you on the show. Really excited for this conversation.

Grant Jenks: Yeah. It’s great to be here. Thanks for having me, Abi.

Abi Noda: So we caught up recently at DPE Summit where you gave this incredible talk. It was one of my absolute favorites. You’re a great speaker. The topics were of course really fascinating to me. We’ve had your colleague Max on the show to talk about what your group as a whole does at LinkedIn, but to start, share with listeners what you’re specifically focused on right now.

Grant Jenks: Yeah. So I focus a lot on what we call insights in the productivity space. So the best way to think about that is in terms of analytics, and I would say there’s both a qualitative aspect and a quantitative aspect. Quantitative aspect is typically fed by telemetry from tools, services, machines, hardware of some kind. Qualitative typically is telemetry from people. So people filling out surveys, giving us satisfaction scores. It’s still numerical, it can be, or it can be kind of qualitative like, oh, I felt really good today. But we try and combine those two and try and synthesize something from that that is really actionable for the productivity org.

Abi Noda: It’s funny you mentioned the qualitative data that’s numerical. That’s something that always bugs me. People say qualitative metric or quantitative metric, and in my mind a metric by definition is quantitative. How do you guys refer to it? Do you say qualitative versus quantitative metric, or numerical qualitative data?

Grant Jenks: I mean, in my mind it’s pretty distinct between this came from a human and so it’s … Another term we put on it is, this is subjective. We can argue about this. Or really what ends up happening sometimes is you can send someone to go talk to the person who gave that data point. When it’s a machine, it’s like there’s nothing to argue about here. We know that on this date, on this time, on this host, Grant executed or attempted this workflow and it’s recorded as such, and it took this long or it came up with this error message or whatever it might be. It’s indisputable. So sometimes I think of it more in terms of subjective and objective, but that also feels, I think, kind of offensive to people because they’re like, well, don’t discredit the subjective. And actually that’s often the one that’s more valuable. It’s like, well, that’s what people think.

So even if the quantitative says, everybody’s builds are fantastic, if people are like, I hate my builds, you should probably listen to that. There’s actually one example here that always cracks me up. We look at JavaScript developers who are trying to have a really good experience in the front end, and for JavaScript developers, the ability to edit something and then refresh in their browser is effectively a build for them, and they want that measured on the order of seconds. Like, 1, 2, 3, done. If it takes 25 seconds, they’ll exclaim, I can’t do my job. They’ll be like, this is impossible. No one could be productive here. And then you go and you talk with mobile developers, and I’ve had mobile developers tell me, oh yeah, first build of the day, or if I cleaned out everything, I did a hard clean, no caches, about an hour and a half to get into the simulated environment, and I think an hour and a half, that is awful.

I mean, you must do that and then wake up early, start the build, go back to sleep, wake up, do your job, and they’re like, well, it’s not quite that bad, but yes. And then they’ll tell me, actually, it’s a good experience. It used to be like two hours, Grant. And so you just have this bizarre dichotomy sometimes where you’re like, these developers are at 25 seconds and they claim they can’t do their job. These developers are in an hour and they say it’s pretty good. It’s never been as good as this. And so the quantitative and qualitative is hard to reconcile. Another way we think about it is in terms of, are you grading things relatively or are you grading things on an absolute standard? So we try and take an opinionated position on some things and say, any CI job that takes more than X minutes, it’s just bad.

I don’t care what you’re doing, it’s just bad. And we have a lot of good examples and thought processes around, if it takes that long, then you can only get a certain number of merges through in a day. You can only deploy so many times. You end up building all these complex systems to scale. So we have this objective reality where sometimes we’re like, this is just known bad. And then we also have this relative experience where we say, yeah, there’s one person on your team who’s experiencing a lot of productivity pain. Their builds are really slow. Do you want to go offer to help them? Are they new? Is that what’s going on? So subjective, objective, relative versus absolute. These are all different lenses for us.

Abi Noda: Yeah, really interesting. I agree with what you said about how the labeling or nomenclature of objective versus subjective is potentially offensive. And I agree because in my experience, when people view it that way, of course, especially leaders, well, we’re going to just trust the objective data, not the subjective. But like you said, we know that the subjective is oftentimes more valuable. But here’s the example I use when I kind of try to explain to people why human derived data can be viewed as objective. So I say, how old are you? And then I go, well, was that subjective or was that objective? So yeah, it’s an interesting thought exercise, and I know we kind of veered a little bit off topic, but yeah, the nomenclature of qualitative versus quantitative is really interesting to me as a problem of its own.

So I want to ask you, one of the things that’s really impressive about the work you’re doing at LinkedIn is this metrics platform that you’ve built. It’s called iHub. And I was confused when I saw iHub, it sounded like an Apple product to me originally. So for listeners, it stands for Insights Hub, of course. Share with listeners, what are the core capabilities of this platform that you’ve built?

Grant Jenks: So the vision for Insights Hub, Insights Hub by now is actually a suite of products. It grew out of data. So we began by collecting data. We collected build data, we collected CI data, we collected deployments data, and we started using your standard business intelligence tools. So Power BI, Tableau, Apache Superset, Grafana, all these different tools are great at making charts. And we used a collection of those to make the first generation of like, okay, here’s a dashboard where you can go look at a chart and you can say like, oh, here’s how often we had CI failures for this repository in the last week or whatnot. And what we found in doing that is that you kind of end up making … The way I see them … I’m going to blank on this. He’s a famous painter who’s known for making blocks of color.

Do you know who I’m talking about? It’s a really famous painter. Okay, it’s escaping me … If there’s a comment section, somebody in the comments will recognize this, but you kind of end up looking at these dashboards and they just look like big blobs of color. You’re like, well, wait, what did we really do here? It’s like, well, we made rectangles on the screen and these are meaningful rectangles. That’s our claim at least. And we did that for a while and then eventually we realized this is not having the impact that we want. And so the first step was kind of taking all these different things. I did this study at LinkedIn when we started four years ago where I was like, let me just see all the data we have and all the places you can go. And I think it turned into a 20 to 40 page document.

It was like, there’s a dashboard here, there’s a Google Sheet here, there’s a Tableau workbook here, there’s an Excel spreadsheet here. They’re all tracking different things. They all have different workflows. They all have different data sets. It’s overwhelming. Everybody has made their own little kingdom or little queendom and nobody’s synchronized. Very few people are automated. And so we realized there needs to be a single destination. So that was one aspect of Insights Hub. Let’s make a single destination. We realized too that … I think I gave this talk a couple of years ago. In the analytics world, you go through a progression: you can kind of be informative, then you can try and be a little more interpretive, and then you can go further and be predictive, or even prescriptive. There’s dimensions to it. And we said what we really need is something where people can understand and manage their metrics and their data, and we really want to move LinkedIn as a business to being what we call data informed and metrics forward.

So it doesn’t mean that metrics become the hammer by which everything is crushed or everything is … It’s not the yardstick by which everything now becomes measured, but we want people to be able to do it. And oftentimes if you just go to orgs and you’re like, use metrics, this is sometimes the story that plays out. Someone up high pounds the desk and says, we need to be using metrics, and then everybody runs around, makes numbers and says, I think I fulfilled the need. And you’re like, that’s not actually what we mean by metrics forward. So Insights Hub kind of evolved out of this smattering of different dashboards. It evolved out of these business intelligence tools. We now as part of the Insights Hub suite have kind of three core products. One of them is really focused on the quantitative that we were talking about before. So this mostly renders metrics.

It focuses on productivity almost exclusively. It includes things like … You can answer questions like, how long is the median build time at the company? How long do code reviews take at the company? We also have one called, we call it persona experience. So that’s capturing more of the qualitative and you can ask questions like, what’s the sentiment around internal tooling? How happy are developers with GitHub, for instance? And then the latest one that we’ve developed is really focused on these operational reviews. So we find that whether you do them synchronously or asynchronously, there needs to be some kind of place that you’re looking at the information regularly. So we call this operational insights and you can kind of now create priorities. You can associate metrics with priorities. You can do weekly check-ins and think of it as kind of this hybrid between Jira and Power BI, and, I don’t know if anyone’s ever heard of it, but Microsoft Viva Goals is a thing. It’s in this hybrid space, but we’re kind of using that now as a framework for these operational reviews.

Abi Noda: As I listened to your talk at DPE, it was clear to me you’re someone who’s gone way deeper into this problem space than most, and so I really just want to, for the rest of this conversation, go through different things I heard you bring up in your talk and kind of replay them here for listeners. One of the things you said early on in your talk, I can’t remember if you had a quote up on the slide or you just said this, but you said it’s really easy to count things. It’s hard to do it in a way that is actionable. So just wanted to ask you to share what’s kind of in your head when you share this.

Grant Jenks: Yeah. It turns out that we’ve been on this journey within the productivity org of trying to act and be much more product focused. It’s like we should be leveraging the same technologies and the same infra that LinkedIn as a product uses in our productivity tools. And that’s actually worked out quite well for us. It’s like, oh, the experimentation frameworks and the tracking frameworks and everything, we can reuse all this stuff. And it’s great because it focuses developers on the same set of tools and technologies, but one place where it kind of goes sideways is when people start looking at, well, business metrics. They’re like, I know how many members are on LinkedIn. The member number is super important to us. For, say, the code collaboration team to have a similar number where they’re like 4,000 or 5,000 or however many thousands of people made a merge yesterday or last week or whatever timeframe it is, you’re kind of like, well, what do we do with that?

If that number is dominated by hiring, how much of that is within your control? And if there’s only one solution, you can only use your platform to merge code, where is the competitive … You’re not capturing market share. There’s one option. So that’s where I think we’ve been on this journey of saying, well, at one point we counted things. We were like, well, it must be important because there’s a million deployments, and then we rethought that and we were like, well, from a productivity perspective, the number doesn’t matter. How long they take matters. If it turns out they all take 10 minutes, that sounds like a lot of time taken. We should devote more resources here. Let’s make that go away. So that’s what I was thinking in terms of you don’t want to get stuck in the wrong set of metrics where you’re like, oh, these business metrics are really important to product teams, so we’ll use them. You kind of need a different flavor of metrics for productivity.

Abi Noda: And it’s really interesting, your point around sort of measuring market share. I had this conversation with a platform PM at Spotify recently who was talking about what is our product as an internal developer tools team, because developers don’t really get to choose. They have to use our tool. So he had an interesting mindset. He said he thought of their product as providing a capability. That’s about as simple as I can put it. You talked about the challenge of discovering which metrics are useful or not for productivity. So what are some other examples you have, maybe ones that were surprising, where you went into it thinking, or folks went into it thinking, this is going to be a good thing for us to track or focus on, and you later realized maybe it’s not?

Grant Jenks: I think the classic one for us is what we used to call commit success rate, and probably now you would call that merge success rate. So how often do you make it through CI and you get a green, which for us means that you’ve now created a set of artifacts that could be deployed, and this metric was a source of constant debate and struggle because ultimately what we discovered is that actually you want to catch things in CI. That’s why it’s there. If developers are running everything themselves locally and they have a hundred percent kind of success rate in CI, they’re probably going too slow. So you want to find this point of maybe 80% or 70 to 80% where you’re like, most changes, yes. If a developer is making a change every day, maybe one of those days they’re like, oh, I didn’t think of this case.

Okay, let me fix that and get back through. So what we shifted to in the CI space was really this concept of CI reliability, and this was much harder to measure, so it was kind of controversial. Do we really want to make this big of an investment? And the trick for us is going through all the stages of CI, figuring out in each stage, is this something that the user customized or is this something that the infra team is responsible for? So if you’re like, okay, first step of CI is I have to git clone, I have to check out the code base. Well, the infra team is responsible for that. If that fails, that’s a big problem. That’s not on product teams, but if your tests fail, well that is on product teams. It’s like, well, we think everything ran fine. Now, there’s a lot of shades of gray there because sometimes tests fail because you ran out of disk space and you’re like, oh, did you need us to provision larger machines for you?

Maybe we should have dynamically figured that out or figured that out ahead of time. But that was a super worthwhile investment I think because we can now separate two things where we could go to teams and say, for some teams in particular, their success rates were very low. Half the time it’s not getting through, guys; something’s wrong with your tests. And now we could make that a definitive statement, like it’s on you. And then we could look at other cases where we’re like, oh my gosh, availability of our artifact store was terrible last week. We had a major incident and that failed a ton of CI jobs, and that’s on us. So I think that’s one place where it’s still surprising to me after all these years that a fairly similar shade of a metric is much more successful. It’s like, oh, that doesn’t seem like a huge distinction, but whoa, does it have a big impact on teams and people’s receptivity to it. And sometimes it’s like, oh, that actually also takes a lot of investment because it’s hard to do that out of the box. We have to go dedicate resources to that.
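
To make the distinction concrete, here is a minimal sketch of the kind of attribution Grant describes: classify each CI stage as infra-owned or user-owned, then report merge success separately from CI reliability. The stage names, ownership map, and sample runs are hypothetical, not LinkedIn's actual pipeline.

```python
# Sketch: separating infra-owned CI failures from user-owned ones.
# Stage names and the ownership map are illustrative assumptions.

STAGE_OWNER = {
    "checkout": "infra",           # cloning the repo is on the platform team
    "provision": "infra",          # machine / disk provisioning
    "build": "user",               # compile errors are on the product team
    "test": "user",                # failing tests are on the product team
    "publish_artifacts": "infra",  # artifact store availability
}

def summarize(runs):
    """runs: list of dicts like {"status": "green"} or
    {"status": "red", "failed_stage": "test"}."""
    total = len(runs)
    green = sum(1 for r in runs if r["status"] == "green")
    infra_failures = sum(
        1 for r in runs
        if r["status"] == "red" and STAGE_OWNER.get(r["failed_stage"]) == "infra"
    )
    return {
        # Merge success rate: how often a change goes green end to end.
        "merge_success_rate": green / total,
        # CI reliability: how often the infrastructure did its job,
        # regardless of whether the user's build or tests passed.
        "ci_reliability": 1 - infra_failures / total,
    }

runs = [
    {"status": "green"},
    {"status": "red", "failed_stage": "test"},               # on the product team
    {"status": "red", "failed_stage": "publish_artifacts"},  # on infra
    {"status": "green"},
]
print(summarize(runs))  # merge_success_rate=0.5, ci_reliability=0.75
```

The point is that the two numbers answer different questions: merge success rate can healthily sit around 70 to 80%, while CI reliability should stay close to 100%.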

Abi Noda: This reminds me of a conversation we had over dinner after the conference where we were talking about how, or you were talking about how, surprisingly difficult it is to sometimes answer seemingly simple questions in the business or with metrics. You have a bunch of great examples of that. Could you take us through one of those?

Grant Jenks: Yeah. So I think a good example that’s a little infamous in my world. There was a VP who asked one day, how many people work at LinkedIn? And I think he asked two different teams this question and he got back an answer a full week later. So his first surprise was like, why is this a hard question? I thought you’d just email me by the end of the day. Then the second thing that really surprised him is that he got different answers. He was like, whoa, your numbers aren’t off by five or 10. They’re off by five to 10%. What the heck is going on in these? And then the next thing that came was a document that described all these assumptions. So it turns out there are all these questions about if we’re measuring how many people work at LinkedIn, are you counting people who are sick?

Are you counting over what timeframe? Are you counting people who have left recently? Are you counting people on medical leave, on parental leave, on some other kind of leave? Are you counting people on vacation? Are you counting contractors? Are you counting vendors? Are you counting people transitioning from an acquisition that recently took place? And there’s so many questions there that we’ve actually started just using the term worker because we were like, you know what? When people ask the question, that’s what they ask. How many people work here? You’re like, okay, well then they’re workers. And there’s so many different kinds of legal definitions and HR definitions and privacy and technical definitions where you’re like, okay, well if you want to get really precise, I have to know what you’re asking about. And I think one place that actually came up for us recently is in the last few years, we needed licenses for some technology and someone was like, how many developers are there?

And right from the start I was like, I can’t answer that question. What do you need it for? They’re like, oh, I’m trying to buy licenses. I was like, okay, give us some time and we’ll come back to you within a day. Came back, gave them a number. They look at the number and they knew roughly the size of the engineering org and they’re like, these numbers aren’t even close. Why did you discount it by so much? I said, well, you asked a very specific question. It was like, people who actually interact with this kind of code or something, I said, we have very precise information about that. You don’t need to buy licenses for all the managers or all the people in data when they’re never going to use this. We have a much better answer for you. And so I think that’s one area that we actually deliver a ton of value, particularly around licensing costs. Being able to be really precise is a killer feature, but it can be surprising because people ask these questions and they’re like, sorry, why did that take you a week? What are you doing?

I think for instance, one more thing, and I hope this is a little shocking but doesn’t get me in trouble. We had a simple question, how long does it take to build code at LinkedIn? How long does a build take? It took us two quarters to really nail down a great answer to that, and there’s all these dimensions to it, there’s all this nuance to it. There’s a ton of discussion. We had to form working groups to define what is a build, and I feel really good about where we landed, but it’s the kind of thing that when I’d mention it to people, they’re like, how could it possibly take two quarters? You take the median of a bunch of numbers. Didn’t you learn that in school? And you’re like, well, it’s a little more complicated than that. Median, you’re just barely scratching the tip of the iceberg if that’s what you’re doing.

Abi Noda: The one that is constantly tripping me up in my world is the definition of a team, particularly for organizing these types of metrics and survey results. You go into an organization and say, hey, you’re going to need a list of teams so that you can break down your data. And of course the immediate reaction is always, of course, we have that in Workday or whatever system, and then you realize this is actually only the departments and the management hierarchy. This isn’t really anything about who’s actually working together on similar work. Then organizations, some of them have service catalogs that have these groups or teams of people who manage certain services, but typically that’s more around technical ownership and accountability again.

Grant Jenks: We get the same thing. It’s like, are you looking at ACLs, because that’s not a team, or are you looking at the management hierarchy, because that’s not necessarily a team. One of my favorite studies we did actually, so we took a whole bunch of interaction data where we could tell, for instance, I commented on your PR. Okay, so it looks like we had an interaction. So you take a bunch of this interaction data and you build a giant interaction matrix, and then you try and model some effect like, okay, we talked a month ago, so there should be some number, but a month from now the number should decay. So you build this model of this interaction matrix with this time aspect to it, and then you start asking really interesting questions like, okay, group Grant with everyone similar to him, they’ve all touched the same stuff, they all interact with the same things.

Or now imagine it like a graph and find me, are there any disjoint sets in here? Is there a threshold where you’re like, okay, there’s a bunch of groups and here’s the different groups and do they correspond to anything meaningful? So we did that and the results were fascinating. It was like, oh, that’s not at all what we expected. And for us, I don’t know what your experience has been, but teams typically are pretty small in terms of a tight-knit group of people working on something in a very focused collaborative way. It’s not a dozen. That’s pretty big. It’s much smaller than that in our experience.
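
As a rough illustration of the study Grant describes, the sketch below accumulates a time-decayed interaction matrix and then looks for disjoint groups in the resulting graph. The half-life, threshold, and sample interactions are invented for the example; LinkedIn's actual model is certainly more involved.

```python
import math
from collections import defaultdict, deque

HALF_LIFE_DAYS = 30.0   # assumed: an interaction a month old counts half as much
THRESHOLD = 0.5         # assumed: minimum accumulated weight to call it a real tie

def decayed_weight(days_ago):
    return 0.5 ** (days_ago / HALF_LIFE_DAYS)

def find_groups(interactions):
    """interactions: iterable of (person_a, person_b, days_ago)."""
    # Accumulate decayed weights into a symmetric "interaction matrix".
    weights = defaultdict(float)
    for a, b, days_ago in interactions:
        weights[frozenset((a, b))] += decayed_weight(days_ago)

    # Keep only ties above the threshold and build an adjacency list.
    adj = defaultdict(set)
    for pair, w in weights.items():
        if w >= THRESHOLD:
            a, b = tuple(pair)
            adj[a].add(b)
            adj[b].add(a)

    # Connected components = candidate "teams" (disjoint groups in the graph).
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            queue.extend(adj[node] - seen)
        groups.append(group)
    return groups

interactions = [
    ("grant", "max", 3),    # commented on a PR three days ago
    ("grant", "dan", 45),   # stale interaction, mostly decayed away
    ("abi", "lia", 1),
]
print(find_groups(interactions))  # e.g. [{'grant', 'max'}, {'abi', 'lia'}]
```

Thresholding the decayed weights is what turns "we talked once, months ago" into a non-edge, so the connected components reflect current, tight-knit collaboration rather than org-chart membership.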

Abi Noda: Yeah, same here. I mean, part of it has just been a multi-year journey of understanding how different organizations think about and define what a team is within their organization to then trying to rationalize a consolidated working definition. That’s something we’re still pontificating about more than we probably should. I want to move into something else you brought up in your talk that I thought was really interesting, which was that you talked about kind of a top down versus bottoms up approach to developing metrics. Share more with listeners about what you mean by that and what you’ve found works well in each case.

Grant Jenks: Yeah, so we use a framework called goals-signals-metrics. Goals are kind of like, what is your charter? What is your reason for existing? And then signals would fit into that as like, well, these are key indicators that you probably can’t measure. So the best way to think of signals is like if I gave you a magic wand, what is it you would want to know in terms of whether you’re achieving your goals? And then metrics are kind of the first order approximation of those signals. To me, that’s a very top-down kind of approach. You typically approach a director, you say, what’s your goal here working in this space? How do you know you’re successful? There’s the signals question. And then, how do you want to try and measure that? And typically that becomes a function of I want to ask these survey questions with this targeting and I want to combine that with some telemetry data that measures how long a workflow takes or what’s the success rate of trying to do this.
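
As a small illustration of the top-down flow Grant outlines, a goals-signals-metrics breakdown can be captured as simply as a structured record. The goal, signals, and metrics below are hypothetical examples, not LinkedIn's.

```python
# Hypothetical goals-signals-metrics breakdown for a CI team.
gsm = {
    "goal": "Developers can ship changes quickly and confidently",
    "signals": [
        "Developers rarely wait on CI",
        "CI failures point at real problems, not flaky infrastructure",
    ],
    "metrics": [
        {"name": "median CI duration", "source": "telemetry", "unit": "minutes"},
        {"name": "CI reliability", "source": "telemetry", "unit": "%"},
        {"name": "satisfaction with CI", "source": "quarterly survey", "unit": "1-5"},
    ],
}

for metric in gsm["metrics"]:
    print(f'{metric["name"]} ({metric["source"]}, {metric["unit"]})')
```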

When we go to developers and try and do something bottom up where we’re like, okay, why don’t you tell me what you think is important? Unfortunately, if they’re the engineers that built the system, they almost always point at the service metrics first. They’re like, availability, most important metric. And you’re like, well, that is a very important metric, but for productivity? That doesn’t tell me anything. Did you make developers faster? Did you make it so that they’re successful more often? They’re like, well, no, availability will not tell you that. You’re like, okay, well, can you tell me that? And I think this is where it’s actually really important to engage with people early on because unless you build a system where you can measure those things meaningfully, you get kind of stuck with the service metrics. You’re like, well, API latency is 25 milliseconds at the median.

You’re like, cool. I don’t know what the heck that measures, and it’s going to be pretty hard … If you’re linkedin.com, obviously they can sell that number. It’s like that’s a super important number and I’m sure we work very hard to get it as low as possible. But if you’re a productivity org … One of the tools we use now is Spotify’s Backstage, we have built on top of it, we have our own thing called Engineering Experience. We call it EngEx. If EngEx changes its latency from 200 milliseconds to a hundred milliseconds, I don’t think anyone will notice. If it goes from 100 to 50, it doesn’t matter. So kind of trying to synthesize things bottom up, you get into these Frankenstein-ish type metric descriptions where you’re like, well, this measures blank and we think blank is related to blah, but we didn’t bother to measure that. And you’re like, ah, okay, let’s make sure there’s buy-in from the top that these are the core things that you’re trying to enable or make happen or make faster or make better, and you will know for sure whether or not you did so.

Abi Noda: Another anti-pattern, I think, in terms of a bottoms up approach is a lot of organizations put … It’s kind of what you described at the very beginning of this discussion where you just get a whole bunch of dashboards or colored rectangles, you just kind of put them in front of teams and say, hey, here’s a bunch of metrics. And the assumption there is, hey, these metrics are going to help these teams be better and give them all this insight. And I think at least in my experience, the opposite happens where teams are kind of overwhelmed with all this data and all these colored rectangles and don’t really derive much utility necessarily out of the box. So I imagine you’ve kind of run into similar things.

Grant Jenks: One of the most important questions to ask teams is, let’s jump ahead. Imagine I’ve done all the work to give you metrics and feedback and everything. Are you really going to change your mind? If I come back and I can paint a picture that really says you should go down option B and you’re already pretty set on option A, can you make that decision at this point or are you just going to do option A? And you have to be sensitive how you have that conversation. I don’t phrase it as directly as that, but that’s certainly kind of the gut check is to be like, why are we doing all this work? If it’s not actionable at the end of the day or if you’re just doing it so that you can put it into your peer review feedback, I did my homework, we probably just shouldn’t do it. We have a long backlog.

Let’s skip this and come back at a time when we can really deliver value or you’re really uncertain about what you should do next. Because at LinkedIn scale in particular, and this is another thing that comes up quite a bit, there are plenty of teams that are always facing these questions, and if we were a smaller company, that would probably be pretty hard to come by. I can imagine if you have less than a hundred people, I’m not sure how big your investment should be in productivity, but certainly once it gets to the thousands and tens of thousands, you need that. That can have a huge impact on things. So that’s been kind of our experience. It’s also a negotiation. Sometimes like, no, I have to have this number, so we’re going to go make you a number. But yeah, numbers as products are not great products.

Abi Noda: Well, I want to talk about a few more technical topics with you. You brought up in your talk, I thought this was really … You talked about median, mean, P90, P70, and then if I recall, you brought up something called a winsorized mean as well. Take listeners through it: what’s your advice here? I mean, which of those is the one you should be measuring, or in what cases is one more useful than the other?

Grant Jenks: So I think we love means when they apply. There’s the catch: when they apply. And a mean works best when you have a normal distribution. So it’s actually very easy to figure out. You take your data, you kind of build a histogram. There are even algorithms that will tell you how close you are to a normal distribution. And if you’re not a normal distribution, you probably shouldn’t be using a mean. It’s just not the right use for it. A lot of what we measure turns out to be durations, and when you measure a duration, there’s always this long tail and it’s pretty thick. So this is a really well-studied problem in statistics that thick tails are not normal. That’s not the normal distribution. And it’s actually really interesting sometimes to chase those down. Developers will bring me a dashboard and be like, okay, Grant, we crunched all the numbers, we built the data pipeline, let’s do it.

And I always tell them, take 10 to 100 random samples from wherever and chase down every one of them so you feel confident that that’s a real number. And then look at the two ends and tell me at the lowest and at the highest what happened. Some of the most actionable things we’ve come up with have just been looking at the high end and being like, sorry, this build says it took two days. Is that true? What is happening? Or I actually was very guilty of this. I think I was in my first month at LinkedIn. We have 15,000 repositories here at LinkedIn, and I wrote a script on a Friday afternoon that had a list of all the repositories. I was like, I’m going to check out every single one of them and I’m going to try and build each of them over the weekend and I’ll be back on Monday.

Let’s see how far it gets. There’s an engineering manager in charge of the build platforms team, and he comes back, and I had been kind of clever, I had used, I think, GNU parallel. So I was trying to use all the cores on my machine and all the disk space. I actually ran out of disk space, so build times took a really long time. And the engineering manager came back, he started working, he had to report weekly or monthly how long the build times take, and he starts looking at the day, he goes, something is wrong. And then he gets really worried. He’s like, what if I’ve screwed something up or what if there’s an incident and I need to get on it. It takes him two days to figure out I should just throw away all the data points from this developer named G Jenks.

I don’t know what the heck he was doing, but he screwed everything up for reporting. I totally had. It wasn’t intentional, I had just joined the company. I learned later we actually have special switches that you can add to say, don’t emit metrics. I’m doing something that I know. We have another developer actually internally, his name’s Dan. If something looks funny in CI, nine times out of 10, our first question is, is Dan doing something? Can someone just go message Dan and ask, are you doing something? Should we throw away all his data points or is it across all this variety of machines? So there’s usual suspects, but getting back to your original question about means, when they work, they’re great. A lot of times they don’t work. And so that’s where we fall back to medians. I think medians, I actually love medians.

You get to throw away tons of data points. It’s like, I’m just going to sort everything and look in the middle. Same kind of for the P90. You’re like, oh, let’s just look up here. Moving a median becomes really hard, and this is where engineering managers and tool and service teams have come to us super frustrated because they’re like, we did a ton of work, Grant. For instance with, say, the front-end developers, we were able to get those 25 second builds in the browser down to three seconds, and the median time is way higher than that. So the median didn’t budge. We moved a whole section of builds down here over here, but the median is over here and we got no credit for it. Why aren’t you using means? That’s fair. It’s like, yeah, that is one of the drawbacks. So typically we solve that by segmenting or creating a cohort.

It’s like, oh, you should really be measuring it for this group of developers, but when we can’t do that, this is where winsorized mean comes in. So what winsorized mean does is it says, figure out your 99th percentile and instead of throwing away all the data points that are above the 99th percentile, clip them. So if your 99th percentile is like a hundred seconds and you have a data point that’s 110 seconds, you cross out 110 and you write a hundred, and now you calculate your mean. So this is actually, I think, a kinder way to make the data fit a better distribution. It does clip the tails, so things like standard deviations and stuff maybe don’t apply, but you’re also not throwing away data points. You’re like, look, there are a bunch of data points way up high and maybe we just can’t explain them, but we shouldn’t throw them away because that would also be disingenuous.

So that’s one of the techniques that we use in a bunch of our metric reporting that I think has worked really well. There’s one other area. If you measure something that is typically a fairly small number, so one of the ones I shared in the DPE Summit, we now actually measure how many logins did you go through in a day? We have a lot of different systems. We have single sign-on, but you still have to click through stuff. So how many logins did you go through in a day? And when we first did that, I think the median was like … We were working with a team that kind of manages identity at LinkedIn and they were like, hey, we’re going to work really hard, but I don’t think we’re going to go from two to one. Can you give us a more granular number?

And that was a great example for a winsorized mean. It was like, yeah, you do want an average. And it’s a little weird to say, well, you do 2.7 logins per day, but if you can move 2.7 to 2.6, then you can see you’re making an impact. And so that’s one of the areas that has been really critical for us, using these kinds of, I guess, more intermediate data science skills. They’re uncommon, at least in terms of what we see in the business intelligence space. So working those into our data pipelines has been great.
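
For readers who want to see the mechanics, here is a minimal sketch of the one-sided winsorized mean Grant describes: clip everything above the 99th percentile down to the 99th percentile, then take an ordinary mean. The synthetic build durations below are made up to mimic a thick-tailed distribution.

```python
import numpy as np

def winsorized_mean(values, upper_pct=99):
    """Clip values above the given percentile, then average."""
    values = np.asarray(values, dtype=float)
    cap = np.percentile(values, upper_pct)
    return np.clip(values, None, cap).mean()

# A duration-like distribution: mostly short builds with a thick long tail.
rng = np.random.default_rng(0)
builds = np.concatenate([
    rng.lognormal(mean=3.0, sigma=0.5, size=10_000),  # typical builds (seconds)
    rng.uniform(5_000, 50_000, size=20),              # a few pathological outliers
])

print("mean:           ", builds.mean())
print("median:         ", np.median(builds))
print("winsorized mean:", winsorized_mean(builds))
```

On data like this the winsorized mean sits between the median and the raw mean: it still moves when a whole cohort of builds gets faster, but a handful of unexplained multi-hour outliers no longer dominate it. SciPy also ships a winsorize helper in scipy.stats.mstats if you would rather not hand-roll the clipping.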

Abi Noda: Definitely uncommon. I don’t think I’ve ever come across a winsorized mean in my career. But I got really interested in it when I heard your talk. I want to ask you also … Something else you mentioned in your talk was about composite metrics. You talked about how especially leaders love having a single number, and I think sometimes you really do want or need a single number, but what’s been your experience with the pros and cons of composite metrics?

Grant Jenks: So I have a document that, I won’t say I’m famous for it, but I’m well known for a document titled What’s Wrong with Scores. And it came up actually because there is a score inside Insights Hub that features prominently. And I think by doing so we exacerbated the issue a little bit because people are like, look, they did it. They’re really good at this. We should all do it. And we were like, no, no, no, please follow our model, but not that one. And what people didn’t know was how much additional work went into that score, how much work went into calibrating it and measuring it and reflecting on it and refining it. It was a lot. So one of the key issues with these composite kind of metrics is you add this indirection, you’re like, okay, the number is eight, and you think now that you’ve reduced the amount of information, but you haven’t actually. You added to it.

It’s the classic: there are 10 standards, which is really frustrating, so let’s make a new one. Now there are 11 standards. It’s like, well, you had five numbers, you built them into one new number and now you have six numbers. It’s like, you’re going the wrong direction. Like I said, it’s hard to figure out how you combine them. You’re averaging all these different numbers. They usually don’t work that way. Some of them will be experience related and some of them will be durations and you ask, well, how do you combine those two? And even once you do combine them, what are the units of that? I don’t want to have to go to a VP and say, we took our score from 45 to 50, and they’ll be like, out of what? Are you measuring hours? Are you measuring success rates? 45 what?

And that’s where they feel a little meaningless. It’s often been hard for us to get stakeholder buy-in so the team feels confident, this score is really what we want, and then the stakeholder’s like, yeah, 12 means nothing to me. Can you just make this workflow better? I don’t know what that number is, but I don’t care. They’re pretty expensive to develop. Like I said, the amount of calibration that goes into them and aggregating things, they can actually increase noise. If you have five metrics and they all have some amount of error, sometimes that error all adds up and you’re like, oh, you’ve actually worsened your ability to reason about what happens. Fluctuations are also interesting, where it’s like one number goes up in the score and another number comes down, and so the score doesn’t change, and you’re like, we went from 50 to 50.

You’re like, well, so you didn’t do anything last quarter? They’re like, actually, we worked really hard. You’re like, well, it’s hard to tell from the number that you’re sharing. People will start to game them too. Like I said with fluctuations, you’re like, oh, this number is really easy to move. Let’s just move that number. Well, yeah, but that number is not the most important. It’s like, well, it rolls up into the score. No one will know. And then the last thing is kind of this difficulty in comparison, so you’re like, last quarter the number was 45. This quarter the number is 50. You have a new tech lead come in or you add a new product and you’re now rolling up a new metric into your score. Is it really fair to keep comparing it to two quarters ago? It doesn’t carry over like that. You need to recalibrate things.

And so we’ve been pretty thoughtful about that. In Insights Hub for instance, we have something that we call the Experience Index. That’s like our score. But we don’t track it over time. We reserve the right to change the aggregation and the weightings behind it at any time. We tell people, don’t ever put this in an OKR. You should put a meaningful number, like a tangible number. This takes 20 minutes, we’re going to get it down to 15. This succeeds 50% of the time, we’re going to get it up to 70. It’s like, oh, that is super tangible. But if you take an EI, Experience Index, and you’re like, we’re going to go from 3.5 to 3.7, we really discourage people from doing so.

Abi Noda: I largely agree with your advice. It’s similar to my point of view. The one place where I’m starting to change my mind a little bit is qualitative metrics, because we talked earlier on this show about how the qualitative sentiment measures can kind of fluctuate based on all kinds of extraneous factors, and that’s a challenge I’ve seen for organizations that really want to use the self-reported developer experience as one of their north stars. And so one approach we’ve found is similar to a mutual fund or stock index: by having some diversification in the index and making it up of a handful of different measures, we’ve been able to keep the composite score pretty simple. No fancy weighting or calibration, although you can make the argument that that would make the number more meaningful. So you lose some fidelity for simplicity, and that’s interesting, but ultimately it does give you a number that’s a little more stable overall.

Grant Jenks: We have lots of experience with that. I think one thing that we’ve done in the qualitative space that I really appreciated is we present error bars actually. So we will tell people it’s a four out of five, but it’s like plus or minus 0.2. And if we could get more people to answer, we think we could get that down. It becomes really interesting actually too, if you create a feedback loop around that where, not super good with statistics off the top of my head, but I think if you get a hundred people who respond and they all give a score of one out of five, then you’re very confident that, well, that’s the score. We can stop surveying people now because we can tell. If you get a thousand people and 200 said one out of five, 200 said two out of five, 200 said three, and it’s like the whole range, then you’re like, well, I guess it’s 2.5, but it’s like plus or minus one.

There’s a very broad opinion set here. So that’s been helpful for us. We’ve experimented with smoothing too where one team, they did a trailing window, so they’re like, we’re going to look at this, but over a 90-day trailing window. That’s how far they had to go out to smooth the number down. It became so difficult for them to understand, okay, well week over week it changed by this. It’s like, okay, good luck trying to do that data analysis, find all the data points that were in there a week ago that weren’t in there two weeks ago, differentiate them in some way and then try and extrapolate why that contributes to 10%. And I think that team just got really bogged down trying to do that analysis, and we ultimately said, yeah, we’re not going to invest in doing it, it’s too complicated.
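
A minimal sketch of the error bars Grant mentions, using a normal-approximation 95% confidence interval around the mean score; the response data here is invented to mirror his two scenarios.

```python
import math

def score_with_error_bar(responses):
    """Return (mean score, 95% margin of error) for 1-5 survey responses."""
    n = len(responses)
    mean = sum(responses) / n
    variance = sum((x - mean) ** 2 for x in responses) / (n - 1)
    std_err = math.sqrt(variance / n)
    margin = 1.96 * std_err  # ~95% confidence, normal approximation
    return mean, margin

# 100 people all answering 1/5: a tight interval, no need to keep surveying.
print(score_with_error_bar([1] * 100))                       # (1.0, 0.0)

# 50 people spread evenly across 1-5: the mean alone hides a wide +/- band.
spread = [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10 + [5] * 10
print(score_with_error_bar(spread))                          # ~(3.0, 0.40)
```

The margin shrinks roughly with the square root of the number of responses, which is why "get more people to answer" is what tightens the bar.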

Abi Noda: I know one of the things you’re doing at work right now is actively exploring ways to use AI as part of all the work you’re doing. What have you explored? How are you thinking about leveraging these tools?

Grant Jenks: I think that there’s two different ways to kind of talk about that. One is looking at how engineers can benefit from AI. So we have things like the large language model playground type thing, and we have chat kind of things where engineers can go and get information and do retrieval augmented generation and things. What’s really interesting to me though in that question is, as a practitioner of productivity, as somebody working in this space, how does AI change my experience or how does AI change how I want to build products? So one area where we’ve been thinking about that a lot is when we look at the quantitative side, the dashboards and charts probably handle, I don’t know, we hope that they handle 50 to 90% of all questions. They should cover the majority case, but that leaves a pretty large band, maybe 10 to 50%, where an engineer now has to go write some SQL.

And so we’re kind of thinking, I wonder how data analytics will be transformed in this space. And what we really want to do is just offer engineers or offer leaders a way to say, do you want to chat with your productivity data? Why don’t you play with it and then come back and tell us what you learned? And if we need to go build a new dashboard, this could be a great way to do so. I think also in the survey space, what would it look like for a survey to feel more like an interview? We do interviews sometimes and they’re pretty expensive. So if some kind of agent-based mechanism could get in there and say, hey, Grant, I just want to check in at the end of the day, how was your experience? It feels awkward because it’s kind of like, well, that’s what engineering managers do.

We don’t want to replace them or compete with them. It seems like that would be a good experience for people as opposed to, here’s 50 questions one out of five, can you read through all of them and answer them for me? And then there’s a number of really hard problems there where you’re like, how do I do recursive summarization of feedback and do so in a way where when I produce the final summary, I can answer questions like, okay, what percentage of the feedback is that sentiment? And so you need to do some kind of analytics or bucketing alongside the summarization, or you get to the summary and somebody will say like, okay, I see that that got called out. Can you show me five examples where somebody said that? And so now you need to cite your sources. So somehow in this summarization process, we have to track all that and have it working.

Those are some of the ways that I think it’ll be really transformative for us working in productivity insights and analytics. We’re also, as kind of the stewards of productivity data, just looking to make it more generally available. So we were talking just this past month about, okay, we should really have a strong embeddings type model for code review data, and we should just make it accessible to whoever needs it because we feel like people are going to build things and that will become the new Legos that they need. And if they don’t have that, they just can’t get as far. So I think those are some of the ways that we’re thinking in terms of data and experiences. We’re thinking about infrastructure and core platform technologies, but also in terms of like, oh, this will really change how we do things. It’ll feel maybe more like a conversation, but we don’t know how far that can really go at this point.

Abi Noda: Yeah, I know we’ve talked about different use cases. We’re in the midst of working on data summarization, which as you and I have talked about, is a far more complex problem than it seems at the outset. So lots of learnings there, and one of our engineers is working on a blog post to kind of recap them. So we’ll be excited to share it with you. But Grant, thanks so much for coming on the show. Really enjoyed getting pretty technical on topics like winsorized means, composite scores, and also talking about more basic problems like how many engineers work here. Thanks so much for your time. Really enjoyed this conversation.

Grant Jenks: Thanks so much, Abi. It’s been great chatting today.

Abi Noda: Thank you so much for listening to this week’s episode. As always, you can find detailed show notes and other content at our website getdx.com. If you enjoyed this episode, please subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Please also consider rating our show since this helps more listeners discover our podcast. Thanks again, and I’ll see you in the next episode.