Podcast

Measuring and rolling out AI coding assistants

This week's guest is Eirini Kalliamvakou, a staff researcher at GitHub focused on AI and developer experience. Eirini sits at the forefront of research into GitHub Copilot. Abi and Eirini discuss recent research on how AI coding assistants impact developer productivity. They talk about how leaders should build business cases for AI tools, and they preview what's to come with AI tools and the implications for how developer productivity is measured.

Timestamps

  • (1:49) Overview of GitHub’s research on AI
  • (2:59) The research study on Copilot
  • (4:48) Defining and measuring productivity for this study
  • (7:44) Exact measures and factors studied
  • (8:16) Key findings from the study
  • (9:45) How the study was conducted
  • (11:17) Most surprising findings for the researchers
  • (14:01) The motivation for conducting a follow-up study
  • (15:34) How the follow-up study was conducted
  • (18:42) Findings from the follow-up study
  • (21:13) Is AI just hype?
  • (26:34) How to begin advocating for AI tools
  • (34:44) How to translate data into dollars
  • (37:06) How to roll out AI tools to an organization
  • (38:47) The impact of AI on developer experience
  • (43:24) Implications of AI on how we measure productivity

Listen to this episode on Spotify, Apple Podcasts, Pocket Casts, Overcast, or wherever you listen to podcasts.

Transcript

Abi Noda: Eirini, it’s so great to have you on the show. Thanks so much for your time.

Eirini Kalliamvakou: Thank you for having me. It is really exciting to be here.

Abi Noda: I emailed you a few weeks ago because I’d just gotten back from a conference where I was talking with lots of leaders who, to no surprise, were talking about GitHub Copilot. And one thing that kept coming up was that they were struggling to find ways to measure it, tell a story around it, and justify it. So I emailed you because I thought, who better to talk to than you? So here we are. To start, you’ve been at the forefront of the AI developer tool revolution, if I can call it that. Share with listeners a high-level overview of what your research focus has been for the past year and a half. Then, we’ll dive into some of the specific studies.

Eirini Kalliamvakou: The last year and a half has been wild in terms of research. A lot of it has been focused on AI and how AI gets adopted, and more specifically on GitHub Copilot; we did a lot of evaluations. So, my research was focused on evaluating how Copilot is used by developers, teams, and organizations and what its effects are. This was both to help us make decisions about how we develop and build the product and where we direct our investments, but also a way for us to broadcast outward, because AI is a change, and with change come questions about what’s in it for me. So, we wanted to broadcast outward what people can expect as they adopt AI tools in terms of effects and benefits.

Abi Noda: One of the most visible studies that I think has been done on AI developer tools is the study on Copilot that you co-published last year. So, I want to start by doing a deep dive into that study. To begin with, let’s go back to that moment when you began that study. What was the context and the biggest questions you were aiming to answer with that study?

Eirini Kalliamvakou: The study that we’re talking about involves a large-scale survey. This was when GitHub Copilot was still in technical preview, so before June 2022. We had hundreds of thousands of people in the technical preview trying out and using Copilot daily, and what better opportunity to understand how Copilot is used and what its effects are than having that kind of large population? So, the motivation was for us to understand people’s experience with GitHub Copilot, and we designed this large-scale survey as our means of understanding it.

Before the large-scale survey, we also did a round of interviews, which is one of my favorite things to do when I design a survey: interview first so I know what we’re even measuring. So, the motivation was to see what the experience was. Then, we realized that we could use the survey results in a couple of ways. One was to understand the productivity effects, and I’m using productivity here in the most holistic way possible; I know you know a lot about that as well. So, we used the SPACE Framework to operationalize productivity. But we also correlated what we were getting from the survey responses with a lot of telemetry that we had from people using Copilot in the technical preview.

Abi Noda: For listeners whose curiosity was piqued by the aim of this study, measuring the impact on productivity, take us through the discussion and the thinking around how you defined and measured productivity for this study. Some people might immediately think, “Oh, you correlated the use of Copilot to the number of pull requests or something like that.” So, take listeners through how you approached this problem.

Eirini Kalliamvakou: Yes. It’s one of my favorite things to do. Okay. So, there were a couple of things that we needed to figure out from a research perspective. Right? One was how do we capture productivity through our survey questions, right? That meant that we had to define it and create questions that were able to capture it through the survey. I mentioned that we use the SPACE Framework. The SPACE Framework has these five dimensions of productivity. It does include activity, so how much of something people are doing and the rate at which they’re doing it. But it also includes things like satisfaction, efficiency, and flow that are equally important and sometimes overlooked.

So, we designed the survey with the SPACE Framework in mind, which means we translated all of the different dimensions of the SPACE Framework into questions. That was on the one side, right? How do we get the information through the survey?

On the other side, we had a lot of telemetry, and because we were talking about AI and GitHub Copilot specifically, a lot of it had to do with Copilot’s acceptance rates. So, Copilot, just as a top-level description, generates and suggests code to developers as they are typing. We wanted to see how the acceptance rate of Copilot correlates with the self-reported productivity we saw in the survey.

There was also a self-serving reason why we did this. At the time, we had something like 11 or 12 variations of the acceptance metric that we kept. And you know this as well as I do: setting up telemetry and instrumentation and maintaining it over time to get all this information is extremely costly. So, we were also looking to condense. We were thinking, “It’s unlikely that all these 11 or 12 metrics of acceptance rate are equally informative and equally useful, so can we condense them?” Through the correlation, not only were we able to show that the self-reported effects people get from Copilot correlate with how they’re using the tool in terms of acceptance rate, but it also helped us move away from maintaining 11 versions of acceptance rate to the one or two that were the most helpful.
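To make the acceptance-rate idea concrete, here is a minimal sketch of how a metric like this could be computed from suggestion telemetry. The event shapes and field names are hypothetical, invented purely for illustration, not GitHub's actual instrumentation; the study tracked roughly a dozen variations of this idea:

```javascript
// Hypothetical suggestion telemetry; event and field names are
// illustrative only, not GitHub's actual instrumentation.
const events = [
  { type: "shown", id: "s1", chars: 120 },
  { type: "accepted", id: "s1", chars: 120 },
  { type: "shown", id: "s2", chars: 45 },
  { type: "shown", id: "s3", chars: 80 },
  { type: "accepted", id: "s3", chars: 80 },
];

// One simple variant: accepted suggestions / shown suggestions.
function acceptanceRate(events) {
  const count = (type) => events.filter((e) => e.type === type).length;
  const shown = count("shown");
  return shown === 0 ? 0 : count("accepted") / shown;
}

// Another variant weights each suggestion by its size in characters.
function charWeightedAcceptanceRate(events) {
  const chars = (type) =>
    events
      .filter((e) => e.type === type)
      .reduce((total, e) => total + e.chars, 0);
  const shownChars = chars("shown");
  return shownChars === 0 ? 0 : chars("accepted") / shownChars;
}

console.log(acceptanceRate(events)); // ≈ 0.667 (2 of 3 suggestions accepted)
console.log(charWeightedAcceptanceRate(events)); // ≈ 0.816 (200 of 245 chars)
```

Correlating rates like these with the survey's self-reported productivity scores is what allowed the team to collapse the 11 or 12 variants down to the one or two that carried the most signal.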

Abi Noda: That’s interesting. I want to list off, just for listeners, some of the types of measures and factors that you studied. You had things like perceived productivity, which was the statement, “I am more productive.” For satisfaction and well-being, things around frustration levels when coding, fulfillment in the job, and satisfaction with work. For efficiency and flow, you had faster completion, speed of completing repetitive tasks, feeling more in the flow, and less time searching. I’m sure we’ll talk about some of these things more in-depth later, but take listeners through how you conducted the study. What were some key findings?

Eirini Kalliamvakou: One of the key findings, which I just mentioned, was that the acceptance rate, people using the tool in this productive mode of accepting the code that is suggested, is strongly correlated with all these aspects of productivity that you mentioned, right? Not just the top-level statement of, “I feel more productive when I’m using Copilot,” but also all these other parts you mentioned.

The other key takeaway that kicked off a lot of interest afterward in looking into things a little bit more was how high the percentage of respondents was that mentioned getting benefits in terms of being able to stay more in the flow when they’re using Copilot, being more fulfilled with their job, and being able to conserve mental energy as they’re working on repetitive tasks. And, of course, the all-time favorite, or the thing that we started from, was that people feel that they’re completing tasks faster with Copilot. I would say that one of the key things we saw was that in addition to the speed gains people were both feeling and seeming to get from how they were using the tool, we also saw benefits in all these other ways that contribute to productivity, like flow, fulfillment, satisfaction, and so on.

Abi Noda: I have a question about the methodology of the survey itself. Share with listeners when this survey was presented to the study participants. For example, was this more of a relational survey where you asked participants to reflect on their prior use of Copilot? Or was this more of an in-the-moment, journal-, or diary-type study, like you’ve run in the past, that asked developers right in the moment or right after they used the tool to reflect?

Eirini Kalliamvakou: I would have loved to do it the in-the-moment way, but there are always tradeoffs that we need to think about, right? We’re trying to preserve people’s flow, so interrupting them in the moment as they’re using the tool is always a tradeoff we must keep in mind. We asked people to reflect on their use of Copilot up to that point.

And we had information telling us how long they had been using the tool, and we weighted things accordingly. We didn’t ask them to estimate that for us, but we asked them to reflect on their experience with Copilot thus far. This was an email that we sent out, with the survey, to folks who were in the technical preview. The survey was designed to be around five minutes, and we also saw from the completion data that that’s about as long as it took.

Abi Noda: For listeners who are more interested in the study’s findings, we’ll put links in the show notes; they can go to the article you published on the GitHub blog or the original paper itself. The last question I want to ask you about this study is: what was most surprising to you, and what were the reactions of, or what was surprising to, the other researchers on your team?

Eirini Kalliamvakou: It was something I was expecting, yet it was surprising when I saw the actual percentages. So, I was expecting it as part of our hypothesis, right? That’s why we designed the survey in a particular way. I expected people to respond that when they use Copilot, for example, they spend less time searching, stay more in the flow, or are more fulfilled with their jobs. Because I was theory-informed, I put that in the survey and expected it to come back as a result.

What I was not expecting was the rate of responses, right? So, I found the high percentages pretty surprising when people were reporting these benefits. When I was delivering and presenting the survey results to people in our product organization at the time, one of the things they consistently commented on was that we had 60-something percent of people saying that when they’re using Copilot, they feel more fulfilled with their job.

The comment I kept getting from them was, “So are you telling me that this is not just a tool that people use to make them go faster? This also makes them feel better about the work they’re doing and the job they’re in.” And I think this reaction also generalized to the other researchers on my team.

I am currently working with a team that has folks who are very strong quantitative researchers. They were the ones who had done all the instrumentation of the telemetry that I mentioned, so a lot of how they think about evidence is based specifically on what we can track, right? What our logs say. And so they also found it very surprising to see that there are benefits in this tool that they had never thought about, even though they were the people making the tool. They intended it to be a productivity tool; they had just not thought about all these other ways that it could help people feel and be productive.

Abi Noda: What you’re describing is so interesting because it mirrors the same dichotomy in the industry when we try to understand developer productivity. This dichotomy is between looking at it purely from a quantitative and task- or activity-oriented view versus thinking about the human aspects of fulfillment, satisfaction, etc. It is so interesting to hear that that sort of divide also exists within the research community.

I want to move on to the second study you ran, which I find fascinating. But before we describe the study, I’d love to understand: you’d done this first survey study, then you opted to do this other study. Why do another study? What was the motivation, and how did the idea for it take shape?

Eirini Kalliamvakou: So one of the things that came up in the first study, as we were looking at the percentages, was something that caught everyone’s eye: we had 96% of people say they complete repetitive tasks faster when working with Copilot. And we thought the natural next step was to measure how much faster we are talking about, which is something that we cannot get from a survey.

We can ask people in a survey, “How much faster do you think this tool is making you?” But we also know that humans are not good at estimating how much. They’re good at telling you that they think there’s something here that they feel is helping them be faster, but not how much. So, the second study was specifically focused on creating the conditions of having people work on somewhat repetitive tasks and understanding what kind of speed gain they get from using Copilot.

Abi Noda: And what you’re referring to, of course, is described in your paper as a controlled experiment, which is a kind of research parlance. Can you explain to us what a controlled experiment is and how it is distinct from other types of studies that may be related?

Eirini Kalliamvakou: A controlled experiment, in stricter research terms, means that you’re trying to keep everything else equal between two conditions and change only one thing. “Controlled” refers to having controlled for every other factor except the one thing whose effect you’re trying to study. It means that you have what we call a treatment group, where you have people do the thing you’re studying, and a control group that doesn’t, and you contrast the two. That’s, at a high level, what a controlled experiment is.

So, in this case, what it turned into for us was that we had 95 professional developers, and we wanted to see how fast developers complete a task with and without Copilot. That meant that we split the 95 developers randomly into two groups and gave them the same task, which is part of controlling the conditions. The task was writing an HTTP server in JavaScript, which is somewhat repetitive. One of the two groups used Copilot to complete the task, and the other one did not. And we measured how long it took, on average, for each group to complete the task.
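For a sense of what the task involved, here is a minimal sketch of an HTTP server in Node.js. This is illustrative only, not the actual task specification from the study:

```javascript
// A minimal Node.js HTTP server, illustrative of the kind of
// boilerplate the study's task involved. Routes and responses
// here are made up; the study defined its own specification.
const http = require("http");

const server = http.createServer((req, res) => {
  if (req.method === "GET" && req.url === "/") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ message: "hello" }));
  } else {
    res.writeHead(404, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: "not found" }));
  }
});

server.listen(3000, () => {
  console.log("Listening on http://localhost:3000");
});
```

It is exactly this kind of code, common enough to be familiar, but detailed enough that few people remember every line, that fits the "repetitive" criterion Eirini describes a little later.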

Abi Noda: I want to ask you how you chose the task 'cause I can just imagine there must have been so much discussion and debate about the task. So, how did you arrive at writing an HTTP server in JavaScript, and what were some of the alternatives you considered?

Eirini Kalliamvakou: Honestly, I’ve blocked it out now because it just took so long. I was not expecting it. I’ve done controlled experiments before, but this was the first time that choosing the task took so long. What was surfacing from the survey and the feedback that we had heard in the interviews and so on, and that was our hypothesis as well, was that Copilot seemed to be extra helpful when it came to boilerplate, repetitive types of tasks.

We were looking for a task that happens often enough that you need to do the same thing again and again, but not so often that you remember every single detail about it, right? And so we brainstormed with the team, went back and forth, and we landed on that. It took us quite a while; I think it took us over a month and a half just to come up with the task. I think it is a good representation. It is not a universal representation of repetitive tasks, boilerplate, or development tasks in general. It was very much formulated specifically for that particular study, but I think it was a good choice in the end. Our criterion was to define what we mean by repetitive and find a task that falls in that category.

Abi Noda: So then, what were the findings, and how much faster or slower were developers who used Copilot?

Eirini Kalliamvakou: Yes. So, we looked at two things. One was whether there was a difference between the two groups, the one that was using Copilot and the one that wasn’t, in completing the task at all. And we saw a little bit of a difference. We call that the success rate for the task: in the group that used Copilot, 78% fully completed the task versus 70% in the group that didn’t use Copilot. Then, for the completion time itself, we found that the group that used Copilot completed the task 55% faster. That means the group using Copilot took less than half the time it took the other group to complete the same task.
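To line up the two phrasings: “55% faster” here means the Copilot group’s completion time was 55% lower, so

$$t_{\text{Copilot}} = (1 - 0.55)\,t_{\text{control}} = 0.45\,t_{\text{control}} < \tfrac{1}{2}\,t_{\text{control}},$$

which is why “less than half the time” follows.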

It’s one of those results that was picked up everywhere, and I can see why. And it’s the sort of result that not only sounds good, but also kicks off a lot of other thinking, in our team, in terms of what it could mean when you start aggregating it, potentially, to not just one developer but a team, an organization, an economy, and so on. Honestly, that is the fascinating part of doing that research: it kicked off thinking afterward about the implications of having that sort of productivity gain.

This is a research project on which we also partnered with the Microsoft Office of the Chief Economist and MIT. That allowed us to do some finer-grained analysis from an economics perspective: given the demographics we had collected, were there groups within our treatment group, the group that used Copilot to complete the task, that were benefiting more or less? As a result I also found interesting, we found that less experienced programmers benefited more, which means that they were relatively even faster than the rest of their cohort in the group using Copilot. Again, it’s food for thought, which I feel is even more interesting given the statistics that came out of this project.

Abi Noda: I found your point about thinking about the implications of this at the organizational level to be interesting. It’s a really good segue into the next part of this conversation, where I want to talk about the practical implications of this. As I mentioned at the beginning of the episode, I just came back from this conference. Although this research exists, leaders still struggle to make these types of calculations or justifications within their own companies. And so I want to share insight and advice with these leaders.

To start with, one of the things I’ve observed as I’ve spoken to people across the industry is a dose of skepticism about AI tools. We’re all inundated with the press and the hype around them. I was talking to a leader recently, a head of developer productivity at a large organization, who thinks all this AI stuff is a bunch of hype and that the only thing that’s impactful is ChatGPT. So, I’d be curious to get your perspective on the degree to which you agree or disagree with this, and your view on the existing skepticism.

Eirini Kalliamvakou: So, I’m one of those people who, first of all, think that skepticism is healthy. I don’t know if it’s my training as a researcher or what it is, but I feel that it’s healthy to have a degree of skepticism and think things through. I will also say that it’s natural that the beginning of anything big and new feels like it’s all hype, right? And then one of the challenges is how long it takes us to see whether it is hype or something more permanent.

I would like to contextualize a little bit the statement that you heard. First of all, ChatGPT is one of those things that has captured everybody’s imagination and attention, right? It became so pervasive and a general-purpose tool that I would be surprised and shocked if somebody didn’t mention it as the AI tool or the one they find helpful. But it’s a tool in the sense that it needs to match the purpose, right?

So ChatGPT is one of those tools that is general purpose; it is conversational. It is also, we found, wrong a lot of the time, right? Because it’s not a source of truth; it predicts what comes next. And that makes it a good fit for some cases but not for everything. One of the things I hear most often, and I’ve tried it out for myself this way, is that it’s really good as a brainstorming tool, where accuracy matters less, because if I’m trying to brainstorm something and somebody says something incorrect or inaccurate, that’s still food for thought for me. It will likely move my thinking forward. And that’s what you want in brainstorming.

But for more specialized cases, of which coding is one, I just don’t see it. And that’s also not what we hear in terms of feedback, that ChatGPT is it. I will also say that we’re talking about code completion today, right? That’s not where things are going to stay. This is something that we’re discussing at GitHub and in the Innovation Team that I’m on: as developers want to use AI to do more complex work and complete bigger tasks, chat interaction is not even it. So, never mind ChatGPT as that particular tool; chat in general is not necessarily going to be how you complete larger chunks of work.

To the point we were making earlier, at the beginning of the hype we can’t yet tell what something new is, and not everybody is ready at the same time for a big change. A great book that I read, I think, last year was Crossing the Chasm by Geoffrey Moore. It says on the cover that it is a marketing book, but personally, I found it an exceptional book that talks about how innovations get adopted. It has this incredible distribution chart that shows what we already talk about a lot of the time: that innovations diffuse, and people adopt them, at different times. So you have the innovators and the early adopters, but then you also have the pragmatists, the conservatives, and the skeptics.

So it could be that the person you talked to is in a category that will be convinced later, or perhaps will never be convinced. Not everybody is ready for change at the same time. However, I would urge leaders to look at this particular moment we’re in, with AI getting adopted as quickly as it is right now, from the perspective of the bigger picture. If this were to be something that ends up not being hype, which we very much believe, where would you like to see yourself in this picture of adopting the innovation? Especially given how others around you, for example your competitors, are adopting innovations. I think that complicates decisions a little bit, but it’s good food for thought for leaders.

Abi Noda: Let’s now put ourselves in the shoes of a developer productivity leader or any engineering leader who is excited about GitHub Copilot or similar tools and is interested in bringing this type of tool into their organization. And we’ve talked about the skepticism that probably exists within the organization. Later on, I want to talk about some fears that may exist in parts of the organization. But in terms of actually getting something like this started at an organization, in your view, what’s the best way to try and begin advocating for a tool like this?

Eirini Kalliamvakou: I would say we at GitHub, and especially our whole product organization, talk a lot with large enterprises that want to run trials of Copilot and roll out adoption. So there are a couple of things to think about. One is, methodologically speaking, how you go about it. Another is how to set expectations, and a lot of the time, leaders might only look at the first one, right? They bring in the right methodologies and expertise to do the trial, but don’t necessarily think about how to set expectations.

So I think one of the things that I would say upfront is that I hear people talk about the return on investment, the benefit they can expect, and so on as if it’s one single metric or one universal concept, and it is not, very much like any other developer experience initiative or digital transformation initiative.

As a leader and as an organization, you have something you’re trying to improve that led you to consider adopting a particular tool. And without answering the question of what it is you’re looking to improve, it’s very hard to find the right metrics to tell you whether this provides you with enough benefit. So I would set that as an expectation: a leader and an organization will need to first define what they’re looking to improve that brought them to potentially adopting Copilot.

And then, if we start thinking about metrics and methods and how to go about it, I would say there’s probably also no single metric you can go after when you think about improvements in productivity and performance. We mentioned the survey earlier, right? So, the survey that we use to assess the productivity improvements with Copilot. This is a survey that we now give as a raw survey to customers, and we say, “Before you do your trial, you start with this survey, and then you start rolling out Copilot, and then you go back, and you rerun the survey a few months after that.”

And since we’re on methods, for me, the best return on investment on methods will be with something like a developer survey. I’m aware of, and have used, other methods, like the controlled experiment we just talked about or telemetry analysis and large-scale instrumentation. Those are considerably more expensive efforts, expensive in time, money, and expertise. And I know you’re doing good work to spread the word that that’s the case. Still, I don’t know that leaders, when they’re asking for objective, robust data and telemetry analysis, are necessarily aware of the full effort that involves.

So, for methods, I would say developer surveys are probably the best tool, and they’re quickly becoming best practice. In terms of expectations, I would say start with what you want to improve and with the expectation that there’s not going to be a single metric that is going to get you there, either.
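As a rough sketch of what that baseline-and-follow-up comparison might look like in practice, here the question keys and the 1-5 Likert responses are hypothetical, invented purely for illustration:

```javascript
// Hypothetical 1-5 Likert responses to the same questions, asked
// once at baseline (before rollout) and again at follow-up.
// Question keys and numbers are invented for illustration.
const baseline = {
  "I am more productive": [3, 2, 4, 3, 3],
  "I stay more in the flow": [2, 3, 3, 2, 4],
};
const followUp = {
  "I am more productive": [4, 4, 5, 3, 4],
  "I stay more in the flow": [3, 4, 4, 3, 5],
};

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Report the shift in mean score per question between the rounds.
for (const question of Object.keys(baseline)) {
  const before = mean(baseline[question]);
  const after = mean(followUp[question]);
  console.log(
    `${question}: ${before.toFixed(2)} -> ${after.toFixed(2)} ` +
      `(delta ${(after - before).toFixed(2)})`
  );
}
```

Asking the same questions every round is what makes these deltas meaningful; Eirini returns to that point below.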

Abi Noda: I think this advice of using a baseline survey and a follow-up survey during an evaluation of a tool is really specific and actionable for folks I’ve talked to. Do you have any advice on actually doing that? I’m sure you’ve seen examples of this in practice, as far as timing and talking about the results with stakeholders. Any concrete advice you’d have on how to do this?

Eirini Kalliamvakou: So, before I do, I probably need to say a couple more things about setting another expectation, which is how long this will take. One of the things that I hear a lot is, “We’re going to try this for a month or two and see how it goes,” and that’s just not enough time. AI in general, I think, and Copilot specifically take some getting used to. And it’s not so much that as that it takes some effort and some training, which is one actionable piece of advice that I would give: when you start rolling out a tool like this, you have to prepare the ground in terms of educating and training users, because it takes a little effort to use Copilot to its full potential, which is what you want to see if you’re assessing whether this is a good tool for you.

And that’s not going to happen overnight. So, a piece of advice: some patience is needed here regarding when you introduce it and how long you wait to see a result. You asked about the timing. I have seen examples. One example that is top of mind for me, which I saw recently, and I would say it was the most thorough evaluation of Copilot I’ve seen at any of our customers, is a tech company in England. I talked with their CTO, who had all the questions you’re talking about, right? How do we go about it? How long? And so on. And they had some metrics that they were getting in terms of the number of PRs and how long it takes for PRs to get merged, and they were thinking, “That’s probably what we’re going to use, right?” After we talked about it, they were already bought into the idea of doing the surveys; their questions were more about the cadence of the survey iterations.

So, with them, we talked about doing one as a baseline and then doing one every four weeks. This was a six-month trial that they were going to start with, right? That gives them enough time to see trends. The incredible thing was that they decided to use the retrospectives that they did with their teams to revisit how the Copilot adoption was going. And they found that to be the strongest signal. Layering that on top of what they were getting from the surveys is ultimately, I think, the thing that convinced them.

I believe they stopped paying attention to the PR completion time type of metrics at some point, and I think that was a healthy choice for them. I will also say that there’s another expectation to set: companies are different, and engineering teams and engineering systems are different. So, one company’s results that come in three months can take a year in another, or they could be at a different level or a different number. It is a tool that needs some tailoring to the purpose that you have for it. So all this advice we’re talking about, do surveys, make sure you provide enough time, ask the same questions every time, is practical, but I think there is some homework, for lack of a better word, that leaders need to do to make it work for them.

Abi Noda: Interesting to hear the example of integrating into retros, and I can see how that would provide a lot more qualitative insight on top of the survey data being collected. I want to ask you another question about how you take all this data that we’re talking about. When I look at the first study you conducted, for example, those percentage findings were compelling, but what’s missing, and what I hear from many leaders, is how do you translate that into dollars, right? How do you go to a finance person or the CFO and say, “This is making us more money or saving us more than it costs.” Any advice? I know there’s no obvious answer to that question, but what’s your view on approaching that problem?

Eirini Kalliamvakou: I will say that it’s not one person’s job to make that translation into money. And the other thing that I’m going to say is that I don’t know that that’s necessarily the right thing to do. I know that’s typically what organizations do for tools. Eventually, they want to translate them to dollars and cents, but I think there’s a little more to this one.

So, it goes back to what we said earlier about what you’re trying to improve. Without an answer to that, we don’t know what metrics to look at, or we could be looking at the wrong metrics, which means you see no result, or the wrong result, in front of you. It’s the wrong number, and then you translate it wrong into dollars and cents. So, that intentional focus on the improvements that need to happen and how this tool will drive those improvements is table stakes, right? That’s the essential first part.

That’s another expectation to add to our growing list of expectations. We should also be aware that Copilot, for example, is a tool that works when developers are actively coding, and research shows developers are actively coding less than half of their time, right? So that’s an improvement that can help the organization move forward in terms of dollars and cents, and Copilot is going to help you identify it as a problem: you will probably notice that people are not spending enough time in their editor to fully use the boost that Copilot brings. When I say it needs intention from the organization and the leaders, I mean intention in terms of thinking about metrics and improvements a little bit more holistically.

The other point I wanted to make, as I mentioned, is about the return on investment and the way to look at it. I’ll go back to what I said earlier: Copilot and the AI tools we have right now are just what we have today. It’s going to change, and it’s going to change sooner than we think. The whole picture of how engineers and engineering teams work will change very quickly. So I understand the temptation and the impulse to look at something like Copilot from the perspective of dollars and cents. But I would also say: maybe look at it as the first step, or the training wheels, of getting started with AI, and use the adoption of Copilot as your first step in learning how to work with AI, because it takes a little bit of figuring out and a little bit of getting used to.

And sure, you can focus on the productivity aspects as well. I’m sure that if you formulate it correctly, you will see that there are productivity improvements that work for you. But also look at the benefits of getting started with AI and AI tools in your organization, because there are going to be more tools and a bigger wave of evolution happening. I fully understand that I’m saying that from the vantage point of being on an innovation team and having a sneak preview into things. But I would urge you to look at ROI also from the perspective of the cost of being at the wrong part of this wave of evolution once things have happened: maybe your company stayed behind the times, or now it’s rushing to catch up because the competition adopted things earlier, or more successfully, or gave themselves more time to adapt.

Abi Noda: Earlier, you mentioned the importance of proper training and education when rolling out a product like this. I did want to ask you about that; I have heard leaders recently talk about having trouble, or not feeling fully confident, adopting these types of tools within an organization. What is the best practice for introducing and rolling out a tool like Copilot across an organization?

Eirini Kalliamvakou: Yes, I think preparing the ground and providing training is key; after giving it enough time, I think it is the most key thing. What happens here is that adoption in an organization, as happens with developer tools a lot of the time, is very organically driven by individuals who adopt the tool. Then they go into their workplace and say, “I want to continue using that tool because I’ve seen how much more efficient and productive and happy it makes me.”

So one of the things that we have seen be successful is the first wave of individuals who adopted the tool and then advocated for it in the organization becoming, essentially, the trainers for others for some time. It’s also, I think, an easier way: it eases the tool into how teams work and eases the team into adopting the tool without anything seeming too radical or too top-down, because those approaches don’t always work. So, having a group of people who were the early adopters, inside the company or even outside it, who know the tool and have done that first wave of work to figure out what works and what’s best practice, and having them be the trainers for a while, is something that we’ve seen organizations use successfully.

Abi Noda: Well, this has been a great discussion of tips, advice, and practical examples of how organizations advocate for and roll out tools like this. I think listeners will find it valuable. I want to move on to two larger questions that seem to come up a lot when I talk with leaders. The first one is just the obvious, ubiquitous question of how AI will impact the developer experience. Again, we are getting so much marketing on this, but can you give us a vivid perspective on the impact of these types of tools?

Eirini Kalliamvakou: I will leverage my perspective as someone who’s part of an innovation team. Our team is tasked with looking forward and trying to outline and prototype what the future of software development will be and how to leverage opportunities like AI and the best and latest capabilities. So I’m going to leverage that.

The question is about how AI is going to impact the developer experience, so let me start with how it is impacting it right now. That was a lot of what we just talked about, right? This is very much a second pair of hands for developers: if they have undesirable, drudge-work tasks that they have to do, AI tools right now help them get through those tasks faster. So it brings them the delight of doing less of that boring, repetitive work, and it also saves them time. This is what we have seen so far, and it is changing the developer experience from that perspective: both the satisfaction aspect of productivity and the speed gains and time savings part of productivity.

I think that what’s coming next is AI tools that are going to be, instead of a second pair of hands, more of a second brain. And that means helping with more complex tasks, helping developers tackle complexity, and saving them more mental capacity. I am a big fan of that; if you let me, I’ll be talking about this for hours. But one of the things that comes up, again and again, is that with software systems growing so fast and every company becoming a software company, there’s so much complexity, never mind the overabundance of information, that we just cannot fit it in our brains anymore. Right?

And it’s not just developers. All humans have, or ultimately will have, that challenge. But let’s focus on developers at the moment. They are working with more and more complex systems that are getting bigger and bigger. Maintaining the mental model of these systems in their heads, expanding it, keeping it current, and so on becomes a huge challenge. And that’s, I think, the next frontier for AI: “Okay, it saved us the time. Now, it’s going to save us some mental effort and some mental capacity.”

A lot of what we’re seeing as trends in AI also agrees with that. We’re seeing natural language gaining more and more ground as a programming language, which removes a lot of the complex notation that people need to keep in their brains. We see AI agents that are able to complete bigger tasks on behalf of users and can work at different levels of autonomy. And we’re seeing trends like vision models that are coming, or are almost here. These are all capabilities that will mean AI can be used for way more complex and impactful work.

And what that will mean for the developer experience is that developers get much more help dealing with complexity. We see that they’re going to become the architects of systems; they’re going to be constructing things with AI. That’s why I said earlier that the paradigm for how software development will be done in a year, two years, three years is going to be strikingly different. So getting started with Copilot to get your bearings working with AI, as a way to more naturally transition into what’s coming next, seems like the real benefit, or part of the benefit, besides the dollars and cents.

Abi Noda: Interesting to get a teaser of what is to come with AI-based developer tools. The other fun question I always get is what the implications are for how we measure productivity. I was having a conversation with Nicole Forsgren a few weeks ago, and she said something like, “This is going to finally put a nail in the coffin of measuring lines of code if we’re using tools that literally generate lines of code for us.” Another interesting angle on this is data analysis, or even collection: using AI-based approaches to collect feedback or insights from developers or to analyze their feedback. So I’m curious to get your view on that question.

Eirini Kalliamvakou: First of all, I very much align with this nail-in-the-coffin-of-lines-of-code comment. Yes, one can hope that that’s where this is going. And I think it is where this is going, because those are measures of activity, right? When you introduce AI, today it’s Copilot presenting you with lines of code that you accept and then move on to the next ones; tomorrow it will be a whole task. So, activity was already muddled, and I think it becomes even more muddled when you have AI in the mix. I expect it will be the end of some of these very surface-level activity metrics.

I think that what we have seen so far in the research we’ve done in terms of effect guides us on where to focus next in measurements. I think things like flow are going to become fundamental, right? It will be the foundational thing that we look at and try to measure as an expression of how this AI tool helped me. I think things like switching costs, right? I mentioned how little time developers spend in the editor, which we’ve seen from day-in-the-life research studies. And they spend so little time in the editor for a multitude of reasons. Some of these have to do with just switching tasks or switching tools, right? They have to go to Stack Overflow to look for something, go to a meeting, and so on.

So, any help that AI provides, whether it’s Copilot when it comes to coding or other AI tools that you’re using in conjunction, ones that summarize your meetings or your emails, and so on. Switching costs, like task switching and context switching, will also be good candidates in terms of what to measure to express productivity.

I think that cognitive load is also going to be one of those; I just mentioned I’m a big fan of that. And I think we’ll need to do some work to figure out the right metrics for it, because we don’t necessarily have them; we’re going to have to look for the right metrics from other fields that have figured out cognitive load and how to measure it. But I think that cognitive capacity or mental energy saved, flow, time in the editor, switching costs, those sorts of things will be the next wave of productivity metrics, which is a really good thing. Can’t wait.

Abi Noda: Eirini, this has been an insightful conversation, starting with a deep dive into your recent research, then practical advice for leaders trying to bring in tools like this, and then your thoughts on some of the larger looming questions we hear all the time. Thanks so much for your time today and for coming on the show. Really enjoyed this conversation.

Eirini Kalliamvakou: Thank you very much for the invitation. It’s always a pleasure to talk with you, and thank you for giving me the platform; I want to spread the word about some of these things and get the guidelines and the thinking out there. Thank you very much. This was wonderful.