Laura: Welcome to the Engineering Enablement podcast, I’m your host, Laura Tacho.
There is no doubt that the AI tool ecosystem is expanding very rapidly right now. There are new tools popping up left and right in existing categories like coding assistants, but also in brand new categories, everything from bug detection, load forecasting, test generation, context management, you name it. Organizations are really struggling to balance capitalizing on this new functionality with the need to not distract their teams. They want to experiment, but they also want to know that the new tools are actually going to bring a return on investment back to their business and not just be a waste of time and money. In order to figure out which tools are the right ones to invest in, especially when making big purchasing decisions, we need to root our evaluations in data. These data-driven evaluations are essential to decision-making when picking out new AI tools for your organization. They shouldn’t just be a checkbox exercise in your procurement process.
In order for these evaluations to be most effective, they should be rooted in reality and measured, so that you get reliable signals to help you make the right decisions.
Recently, Abi Noda and I sat down and shared our approach to data-driven AI tool evaluations. We’re going to go over all the questions that you need to ask, like how do you pick which tools to evaluate, how do you judge them fairly and how can you use data to make better decisions about what to buy?
Keep listening to hear our take on how to run data-driven evaluations of AI tools.
Welcome everyone to running a data-driven evaluation of AI engineering tools. I know this is a very timely topic for a lot of you as you’re working toward your budgets for 2026 and trying to figure out which tools you should invest in. There are new tools popping up all the time, and we want to give you a bit of a blueprint for how to think about evaluating these tools so that you can make the right decisions based on data and really understand the ROI of these tools and whether or not they’re going to be effective in your organization. I’m joined here by Abi. Abi, do you want to introduce yourself?
Abi: Yeah, great to be here. CEO of DX, and yeah, I’m excited for this topic. This is something both of us are constantly talking with different leaders and companies about, so excited to dive in.
Laura: We want to start by talking a little bit about what the challenges are that we’re seeing when we’re out talking with our partner companies. I was at KubeCon last week talking with a lot of companies about evaluating tools, which tools should they pick, feeling like they’re missing out on something, and some of the common challenges that we’re seeing are missing out on new capabilities. So let’s say we signed an enterprise license agreement with tool A and then tool B comes out with some new capabilities and then we have a little bit of the FOMO, the fear of missing out. We want to know, should we be trialing this tool? Is this tool actually a better fit? Are we doing our developers a disservice by sticking with tool A when actually tool B might be a better fit?
Usually we do these evaluations as part of a procurement process or purchasing process, so that’s one way that companies are doing evaluations. So we want to fairly evaluate tools before making a purchasing decision. We might want some data to understand what the ROI is going to be so that we can project how this tool is actually going to perform organization-wide if we do procure it. There’s another one that’s a little bit less common, not about a new tool, but Abi, do you want to share that other challenge of evaluation?
Abi: Yeah, I mean we’re coming up on a year and a half, two years now, these tools being out in the wild, a lot of companies have been trialing them or signed one, maybe two-year contracts. So as the landscape changes, folks are constantly coming back to the surface to ask, “Okay, let’s take a fresh look at these tools. Let’s reevaluate.” So Laura, like you said, I think what I see is a lot of companies either at early stages who are doing really disciplined evals of a short list of tools, and then we also hear a lot of leaders come say, “Hey, we’ve been using tool X. We’re not sure about it. Should we be looking at tool Y, tool Z? And how do you go about that comparison?” So no matter where you are in your journey, I think you’re probably a couple to several months away from having to do a reevaluation, especially at the pace at which things are changing.
Laura: Absolutely. I think the reevaluation piece is something that leaders often overlook or don’t consciously think about because the triggering event just isn’t so apparent. We have a triggering event when a new tool comes out, or I saw this news article, or developer A came to me and said, “Look, new shiny thing. We could use this for A, B and C.” Whereas evaluating your incumbent tools isn’t really a muscle that we practice for a lot of our other tools. It’s not often that we’re reevaluating the efficacy of our CI/CD tool, for example, the incumbent one. But for AI tools, because the market is moving so fast, that’s definitely a category or a time when you would want to reevaluate a tool. And Abi, in your opinion, how often should folks be reevaluating their incumbent tools just to do benchmarking or comparison against one another?
Abi: And Laura, you and I were talking about this and curious your thoughts, but I think we sort of highlighted it at eight to 14 months. I mean, at minimum if you’re signing one year short-term contracts with vendors, you’re going to come up for air every 12 months, so you need four months lead time before that renewal decision, so that would be every eight months. If you’re earlier in your journey, of course, do this now, but as you begin rolling out tools, I think eight to 14 months is a good cadence for reevaluation, but what are you seeing out in the wild?
Laura: Yeah, I think that’s just about right. For the companies that are consciously doing reevaluation, I think reevaluation is sort of implicit in every challenger evaluation. I might use this terminology, or Abi and I will probably both use it: incumbent tool. So we’re talking about the tool that you’ve already procured and are using, and then we have a challenger tool, which is the new kid on the block. This is the new tool that you want to evaluate. I think every time you evaluate a challenger tool, you by virtue also need a good evaluation of your incumbent tool, which means you need data to understand how that tool is performing. So even if we’re not doing a formal evaluation all the time, just at that 8 to 12 or 8 to 14 month target, we want to make sure there’s continuous measurement and a lot of data around this, and we’ll get into exactly what that looks like in the next half hour together.
Abi: And Laura, there’s other triggers sometimes too. If there’s a population of developers who have organically gone off and tried out a new tool and are raving about it, that can push teams over the edge. Sometimes it can come down, top down from leadership. Sometimes there’s just a momentous change in the industry and we need to respond to that.
Laura: Sometimes an AI tool will have a private preview that goes crazy and everyone wants to get on the wait list, and then all of a sudden it’s in beta and everyone can have access to it, and now all of a sudden we have developers using a tool that we maybe didn’t even know was available three months ago. I mean, I’ve heard all sorts of crazy things about this. Just to frame the rest of this conversation, the way that Abi and I recommend thinking about this is that a data-driven evaluation, an evaluation that’s rooted in experimentation, in data, results, and outcomes, is really essential to your decision-making, to be able to say, “Yes, this tool is better for our org, or we’re going to stick with the incumbent, or we’re going to procure a different tool.”
It shouldn’t just be a checkbox in your procurement process. We can get a lot of valuable insight about usage patterns and about populations that are better served by these tools during the evaluation process. So think of it as a preview and as a controlled experiment to figure out how that tool can best help your organization, instead of just “go play and hack around with this tool and tell me how you feel about it.” We want to be a little bit more scientific and disciplined in our approach.
Abi: Laura, like you said earlier, I think one action we’ll take away for folks is go look at your procurement cycle, your procurement calendar and start planning now for your renewals, your evaluations, and re-evaluations tied to that procurement calendar. Laura, I want to ask you when you get started on this, okay, we’re going to go shortlist and evaluate some of these tools. The first question is how many tools, which ones, should I shortlist? What’s your guidance around shortlisting and actually choosing the set of tools you’re going to try to evaluate?
Laura: I’ll tell you what I’ve seen out there. The biggest evaluation of simultaneous tools I’ve seen is six at once. Usually, what I see is one or two plus the incumbent, so we could say three. I think at this stage in the game, it’s very uncommon for an organization not to have an incumbent tool, and so when we get into a really true bake-off, six different tools is the biggest one that I’ve seen. I don’t know, have you seen anything bigger? Or I guess smaller.
Abi: I haven’t heard of more than six. I’m curious, do you think org size plays a role? I imagine if you’re trialing six tools at once, you’re going to need sort of different cohorts and population subjects, if you will, to try and trial them on?
Laura: Yeah, that’s exactly it. I think that’s a really good point to bring up. So the guidance that we’re giving here is pretty general, and you need to shrink it down or blow it up based on your organization’s size. The company that was trialing six at a time had a developer population that could support six simultaneous cohorts, where the only differing variable was the tool that they were using. So they had a baseline of performance, and then they introduced different tools to different developer populations with a pretty good mix of developers in each one.
You need a significant number of developers to do that well and to get reliable results. That’s not possible for your fifty-person startup. Then there’s cohort design: are they going to be opt-in participants, like, “I would like to try out this tool. I want to use it,” that self-selecting kind of person, versus me saying, “Hey, Abi, guess what? You need to use Cursor for the next six weeks and I’m going to look at your PR throughput and send you a bunch of surveys.” There are different styles of how organizations bring people into their experiments or their evaluations, and then the organizational capacity to manage an experiment is definitely something to consider.
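As a rough illustration of that kind of cohort design, here is a minimal sketch in Python, using entirely hypothetical developer records and cohort labels, of stratifying developers by seniority and primary language before assigning them to tool cohorts, so that the tool is the only variable that differs between groups:

```python
import random
from collections import defaultdict

# Hypothetical roster: each developer has attributes we want balanced across cohorts.
developers = [
    {"id": "dev-001", "seniority": "senior", "language": "java"},
    {"id": "dev-002", "seniority": "junior", "language": "java"},
    {"id": "dev-003", "seniority": "senior", "language": "python"},
    {"id": "dev-004", "seniority": "junior", "language": "python"},
    # ...in practice, hundreds of developers pulled from your HR or identity system
]

tools = ["incumbent", "challenger_a", "challenger_b"]  # hypothetical cohort labels


def assign_cohorts(developers, tools, seed=42):
    """Round-robin assignment within each (seniority, language) stratum,
    so every cohort ends up with a similar mix of attributes."""
    random.seed(seed)
    strata = defaultdict(list)
    for dev in developers:
        strata[(dev["seniority"], dev["language"])].append(dev)

    cohorts = {tool: [] for tool in tools}
    for members in strata.values():
        random.shuffle(members)  # avoid ordering bias within a stratum
        for i, dev in enumerate(members):
            cohorts[tools[i % len(tools)]].append(dev["id"])
    return cohorts


print(assign_cohorts(developers, tools))
```

The same idea scales down: a smaller organization might only have enough people for the incumbent plus a single challenger, and an opt-in trial would skip the assignment step entirely.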
Abi: It’s been implicit throughout this discussion, but when we’re talking about shortlisting tools, you may be looking at different categories of tools. So we may be looking at code review agents, we may separately be looking at IDEs. We may be looking at even models to some extent, depending on how your organization is structured around AI model usage. So in each of these areas, we need to run isolated experiments and evaluations, which I think we’re about to get into.
Laura: I think one of the things that’s been really consistent in our guidance is that a multi-vendor approach is kind of the way to do AI strategy and vendor selection right now, for exactly the point that Abi shared. We’ve got so many different interaction modalities that don’t always overlap. At this point in time, I don’t think there’s one single tool that’s going to give you chat plus agentic IDE plus cloud agents plus code review plus everything else across the SDLC. You’re going to have to mix and match, and then especially with capabilities leapfrogging each other every couple of months, or every couple of hours it sometimes seems, avoiding lock-in and keeping your options open, for lack of a better way to describe it, is a good strategy so that we don’t get pigeonholed.
But Abi, just to close off the point that you made, grouping tools by use case and interaction mode is a really good preliminary way to structure what kinds of tools you should be evaluating. Because if we’re trying to evaluate an agentic IDE against a tool that only offers chat or IDE autocomplete, those are really different tools that serve different needs, and it’s going to be really challenging to evaluate them against each other. They’re going to have different criteria, so we want to think about the categories of tools, what the purpose is, and what the business use case is as we categorize them into “I’m going to experiment with this and this and this,” and keep them as separate as we can.
Abi: So all right, we got our short list of tools, we got the licenses, we put credit cards in, so now what? How do we actually structure the actual evaluation and execute?
Laura: This is where I see a lot of organizations with the best intentions go totally off the rails, because, as I think someone said in chat, “I feel good about this tool” shouldn’t be a reason to switch vendors. Organizations that approach a trial by asking only “What do developers think about this tool?” are really not going to get the results or the data they need to make a very informed decision. Don’t get me wrong, satisfaction with tooling is an important part of developer experience, but there is a lot more to go along with it.
So our recommendation is to approach every trial with a research question. For example: we’re trying to increase time for innovation, and in order to do that, we need to reduce toil. Does this agentic IDE allow us to automate more work so that we can spend more time on innovation? That’s a good example of starting with the research question, starting with the goal first, instead of just letting everything blow in the wind and saying, “Well, let’s just see what we find out.” We want a really structured, targeted question and a targeted goal, so we can work backward from there to see whether our hypothesis is validated or not.
Abi: And like you touched on, Laura, just asking a handful of developers, “Hey, which tool does better?” That’s something you should do, but the anti-pattern we see is that that alone then leads to a decision. It’s usually just a few developers, the loudest ones and the nearest ones. As you said, I think CSAT is a really important signal when measured broadly, not just a snowball sample of friends but actual comprehensive CSAT baselining across the organization, and we’ve seen how CSAT correlates with utilization, adoption, and impact. We’ve seen that a lot in our data. But to your point, we really want to also be looking at outcomes, business outcomes. You mentioned more time for innovation. We see a lot of organizations looking at things like PR throughput and developer experience signals. We’ll get into this more later in terms of specific scoring methodology, but what are some other types of things people should be thinking about in terms of the outcomes we’re looking to drive?
Laura: Absolutely. So starting there requires an organizational position on the value that AI is going to bring to your organization, not just “we want 10x productivity gains, let’s see if tool X can bring us those productivity gains.” AI by itself is not going to do anything. AI works by improving the system, so we have to point it at a problem. Some organizations are pointing it at reducing toil and increasing innovation time. Some are pointing it at increasing throughput. Other organizations, quite a few of them actually, are pointing it at reducing the time for migrations and legacy modernization projects. Others are targeting time to market.
AI will work on whatever you point it at, but we have to have a goal in mind, and the same is true for your tool evaluations. We need to test whether this is true in a realistic setting, which gets us into the next thing we wanted to touch on: okay, we’ve got the tools we’re evaluating, we have a business case that we want to see, is this tool good for X or Y? But how do we actually choose who gets to participate in this trial? I actually think this is the most important decision for the quality of the data and the quality of the outcomes you’re going to see, and therefore the quality of your decision.
Abi: I think there are a few different considerations. There’s an opt-in format, or one where you’re rolling out across teams. One thing that’s really important, and we see this in our data over and over again, is that you really need diversity of subjects in your evaluation. We know that junior and senior engineers, and engineers working with different programming languages and technologies, have experiences and results that vary widely and are often affected differently by different tools. So I think diversity is one really important consideration. In terms of opt-in versus team rollout, what have you seen work most effectively there, Laura?
Laura: Yeah. Again, the results are really varied, and different organizations value different things. I work with one organization that really values autonomy, more of a democratic style, so opt-in was the right thing for them. They went with opt-in, and then they did a lot of enablement and training to get people to see the value so that they would opt in, and they reached 90% adoption. I’ve worked with other organizations that say, “We want the diversity. We want this to be as true to an organization-wide rollout as possible, where one tool is available for the whole org, so we’re going to chunk off this section of our organization and give them all access to the tool.”
In that case, there are still going to be the people who would’ve opted in, who naturally have a tendency to be early adopters of tools, but we’re also going to get the people who are hesitant or skeptical or who struggle with onboarding. Those are the people who are actually very valuable to have in your trial, because you’re going to uncover what kinds of support, enablement, and training you need to offer those developers in order to get the level of adoption or the level of impact that you want.
I cannot stress this enough. It is so critical to have that information coming from your trial. If you have an opt-in-only trial, and again, this might work for some organizations and only you can be the judge of that, but when we have only opt-in folks, we’re getting people who were already willing to use the tool to begin with. We’re not getting data on the cases where the tool doesn’t work, and that’s not reality when you do an org-wide rollout, if that’s the plan for the tool. So just keep that in mind as you’re selecting these cohorts.
Abi: And you touched on this, Laura, support and enablement. When we’re doing these evaluations and rolling out tools, we have to account for training and getting people onboarded onto these tools. What I’ve seen is that it’s important to offer support and enablement. What that looks like exactly is up to your organization, but at the very least, a one-pager wiki for each tool that’s delivered and made available to the folks who are trialing these tools. I think it’s important to treat each tool equally, meaning of course that if you put 50 hours of enablement into tool A and one hour into tool B, you’re probably going to see better results from tool A, so we want to control for that variable. But I do think enablement is important. Even in an evaluation setting, you are simulating what it is like to actually roll this tool out in your organization, and enablement is part of that journey.
Laura: We see in our data that organizations that invest a lot in structured enablement have better results across the board. It is really essential, so we want to build individual proficiency with the tool.
We also need to think about training and enablement at the team level, because we want to enable team workflows, not just AI use at the task level. It’s another argument for team-level cohort design, because we unlock a lot of opportunity when whole teams are using the same tool. If you just have a couple of people from a team using one tool, we’re limited in what they can use it for. It’s going to be more at the task level, and that’s not really where the biggest productivity gains are to be had. It’s more of a systems-level problem.
Abi: So we’ve got the tools, enabled, rolled out to developers or different groups or cohorts of study subjects. How long do we run this experiment for? Then how do we understand who’s winning or what do we actually choose?
Laura: A time constraint needs to be there, and we want to pick a duration that’s not too short. If your rhythm of business is… I’ll just use the two-week sprint as the example here. We don’t want that trial to be three weeks, because then we only give people a sprint and a half to get used to using the tool. We want to give enough time that the training, the enablement, and the practice can actually build on themselves. At the same time, we need to draw a line in the sand. I’ve typically seen anywhere from eight to 12 weeks, roughly a quarter. On the highest end, I might have seen six months of trial, but keeping it scoped to that eight to 12 weeks gets enough cycles in. Is that any different from what you’ve been seeing, Abi?
Abi: It goes back to what we talked about at the beginning: keeping your procurement cycles and calendar in mind when scheduling these evaluations. We also have to keep in mind the duration of the evaluation, so if we’re looking at eight to 12 weeks on the eval, you’re looking at probably a six-month lead time to purchase. It doesn’t change anything we’ve talked about, but people do need to take the evaluation duration into account in their timelines. Back to the evaluation: we do the eight to 12-week eval. What are we looking at as far as measurements and data? How are we picking who wins? Are we looking at one thing, or multiple criteria?
Laura: Yeah. If you are already measuring developer productivity with a framework like the Core 4, you’re already out ahead, because you have a baseline, you have a common language for what engineering performance means for your organization, and you can see the difference. You can segment the cohorts out and see the difference in their Core 4 metrics based on whichever AI tool they were using, or whether they were using one at all. If you are brand new to measuring AI, Abi and I put together a framework called the AI Measurement Framework, very aptly titled, where we talk you through how we think about measuring AI at an organization level. Again, I can’t stress enough that baseline measurements are critical. Getting your baseline so that you have something to compare against is the single best thing you can do to figure out what effect the tool is going to have, because without that baseline, it’s really difficult to compare. So we want to have Core 4 metrics in place to talk about impact.
We’re going to look at utilization. We want to break all of these out by attribute as well, and that’s what Abi was talking about with the importance of diversity. We want to look at how a senior engineer is doing compared to a junior engineer. What about their tenure, their seniority, their laptop age? There are just so many things that we might want to break these measurements out by. What’s important is that we have standardized scoring criteria across all of the tools, so we’ll look at AI impact metrics, which you can find in the AI Measurement Framework. Within DX, we have a nice AI impact dashboard that shows you things like PR revert rate, PR throughput, change failure rate, and time spent on innovation. There are lots of dimensions to evaluate, and those will roll back up into the original goal you had for the evaluation. But aside from those, we also have the very nuts-and-bolts evaluation criteria for your procurement process.
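To make that baseline comparison concrete, here is a minimal sketch with made-up numbers and a single hypothetical metric, comparing each cohort against its own baseline and breaking the result out by an attribute like seniority. This is not DX’s actual methodology, just an illustration of the shape of the analysis:

```python
from collections import defaultdict

# Hypothetical per-developer metrics collected before and during the trial.
# In practice these would come from your git, survey, or telemetry data.
records = [
    {"dev": "dev-001", "cohort": "challenger_a", "seniority": "senior",
     "baseline_pr_throughput": 3.1, "trial_pr_throughput": 3.8},
    {"dev": "dev-002", "cohort": "challenger_a", "seniority": "junior",
     "baseline_pr_throughput": 2.0, "trial_pr_throughput": 2.9},
    {"dev": "dev-003", "cohort": "incumbent", "seniority": "senior",
     "baseline_pr_throughput": 3.0, "trial_pr_throughput": 3.1},
    {"dev": "dev-004", "cohort": "incumbent", "seniority": "junior",
     "baseline_pr_throughput": 2.2, "trial_pr_throughput": 2.3},
]


def delta_by_segment(records, segment_key):
    """Average change from each developer's own baseline, grouped by cohort and segment."""
    buckets = defaultdict(list)
    for r in records:
        delta = r["trial_pr_throughput"] - r["baseline_pr_throughput"]
        buckets[(r["cohort"], r[segment_key])].append(delta)
    return {key: round(sum(deltas) / len(deltas), 2) for key, deltas in buckets.items()}


print(delta_by_segment(records, "seniority"))
# {('challenger_a', 'senior'): 0.7, ('challenger_a', 'junior'): 0.9,
#  ('incumbent', 'senior'): 0.1, ('incumbent', 'junior'): 0.1}
```

The same grouping works for any attribute you capture, tenure, language, team, and for the other impact metrics in your framework.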
Security: is this secure? Does it align with our security policy? What’s the price? That’s a big one. These tools aren’t really getting any cheaper, and they can definitely be quite an investment. Some other ones I’ve heard are support: how much support does the vendor offer? What does their roadmap look like? What could we expect this tool to do in the future? We’re not just buying the tool; we want to continue to be a partner. Are we going to have influence on the roadmap? All of those things are part of the same procurement process that you would use for any tool. We want to make sure that we’re treating AI tools with the same rigor. Even though they’re AI tools and it’s exciting and there’s a lot of change, we can’t skip our due diligence from a procurement perspective just because of the category of tool.
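One common way to keep that scoring standardized across tools is a simple weighted scorecard. Here is a minimal sketch; the criteria, weights, and scores below are entirely illustrative and should come from your own research question and procurement requirements:

```python
# Hypothetical weights reflecting what this evaluation cares about; yours will differ.
weights = {
    "impact_on_goal_metric": 0.35,   # e.g. the PR throughput or innovation-time delta
    "developer_satisfaction": 0.20,
    "security_and_compliance": 0.20,
    "total_cost": 0.15,
    "vendor_support_and_roadmap": 0.10,
}

# Scores normalized to a 0-10 scale by the evaluation team; all numbers are made up.
scores = {
    "incumbent": {"impact_on_goal_metric": 6, "developer_satisfaction": 7,
                  "security_and_compliance": 9, "total_cost": 8,
                  "vendor_support_and_roadmap": 6},
    "challenger_a": {"impact_on_goal_metric": 8, "developer_satisfaction": 8,
                     "security_and_compliance": 7, "total_cost": 6,
                     "vendor_support_and_roadmap": 7},
}


def weighted_score(tool_scores, weights):
    """Sum each criterion's score weighted by how much that criterion matters."""
    return round(sum(tool_scores[criterion] * weight
                     for criterion, weight in weights.items()), 2)


for tool, tool_scores in scores.items():
    print(tool, weighted_score(tool_scores, weights))
# incumbent 7.1
# challenger_a 7.4
```

Agreeing on the weights before the numbers come in forces the conversation about what actually matters to your organization, which helps keep the final decision tied to the original research question rather than to whoever argues loudest.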
Abi: And as we’re doing these evaluations, you’ve talked about this already, Laura, but we really want to make the evaluation process as close to real-life development as possible. We’re not trying to evaluate these tools by having everyone suddenly create boilerplate prototypes and show off what they built in an hour. That’s not really simulating why we’re looking to purchase these tools and bring them into the organization. So it’s really important that as we’re working with subjects and overseeing the evaluation process, we’re not having people go and do things that are abnormal or not really part of the core SDLC. We really want to simulate real-world work with these tools and then evaluate based on those results.
Laura: Yeah. I will be releasing a podcast episode with Nathan Harvey from DORA, and in our conversation we also talk about this tension between experimentation and real-life work. There’s a temptation to spend 80% of our time every week experimenting with AI, but then when does the actual business work get done? We need to keep space for experimentation, definitely, but also balance it against the reality that we have to get the business-as-usual work done as well, and we want to evaluate these tools in as close to a business-as-usual environment as we possibly can.
Abi: Oh, I was just saying, I think it’s time to wrap up so we can get into questions. But I know, Laura, what you’ve emphasized, what you’re always emphasizing with customers and companies and leaders, is this idea of starting with a question. Don’t just go into it with, “Hey, we’re trying these tools to see what happens, to see what we find.” That exploration is important too, because there’s a lot of open possibility with these tools, but you want to approach these evaluations and trials in a structured way, with research questions, backing into it with data, running fairly scientific cohort-based experimentation. I think that’s the main message that we’re both sharing with leaders today.
Laura: Yeah, so think about your who, what, where, when, why kinds of questions. When should I be evaluating this tool? Are you evaluating it at the right time? Is there a procurement question or procurement event coming up? Are you late to the game? Should you have evaluated this tool more rigorously before you procured it, and now you’re trying to go back and prove out some of the ROI? Which tools should you evaluate? Of course, solicit information from your developers about which tools they’re interested in, but keep a broad perspective on different use cases and different interaction modalities. We might have an agentic IDE and a code review tool, and even though they’re both AI tools, they’re different classes of tools.
Structure the evaluation, make sure that you’re evaluating against common criteria, use Core 4 metrics, look at our AI Measurement Framework, and then also consider things like security, cost, support offering, and roadmap influence, the kinds of things you would check for any other procurement. Then make sure that your measurements for the evaluation line up with that original research question. We don’t want to just go out and see what we’re going to see. We want to have a target and treat it really scientifically. Measurement is really, really key to all of this. Good measures can look all kinds of ways, but without measurement, we’re not going to be able to draw accurate conclusions.
Great. Why don’t we jump into Q&A and Abi, I’m going to take the first one.
“How reliable are engineer self-evaluations or estimates of time gained from GenAI coding assistants?” If I go back, let me just share the AI Measurement Framework again. We do recommend AI time savings as a very high-level leading indicator of whether a tool is helpful, as one aspect among many that we should look at. So not the time savings alone, but as one data point in a collection of metrics. And this is a really interesting question, because when I released the Q4 AI impact report, which we’ll drop a link to in the chat, someone from DX will do that, I noticed that over time, the time savings from AI tools isn’t actually going up, even though adoption and usage intensity have gone up a lot. And I don’t think it’s because we’re saving the exact same amount of time as before.
I think it’s because people, human beings, are just not great at estimating time in general. A lot of developers have been using these tools in their daily work for a year or more, so when they go to estimate how much time they saved by using AI, they’re not referencing a pre-AI value. They’re referencing how they used AI six months before. So it’s becoming really hard, as adoption has swollen to 90, 95%, to do a very good analysis of the before and after of no AI versus AI.
In DX, you can do the before and after, so I could look at Abi’s stats before he used a tool and then after, and compare him to himself, or compare cohorts to themselves collectively. But for time gained from AI coding assistants via self-report, humans are not great at estimating time in general, so always keep that in mind. If we’re using it as a high-level, directional indicator, a proxy for “is this tool useful overall?”, that’s fine. I certainly don’t want to see that it’s saving no time. You just have to match the decision you’re making to the data and the precision of that data. Time savings is a very big, high-level metric that can be an early indication of whether a tool is useful.
In order to get very precise measures, we have to do some pretty invasive measurement, which we don’t support. I don’t support personally. Like screen recording, that’s how we get some of those measurements in academic studies. We don’t want to go that route. So self-reported time savings is the alternative. As imperfect as it may be, we still believe it is a good directional indication of usefulness of the tool.
Abi: Yeah, let me just add: when you’re comparing across different tools, even if there’s some systematic error in that self-reported time savings measurement, longitudinally or cross-sectionally across tools it’s a reasonable comparative metric for understanding different tools. But plus one to what you said, the METR study had some really interesting insights about the accuracy of developers’ self-reported time savings. I’d love to see any PhD students out there, or future PhD students, pick this up. The METR study is the only real observational study comparing observational data against self-reported data on time savings, and it would be so interesting to have more research on that topic in particular. But yeah, your advice is spot on, Laura.
Laura: Absolutely. All right. You want to take one, Abi?
Abi: Yeah, I’m going through the list here. “Why go for just one selected tool? Isn’t an option to get more than one and give users the option to use the tool that best fits their needs?” That’s a really good point. Even the results of an evaluation may not be pick one. It may be, “Hey, we’ve actually identified that different tools are better for different purposes or different groups of people.” Qualitative data here is really useful to complement the quantitative measurement to understand these types of variances. So yeah, it’s quite common today, especially at larger organizations, to standardize on multiple tools. Now, that doesn’t typically mean every tool out there. It’s still usually a short list. That’s important just for procurement reasons, like negotiating discounts and contracts with vendors, but I think arriving at a point where, hey, there’s multiple that we feel are worth having a relationship with and green-lighting in the organization is a perfectly normal and reasonable place to land with your evaluation.
Laura: There’s two I think that I’ll answer together. One is, “Have we seen a significant difference in performance that is attributable to a tool of choice?” And then another one from Samuel, “One question I currently have is whether it’s too early to choose a particular vendor and commit to it. I feel like these tools are quickly evolving, and new ones are created every month.” We feel that too. There’s definitely a lot of whiplash.
So this graph that I’m showing is a heat map from the most recent Q4 AI impact report that we published last week. What you’re looking at here is PR throughput broken down by tool, from a sample of 32,700 developers across 170 companies using those tools, so a pretty wide variety of tenure, language, and seniority, quite a lot of diversity in these sample sets. To answer the question, when we look at this data, it would be easy to draw the conclusion that Claude Code and Cursor are outperforming every other tool, or that they’re better or more capable tools than, for example, Windsurf. I just wrote a newsletter article about this if you missed it, where I go into a very deep analysis. In general, we are seeing that agentic IDEs are slightly edging out IDE code completion tools and other tools, but not at an overwhelming rate.
In fact, the differences in this graph can be more easily explained by things like organization size and complexity than by the tool itself. Anecdotally, what I’ll share is that I’ve noticed that any tool with the right amount of support and enablement will be more successful than any other tool without it. There are definitely going to be exceptions to that rule, where we need very specific model or tool capabilities to solve a very specific problem. But when we’re talking about AI assistants in general, it’s kind of like the grass is greener where you water it in terms of impact. We are seeing a slight trend toward agentic IDEs being favored by developers. I know I personally prefer to use Claude Code. I feel much more productive with Claude Code than I do, for example, with Cursor or a different tool. That’s just my personal preference, and developers will have their own preferences as well.
To answer that second question about is it too early to choose a particular tool and commit to it, they’re quickly evolving, that definitely is the case. So this graph is pretty… I wouldn’t say it’s normalized because these tools have been around, but if we looked at this graph from two quarters ago, half the tools weren’t there or we didn’t have enough data on them. Things are evolving really rapidly, and I would expect that in two quarters from now, there’s going to be another tool on here that doesn’t exist yet either. So the multi-vendor approach is something that we suggest exactly for this reason, to keep your options open and to be able to capitalize on new movements in the market.
Abi: And I just want to add, with this visual, this is not a DX endorsement of any particular tool. The data is a snapshot in time, and as we’ve talked about today, it’s really important to create your own short list and go evaluate these tools in your organization. There’s no one-size-fits-all winner for these tools. Every organization is seeing different results based on their context.
Laura: Absolutely. Very well said, Abi, thank you for the reminder. One thing just to close out: the big idea for today is that the most important data to look at when it comes to tool performance is the data from your own organization, and that means continuous measurement is so important when evaluating tools. We want to have great baselines. We want diversity in our cohorts. We want to be data-driven: start with a goal and make sure that you have measurement in place to really understand the delta between someone who’s using this tool and someone who’s not, or who is using a different tool.
Treat this like a science experiment. Come to the table with a goal in mind for your evaluation, and that’s going to lead you to more reliable data. That’s all from us. Appreciate you joining us. Thanks everyone, and we’ll see you around the internet. Take care.