
Unpacking METR’s findings: Does AI slow developers down?

Kali Watkins

Product Marketing

This post was originally published in Engineering Enablement, DX’s newsletter dedicated to sharing research and perspectives on developer productivity. Subscribe to be notified when we publish new issues.

A recently released study from METR found that developers were actually slower when using AI coding tools than without them. The finding was unexpected, not just for the researchers conducting the study, but for the developers participating in it as well.

Last week on my podcast, I was joined by Quentin Anthony, one of the study participants, to get his firsthand perspective on what it was like to be part of this research. Quentin brings a particularly compelling viewpoint to these findings: his work involves training AI models; he was an early adopter of AI coding assistants; and he actually experienced speed-ups from AI during his participation in the study.

We talked about why, despite some individual speed-ups, the study still found an overall slowdown. Quentin shared why developers may think AI is helping more than it is, and offered tips for using AI tools effectively based on their current strengths and weaknesses.

Below is an excerpt from our discussion, edited for readability. You can listen to the full conversation here.


How was the METR study conducted?

Quentin: METR does a really good job of not affecting your workflow. They just wanted to observe, so they said, “We see you’re doing this work on GPT-NeoX. Can we just have you write a detailed journal over what you’re already doing, write the issues that you already planned to do, and set up a little bit of a roadmap of the features you want to add? Then we will categorize them and tell you which ones you can and can’t use AI on. You can use whatever AI tools you want.”

I’ve heard some assumptions about the study that we had to use Cursor or Claude. That’s not true. In the paper, Cursor and Claude are frequently mentioned as examples of tools that many participants used, but they weren’t required.

The researchers had me write a list of all the issues I wanted to tackle within the next couple of months. I created GitHub issues for these, and then I estimated how long they would take me with AI and also without AI.

In general, I felt this study was representative of my day-to-day work. I used AI about the same amount as usual, on the same codebase.

What was your reaction when you found out about the results?

Quentin: I was getting sped up by AI, so I thought that’s what we’d see from everyone in the study. At the end, they sent me the early draft of the paper and said, “We think you’ll be very surprised.” And I was. I was reading the paper myself thinking, oh my god, it’s an average slowdown—totally unexpected. Because in all of my interviews and discussions, I had been talking about how great AI was and how it was speeding me up as expected. It was shocking to me, and I think to the study designers as well.

I think people overestimate the speed-up because it’s so much fun to use AI. We sit and work on these long bugs, and eventually the AI solves the bug. But we don’t account for all the time we actually spent; we just remember how much more enjoyable it was to talk back and forth with an AI than to grind alone through a bug that no one else can really help with.

I think the main contribution from METR is that they sat down and actually put data to the claim, and were willing to publish even when the results weren’t what they expected.

Some critics of the METR study suggested the participants just weren’t good developers. What’s your take on that?

Quentin: I read through the public GitHub issues from the study, and these devs were solid. I think people confuse “not being great at prompting” with “not being a good developer.” But prompting skill is different from engineering ability.

The bigger issue is that some tasks just don’t map well to what models can do today. The study is really about task-model fit, not developer aptitude. You can be an experienced engineer and still get slowed down by AI, especially if the model isn’t suited for what you’re trying to do.

Why do you think AI tools felt faster but actually weren’t?

Quentin: I think there are tasks like documentation, unit tests, refactoring—things AI can one-shot, and it feels really good because these are plumbing tasks that we all have to deal with as devs, but don’t want to do. The first time people feed this to an LLM and get it back in 10 seconds with their problem solved, it’s so enjoyable that you try to apply that approach more and more to other tasks in your workflow.

There’s also a sunk cost fallacy. I’ve already spent 30 minutes trying to get this model to even write me a scaffold for a kernel. Do I spend 30 more minutes hoping it’ll figure it out? Or do I accept that I wasted 30 minutes on this task that I’m already stressed about, and start from scratch as a human?

I think developers need to fail fast. So, time-box your work with an LLM: I’ll try to get the LLM to write this function, but if it’s not there in 10 minutes, I need to stop myself, go write it myself, and then maybe come back to the LLM or try another model.
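
A minimal sketch of what that time box could look like in practice, in Python. The ask_model helper and its return shape are hypothetical stand-ins for whatever tool or API you actually use; only the 10-minute budget comes from Quentin’s rule of thumb.

```python
import time

TIME_BOX_SECONDS = 10 * 60  # the 10-minute budget Quentin describes

def timeboxed_attempt(initial_prompt, ask_model, budget=TIME_BOX_SECONDS):
    """Iterate with the model until it succeeds or the time box expires.

    ask_model(prompt, history) is a hypothetical stand-in for whatever
    tool or API you actually use; it should return (reply, looks_done).
    """
    deadline = time.monotonic() + budget
    history, prompt = [], initial_prompt
    while time.monotonic() < deadline:
        reply, looks_done = ask_model(prompt, history)
        history.append((prompt, reply))
        if looks_done:
            return reply  # the model got there inside the time box
        prompt = "That didn't work; here's what failed: ..."  # refine and retry
    return None  # time box expired: write it by hand, revisit the LLM later
```

If the call comes back empty, that’s the cue to stop prompting, write the function by hand, and maybe revisit the LLM (or another model) afterward.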

You’ve talked about “context rot.” What is that and how should developers handle it?

Quentin: I didn’t come up with “context rot”; I read about it in a Hacker News thread. Context rot is when the context fills up with irrelevant information and the model gets overwhelmed and confused. We can intuit this because it happens to humans, too. If you’re doing a reading comprehension task on a super long passage, and the key fact is buried in a bunch of irrelevant or confusing rabbit holes, it’s harder to pull it out than if the passage were just two sentences.

So what developers should do is create new chats whenever the model starts hallucinating or repeating itself. You can start to see signs of models getting confused about what was earlier in the context. Whenever you see that, you should break the chat. Models are really good at summarizing, so you can feed the chat into a model and say, “Summarize this entire conversation into its key points.” Take that summary as your next prompt and start a new chat.
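
As a rough illustration of that “summarize, then start a new chat” step, here’s a small Python sketch. Everything in it is assumed scaffolding: call_model stands in for whatever chat API you use (a list of role/content messages in, reply text out), and the repeated-reply check is just one naive way to notice the confusion Quentin describes.

```python
def looks_confused(reply, history):
    """Naive stand-in heuristic: treat a reply that repeats an earlier
    assistant message almost verbatim as one visible sign of context rot."""
    earlier = [m["content"].strip() for m in history if m["role"] == "assistant"]
    return reply.strip() in earlier

def compact_and_restart(history, call_model):
    """Ask the model to summarize the old chat, then seed a fresh one with it."""
    summary = call_model(history + [{
        "role": "user",
        "content": "Summarize this entire conversation into its key points.",
    }])
    # The new chat starts with only the distilled context.
    return [{"role": "user",
             "content": "Key points from a previous session:\n" + summary}]

# Usage sketch: after each reply, check for signs of confusion and,
# if needed, break the chat and continue from the summary.
#   if looks_confused(reply, history):
#       history = compact_and_restart(history, call_model)
```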

What’s your advice for organizations that want to give their developers guidance on using AI tools?

Quentin: Organizations should focus on specific tasks for using AI. I sometimes hear the claim that larger organizations are worse off for using AI, but I don’t agree with that. I think the real issue is when we use AI for tasks that it’s currently not well-suited for.

It’s obvious that people should be using models for refactoring, test cases, documentation, and commenting existing code to make it more human-readable. These are tasks any model can do today; they’re very straightforward, and any company should be using AI for them. It’s code generation and producing novel functions where things get tricky. There’s a reason METR is focusing on research: it sits at the frontier of what models can’t quite do yet, and it’s a common failure mode for a lot of models.

What I’m trying to emphasize is that tasks are how people should be approaching this problem. The difference between whether AI speeds you up or slows you down isn’t necessarily about the codebase or the team, it’s about what specific tasks you’re using the model for.
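
One way a team could make that task-first framing concrete is a small policy table it maintains and revises as models improve. This is purely an illustrative sketch seeded with the tasks Quentin names; neither the table nor the categories come from the study.

```python
# Illustrative task-fit policy, seeded with the tasks Quentin calls out.
# "good_fit" = plumbing tasks models can usually one-shot today;
# "poor_fit" = novel code generation, where slowdowns are most likely.
AI_TASK_POLICY = {
    "refactoring": "good_fit",
    "unit_tests": "good_fit",
    "documentation": "good_fit",
    "code_commenting": "good_fit",
    "novel_function_generation": "poor_fit",
    "research_frontier_code": "poor_fit",
}

def should_reach_for_ai(task: str) -> bool:
    """Default to writing by hand anything not explicitly marked a good fit."""
    return AI_TASK_POLICY.get(task) == "good_fit"
```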

Another piece of advice is that developers need to self-reflect. They need to step back and think: Am I actually being sped up by this? Is this model producing good code?

With models evolving so quickly, how do you decide when to switch or try something new?

Quentin: I keep my model horizon small and don’t change very often. When I do switch to a new model, I change very little at first and then gradually adjust my workflow. I think these developer hygiene practices are the most effective approach—it’s just another form of digital hygiene. Developers now need to be thoughtful about AI tools and how we interact with them.

Let me clarify. I’m not saying we should stick to old models and never switch, because there are exciting new models coming out, and I use them too. What I’m saying is that when I move to a new model, I start with tasks that I know very well. The key is tiptoeing into new models while being hyper-aware of when they’re failing and looking for patterns in those failures. Don’t dive in with everything at once, because then there’s too much changing simultaneously. You can’t identify the model’s specific failure modes, and everything becomes a mess.
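
Here’s a hedged sketch of what “tiptoeing in” might look like as a tiny harness, assuming a hypothetical run_model(model_name, prompt) wrapper and a few tasks you already know well enough to check quickly.

```python
# Hypothetical harness for trying a new model on familiar tasks first.
# run_model(model_name, prompt) is a stand-in for your actual tooling;
# each known task carries a quick check you trust because you know the task.
KNOWN_TASKS = [
    {"name": "refactor_helper",
     "prompt": "Refactor this helper into smaller functions: ...",
     "check": lambda out: "def " in out},
    {"name": "docstring_pass",
     "prompt": "Add docstrings to this module: ...",
     "check": lambda out: '"""' in out},
]

def trial_new_model(model_name, run_model, tasks=KNOWN_TASKS):
    """Run a new model only on well-understood tasks and collect failures."""
    failures = []
    for task in tasks:
        output = run_model(model_name, task["prompt"])
        if not task["check"](output):
            failures.append(task["name"])
    return failures  # look for patterns here before widening usage
```

The point isn’t the checks themselves; it’s keeping the trial set small and familiar so the model’s failure patterns are easy to spot before you change more of your workflow.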

Do you think we’re ready for agents to handle complex development workflows, or are we getting ahead of ourselves?

Quentin: When navigating a huge space like this, it helps to be very grounded and not follow extremes on either side. Agents won’t take over everyone’s jobs, but they’re also not some totally spurious, silly thing.

I think agents will be very important going forward. I’m working on them myself. Being able to actually offload an entire process instead of just prompting a model is super important. But the technology isn’t quite there yet.

I think we’re in an early ChatGPT moment with agents, where the distribution of useful applications is still pretty small. This will change over time, but right now the technology needs more development before agents can reliably handle complex, multi-step workflows.

So while I’m optimistic about the future of agents, I think we need to be realistic about their current limitations and focus on the fundamentals (like properly mapping work to tasks and tasks to models) before jumping into more complex multi-agent paradigms.

Listen to the full conversation with Quentin here.

Published
August 6, 2025