Abi: Quentin, thanks so much for coming on the show today. You’re a lifelong developer, researcher, and model builder, and a participant in this recent METR study, so I’m really excited to hear about your experience being part of the study and your thoughts on everything that’s going on in AI software engineering today. Thanks for coming on the show. Let’s start off today with you giving folks a brief overview of your professional background and what your focus is today.
Quentin: Yeah, so I am the head of model training at a company called Zyphra. We are in the pre-training business; we’ve built a lot of small and, more recently, larger models. A lot of the work that I did for METR itself was under my Eleuther affiliation. I’m head of HPC at EleutherAI, which is a nonprofit research lab where we also train models and look at interpretability of models. And before that, I did all my PhD work at Ohio State University.
Abi: So, METR, for a lot of folks including myself, the first time I’d heard of them was through this study that you were a part of. You’ve known the METR folks for longer than I have. What should people know about the organization, their mission, their focus?
Quentin: Okay, so I mostly know METR as a developer. They came to me before about getting bugs, and I don’t even know if that ended up working out on my side. But I was already in their system and I was an active developer on GPT-NeoX, so they reached out to me again for this study. In terms of who METR is, I believe they’re a nonprofit research organization trying to look at AI’s capability for automating research. So, as an AI researcher working on AI R&D, they’re always trying to find out how well AI automates my job already and how well they can forecast how my job will be automated in the future.
Abi: And you mentioned your background earlier, but you’re also someone who actively writes code, a professional developer. Is that right?
Quentin: Yes. I actively develop on GPT-NeoX as well as on the internal code base at my job at Zyphra.
Abi: And so when METR approached you to potentially participate in this study, what did they explain? What was their stated objective? What was your initial impression of what this was all about?
Quentin: So, METR does a really good job of not affecting your workflow. They just want to observe, so they said, “We see you’re doing this work on GPT-NeoX. Can we just have you write a detailed journal of what you’re already doing, write up the issues that you already planned to do, maybe set up a bit of a roadmap of what features and things you want to add? Then we will categorize them and tell you which ones you can and can’t use AI on, and then just use whatever tools you want.” People often think that in the study we had to use Cursor or we had to use Claude. It’s not true. In the paper, Cursor and Claude are mentioned a lot as examples of tools that many participants used. I, for example, did not use Cursor at all. But again, it was, “Just use AI however you want on the issues that we say you can, out of the issues you already planned to do on your roadmap.” That’s all.
Abi: Out of curiosity, what is your tool of choice day-to-day for development? What tool did you use for the study?
Quentin: Yeah, I mostly use LibreChat, a local chat interface where I can directly switch between the APIs of each model. Claude and Gemini and o3 are my main three models of choice right now. And while Cursor is nice in that I don’t ever have to leave my editor, I don’t really know what’s happening to the context. It’s already a bit of a challenge, and I mentioned this in my Twitter thread, to figure out how these models respond when they start getting overwhelmed with context. I want total control over what I’m feeding the model so I can keep a live read on when it’s becoming confused and when I need to change how I’m prompting. That opaqueness in Cursor and tools like it is why I tend to avoid them.
Abi: Well, let’s get into the study itself. So, you mentioned they reached out, they asked you to make a list of your upcoming tasks. From what I understand from the paper, one of the first things they asked you to then do is provide estimates for how long these tasks would take. So, tell us about that process and that journey there.
Quentin: Sure. So, they had me write a Google Sheet of all the issues that I wanted to tackle within the next couple of months. I did that, I created GitHub issues for them, and then I estimated how long they would take me with AI and without AI. For some tasks, I am cross-stack on the AI research side, so sometimes I’m writing GPU kernels, very low-level stuff with PTX instructions, and sometimes I’m just writing unit tests and documentation. I tried to make my issues on the roadmap a smattering across these. For some of them, like documentation, AI could one-shot it, so my estimated AI speedup was very high. Others, like GPU kernels, are out of distribution for models right now, so I put the speedup as pretty low. So now I have this table of issues, how long I expect them to take and everything else, and then they go through and say, “Okay…”
I also tell them what issues I’m tackling next, the next batch I can commit to. I’ll do these next 10 issues, for example, and then they’ll go and say, “Okay, you can use AI on this one, but not this one,” and so on. Then they have a mapping, and I record my time on each of those issues with and without AI, and write very detailed notes: when I wasn’t using AI, where could AI have sped me up? When I was using AI, where did AI slow me down? Where did it struggle? Did I choose not to use it for certain things? Because I don’t have to use AI within the AI-allowed issues, for example. So, just very detailed notes, and then they were able to write the paper from all this data.
Abi: Generally, when you were putting together the estimates, what was your experience at that moment? Were you feeling pretty confident in those estimates? Or were you unsure, based on your experience with the reliability of AI tools or your experience using them?
Quentin: So, I was already doing the calculation in my head of how much of this was going to be one-shotted, because I’ve been using AI for a long time. I was an early adopter; back in, I don’t know, 2021, 2022, I started really using these heavily, and I found the failure modes that I talked about pretty early on. I knew that AI would give me a speedup on most of these issues, and I was already accounting for the sub-tasks where it would not speed me up. So, I think I did a good job of estimating my speedup.
Abi: And to add, this was on a code base you’d already been working in, so you not only had an idea of the benefit you could get from AI, but had practically applied it to this particular code base and the tasks that you were going to work on.
Quentin: Yeah, this study was perfectly representative of my day-to-day. I used AI about the same amount on the same code base.
Abi: So then you started working on the tasks. Tell me about your experience with that. As you’re working on these tasks, are you in the back of your mind thinking, hey, this is basically going as I had predicted? Did you have any insights while you were actually completing the tasks?
Quentin: I think I get into a pretty deep flow whenever I’m working, so I wasn’t writing anything down in the journal as I went, because the task would take me longer if I was constantly writing things down or pulling myself out of flow to step back and think about how things were working. But very quickly after the task was done, I would sit down and write very detailed notes, maybe a paragraph or two, about everything that could have been sped up, everything that was sped up, and things that might have slowed me down. So, just reflect and self-observe a little over what happened.
Abi: So, after you completed these tasks, what was the next step as far as your participation in this study? I know that the METR team was analyzing the video recordings of you working on these tasks. What did you hear from the METR team after initially completing your work?
Quentin: Yeah, so issues would be drip-fed both ways. This started back in the January, February time frame of this year. I would work on a few issues, then I would say, “These are the next few I’m going to work on,” and then they would say, “Okay,” and label them as ones where I could or could not use AI tools, and so on. That cycle repeated itself a few times.
We also had catch-ups almost weekly, especially early on. How are things going? How’s the study going? Can we improve in any way? They also offered several times to give us tips on using Cursor and other tools, because I think this result emerged throughout the study. It wasn’t the expected result. The expected result was, oh, of course we just confirmed what we already know, AI speeds everyone up. People say a lot that METR was trying to get hype by fishing for this result, by picking out people who aren’t good at using Cursor and AI tools and then saying, “Hey, look, AI tools are bad and we should all stop.” That’s definitely not the case.
In fact, we weren’t even asked how much we had used Cursor until the end, because they found there were people who didn’t know how to use Cursor and got slowed down specifically because they didn’t know how to use the tool. So, there was a long exit interview where we talked about our previous experiences using AI tools, and then summarized our own experiences both within the study and outside of it, and whether we improved in our ability to use AI tools over the course of the study, that sort of thing.
Abi: What was your reaction when you initially found out about the results?
Quentin: I was getting a speedup, so I was like, oh yeah, duh, I get a speedup, we expected it, and the study will be done. And then at the end, I was sent an email with the early draft of the paper and they said, “We think you’ll be very surprised at these results.” And I was. I was reading the paper myself and I was like, oh my god, it’s an average slowdown, which was totally at odds with my own experience. Because in all of my interviews and all of my issues, I was talking about how great AI was and how it was speeding me up as expected. It was very shocking to me, and I think to the study designers.
Quentin: I think everyone estimates a speedup because it’s so much fun to use AI. That’s why I talk about subjective enjoyment being very high. We all sit and work on these long bugs, and eventually AI will solve the bug. But we don’t focus on all of the time that we actually spent. We just focus on how it was more enjoyable to talk back and forth with an AI than it was to grind on a bug myself that no one else can really talk to me about.
I think the main contribution of METR is that they sat down and actually put data to the claim either way. Regardless of what people think, regardless of the subjective enjoyability of AI, if you actually sit down and record people and take data on it, there are failure modes. Some people respond, “Oh, skill issue. You just need to be better at using AI tools.” And maybe, but that doesn’t mean these people don’t exist. How long does it take for someone to become net-positive with an AI tool? How does that differ across front-end devs and back-end devs and kernel devs? Those answers are what we should be arguing about, not whether the METR result was valid or not. This is real data. These are real, good devs. We should talk about the repercussions of this.
Abi: So, you spent time really reflecting on the takeaways, the ramifications of this study, and you posted a lot of your thoughts on Twitter. I wanted to go through each of the main bullets that you shared publicly, because I thought there was a lot of really interesting insight there.
So, one of the things you said online was that you believe that AI speed up is pretty weakly correlated to people’s abilities as developers. I think you saw that one of the critiques of this study when it initially came out was people focusing on the aptitude of the developers, which you feel was properly controlled for and focused on in this study.
Quentin: Yeah, so after the study was released, I did go through and read some of the GitHub issues, because they are public and I could look at who the devs are. I don’t know their names, but I was able to look at some of the code they wrote, and I think these devs are solid. I think it’s just true that we can all think of devs who are maybe more senior, are extremely good at their job, do not use AI tools, and don’t intend to ever. If they don’t know how to use Cursor, that doesn’t make them bad devs. These people are working on open source, and another argument people make is, “Oh, if they’re open source devs, then they’re not good enough to work on closed source.” I think that’s a silly argument not worth even responding to.
But in any case, I think that the failure modes found in the study are not because people are bad devs. I think this study is more about how well people can use AI tools and how well those AI tools map to problems. That’s what I’m trying to focus on.
Abi: And one of the failure modes you talk about, and you’ve touched on this already, is this disparity, this phenomenon where working with LLM tools for coding feels easier and more enjoyable, but perhaps we collectively, we developers, the industry, are conflating some of the joy or magic of the experience of using these tools with the real, hard time savings. Am I understanding that correctly?
Quentin: Yeah. Like I mentioned before, there are tasks like documentation, unit tests, and refactoring that AI one-shots, and it feels really good because these are plumbing tasks that we all have to deal with as devs and that we don’t want to. I think the first time people feed this to an LLM and get it back in 10 seconds with the problem done, it’s just so enjoyable that you try to apply it more and more to other tasks within your workflow.
And then you’re faced with a choice. You reach a point where something is out of distribution, say you want to write a kernel, and you’re trying to use an LLM because you’re so used to feeding grunt work to an LLM. Do you try to force the LLM to do what you want, or do you stop? And there’s a sunk cost fallacy there too. I’ve already spent 30 minutes trying to get this model to even write me a scaffold for a kernel. Do I spend 30 more minutes? Maybe it’ll figure it out. Or do I accept that I wasted 30 minutes on this task that I’m already stressed about and start from scratch as a human?
These same fallacies that we deal with in everything else also show up with AI, and that’s what I’m trying to talk about. People need to fail fast. So, time-box when you’re working with an LLM: I will try to get the LLM to write me this function, but if it hasn’t in 10 minutes, I need to stop myself. I need to go write it myself, and then maybe come back to the LLM, or maybe try with another model for a few minutes as well. But we need to realize that we shouldn’t try to crank and force an LLM to do what feels good for us, which is to just crank something out.
Abi: Another point you made, which you’ve touched on, is this spiky distribution, you called it, of where LLMs are capable, what types of tasks they’re suited for. I think you obviously have your personal experiences with it, but what’s your advice for developers that are earlier in their journey right now or organizations that are trying to give guidance to their engineers on… There’s a lot of pressure right now on, “Hey, we’ve got to be using these tools. We’ve got to be using these tools a lot.” What’s your advice?
Quentin: My advice, and it’s unfortunately not super objective, is that people need to self-reflect. People need to step back and think, am I being sped up by this thing? Is this model producing good code? And the same for my colleagues: sometimes a colleague produces a PR and I’m like, “This is AI and it’s super bloated, I can tell,” and I call it out, instead of gradually accepting the PR, accepting the behavior, and going forward, where you end up with a huge, bloated code base because an LLM was trying to write kernels when it doesn’t know how to do it, for example.
At an organizational level, if I was, I don’t know, in big tech and I was a CTO deciding what tools to use within my company, there would need to be some way that I quantify how well AI speeds up each sub-part of the company. Custom models, I think, would also help. If someone can build a kernel-writing model, it’s not impossible for models to write kernels; it’s just that there’s not a lot of data for it. So, either run a study similar to this one on people across orgs, or just let self-reflection bubble up from teams to managers, technical managers, to say, “Hey, models are not really helping kernel people,” so that maybe you circulate within the kernel org chart that they should use models less or use them in this different way, just better practices. That’s unfortunately all I can say, all I would suggest. But self-reflection at the dev level is, I think, the most effective of these.
Abi: That makes sense. Just surfacing that self-reflection and feedback across and upward in the organization, and then communicating it back to everyone so everyone can take those lessons forward.
Quentin: Yeah. And this is the only reason I was able to get a speedup. I’m not some LLM guru trying to share everything with everyone else. I adopted LLMs early and then very quickly realized, oh wow, models are actually slowing me down. I didn’t time it, which is what METR came along and finally did, and I’m glad they did, but I was able to realize it by looking at myself and thinking, it felt good to finally get this kernel from a model, but the kernel is a bit worse than it should be. It’s huge, it’s bloated, and it’s a mess, that kind of thing. Realizing that myself and self-correcting is the only way I could get a speedup at this point.
Abi: So, we talked about the importance of self-reflection and sharing those learnings within an organization, but you as an individual developer, what’s your heuristic for when, “Hey, I’m going down a rabbit hole?” What are the hints and signs you look for to know this is not the right task model fit or something like that?
Quentin: I see. I can only give unsatisfying answers, unfortunately. For one, I’ve been writing AI R&D code for a long time, so I’ve written enough kernels to know when I’m not making progress towards the goal of a final good kernel or a final good parallelism strategy. Parallelism and distributed code is another thing the models are really bad at.
Being a good dev on whatever it is you’re working on, knowing what a good feature looks like in GPT-NeoX, which is the repo I was working on, and what good code looks like, makes it easier for me to realize when a model is producing bad code, bloated code, or code that will conflict with other parts of my own repo, which requires my repo knowledge to recognize.
And then in terms of the human element, it’s more: am I fighting more with the model or with the problem? That’s one way I could summarize it. If I’m trying really hard just to get o3 to understand my code base and what I’m asking it, then I need to stop.
Abi: So, it sounds like a lot of it boils down to your judgment as an experienced human developer just as if you were reviewing a junior engineer’s code, being able to actually detect things that may be going wrong, smells in the AI code, it sounds like.
Quentin: Yeah, I read the code very critically. I’m not treating what the model produces as a source of truth; I’m trying to punch holes in it. And if I can punch holes really easily and I’m constantly telling the model, “Edit this, improve this,” then I’m fighting with the model more than I’m fighting with the problem I’m trying to get the model to solve. That’s an indicator for me.
Abi: You wrote about this concept called context rot online. That wasn’t a term I’d come across before. Share with listeners what you’re talking about.
Quentin: Well, context rot is not something I invented. Context rot comes from a Hacker News thread by a user named WorkAccount2, who I hope at some point tells us who they are. Context rot is when, as you fill the context with irrelevant information, models start to get overwhelmed and confused. For example, Gemini 2.5 has really long context, so maybe once you go beyond 100K or 200K tokens, or in a multi-turn conversation once you’re at turn 10 or 15 or so, the model gets overwhelmed by irrelevant information. o3, for example, will write you diffs to make edits to the code that you’re working with, and the diffs become less and less accurate. It hallucinates within the diff about what’s already in your code or what it produced earlier, that kind of thing. So, it’s very easy to detect this.
But context rot is just that multi-turn and long context together tend to confuse models with a bunch of irrelevant information. And we can intuit this; it happens to humans too. If you read a super long passage and you’re trying to do a reading comprehension task, and the thing you need is buried within a bunch of irrelevant or confusing rabbit holes, it’s harder to pull out that fact than it would be if the passage were two sentences. Models have the same problem. Attention is super powerful, but not powerful enough to sift through a bunch of dead ends, I think. I think the attention heads get a bit overwhelmed.
Abi: So what are the implications for developers? How should developers work in a way that navigates around this problem?
Quentin: Well, there are two possible readings of that question, and I’ll answer both. One is, if you’re a developer who is a user, then what you should do is create new chats whenever the model starts to hallucinate or repeat itself. Like the diffs thing I was mentioning: you start to see signs of the model getting confused about what was earlier in the context. Whenever you see that, you should take the chat, and since models are really good at summarizing, you feed the chat into a model and say, “Summarize this entire conversation into its key points.” You take that summary, make it your next prompt, and start a new chat. And you could make this automatic, which I’ll get to in a bit.
Another option is to just open a new chat entirely and keep moving the problem forward yourself. Maybe in the very first chat you say, “Write me a unit test, make it do X, Y, Z,” and eventually you end up with context rot. Then you open a new chat with the same model, feed it the unit test you’ve been working on, and say, solve this problem. That will be much more effective than continuing to fight within the same chat.
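As a minimal sketch of the summarize-and-restart workflow described above (assuming an OpenAI-compatible chat API via the openai Python client; the model name, turn threshold, and summarization prompt are illustrative choices, not anything prescribed in the study or the conversation):

```python
# Minimal sketch of the "summarize and restart" workflow described above.
# Assumes an OpenAI-compatible chat API via the `openai` Python client;
# the model name and turn threshold are illustrative, not prescribed.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"   # hypothetical choice; use whichever model you prefer
MAX_TURNS = 15     # rough point at which context rot tends to set in

def chat(messages):
    """Send the running conversation and return the assistant's reply."""
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

def maybe_restart(messages):
    """If the conversation is getting long, summarize it and seed a fresh chat."""
    if len(messages) < MAX_TURNS:
        return messages
    summary = chat(messages + [{
        "role": "user",
        "content": "Summarize this entire conversation into its key points, "
                   "including the current state of the code and open problems.",
    }])
    # The summary becomes the first message of a brand-new conversation.
    return [{"role": "user", "content": f"Context from a previous session:\n{summary}"}]

# Usage: call maybe_restart(history) before each new user turn.
```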
Now, the second way to read this is that I’m also a model developer, so how do people who build models resolve this? One thing that would work tomorrow is restarting a chat with built-in summarization. Maybe I’m working with Gemini 2.5 Pro and I get context rot. I should be able to just click a button that says Summarize and Start Anew, and it does that for me so I don’t have to write the prompts myself.
Another deeper, fundamental question that I’m working on, or at least my team is, is how do you handle long context? Everyone’s racing towards the biggest number, one million, two million, infinite context or whatever else, but it doesn’t solve the fundamental problem. If anything, it makes it worse, because now you can jam in even more irrelevant context. So you try to strike a balance between context and RAG and other things, or maybe you have, I don’t know, meta heads within the attention, which gets even more heady within model architecture. But even today, it would work to set up a RAG system over past chats and just keep the last couple of turns of the conversation within context. These sorts of things are an intermediate-term improvement for models, hopefully.
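As a rough sketch of that intermediate-term setup, keeping only the last few turns in context and retrieving older, relevant turns with a small RAG step (assuming the openai client and an embeddings model; the model names and thresholds here are illustrative, not prescribed):

```python
# Sketch of the idea above: keep the last few turns verbatim and retrieve
# older, relevant turns with a small RAG step. All names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # hypothetical choice
RECENT_TURNS = 4                        # always keep the last few turns verbatim
TOP_K = 3                               # how many older turns to retrieve

def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

def build_context(history, new_question):
    """history: list of strings, one per past turn."""
    recent, older = history[-RECENT_TURNS:], history[:-RECENT_TURNS]
    retrieved = []
    if older:
        vecs = embed(older + [new_question])
        older_vecs, q_vec = vecs[:-1], vecs[-1]
        # Cosine similarity between the new question and each old turn.
        sims = older_vecs @ q_vec / (
            np.linalg.norm(older_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8)
        retrieved = [older[i] for i in np.argsort(sims)[::-1][:TOP_K]]
    # Retrieved snippets plus recent turns form the prompt context.
    return "\n\n".join(retrieved + recent + [new_question])
```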
Abi: I want to ask you more about your experience as a model developer and what you see as the biggest technological constraints right now to all this, but we’ll come back to that.
I want to ask you about another point you raised online, which is the idle time problem. Between developers going down rabbit holes that are actually counterproductive and then the idle time they spend waiting for the AI to actually do the work, I think it’s really easy to see how this can add up to waste a lot of time and take a lot longer than just doing the work. As a manager and as a developer, how do you think about this problem?
Quentin: Yeah, and reasoning models have made this so much worse, because the response time is now solidly outside the human attention span. As a developer myself, and this is why I said I’m being a grampy about it, I block social media when I’m at work and my phone’s on do not disturb. You have to be self-disciplined: I’m going to work on writing an email, or one of all these very quick, small tasks that pile up within my day, that fits within the 15, 20 seconds it takes o3 to create a reasoning trace and answer my question. So, that’s my answer as a dev.
As a manager, I again push the same thing and tell people, “You need to build sub-tasks.” At least think about the problem itself. If you need to be in deep flow and can’t afford to go answer emails, and you’re not a manager who has to deal with a bunch of emails, then use that time to think: was the question I just asked accurate? Is there something I’m missing? If the model answers this poorly, what will be the next question I ask? Small sub-tasks within the problem, and just stepping back. Within the research process overall, it’s very easy to go down rabbit holes. It’s very easy to spin, feeling productive, and keep solving problems that don’t push you towards your larger goal. So, stepping back and thinking, is this the right problem to solve? Am I on the right track? Do I need to try a totally different research design? These sorts of things are very productive, I think.
And I also think this will get much worse once OpenAI… All of these model training companies, except maybe Anthropic, are trying to make social media a part of their offering. OpenAI is working on creating some sort of AI-assisted social media. And when it’s right there within your terminal, it’s going to be even easier to click over, so I think people need to really buckle down now or it’s going to get much worse.
Abi: I want to move into how organizations should be thinking about enablement and rollout. One of the things I come across a lot is that when developers aren’t having as much success, or reporting as much success, with AI tools, inevitably someone puts the blame on the developer: that’s just because you don’t know how to prompt or provide the right context to the LLM.
I’m curious, as someone who’s deep in this field and a developer: how much truth is there to that for day-to-day development? How much of the variance in outcomes do you think really boils down to how effectively someone can use these tools, versus just the limits of where the technology is at, or the code base, and other things?
Quentin: It’s both. Thinking of this entire problem in terms of failure modes and then working your way up is the more productive framing. If someone doesn’t realize that what they’re asking is out of distribution for the model and they just keep prompting the model in better ways and hoping, is that the model’s fault? Is that the person’s fault? It’s both. The model should have this within its training distribution, and the person should realize, oh, the model’s not good at this; I either need to find another model that maybe has this in its distribution, or I need to help it. Models learn in context, and if someone provides a small example, say I feed it a kernel that’s similar to the one I want to write, models get a lot better, because it’s now a one-shot task instead of a zero-shot one.
So, it’s fuzzy, and I think blame is a very confusing way to tackle things. Instead, think of it as: there’s a problem, some things are out of the model’s distribution, context rot is a problem, and so on, and work your way up from these failure modes on how to deal with them in the near term until models get better. I think that’s the better way to go, but it’s everything.
Abi: And ultimately, who bears the responsibility for this? Is it a reasonable expectation that the tools themselves should be providing us feedback on when we’re going outside the scope of the distribution, as you’ve described?
Quentin: Yeah, there are some ways we could improve this. For example, Anthropic released a paper that other people have picked up where attention blocks tell you their scores, meaning they tell you what they attend to the most. Remember that reading comprehension task I gave you previously? We can tell that attention pulls out the one fact that it needs to answer the question. We could give the user those attention scores, and if there’s a very flat distribution, attention is looking everywhere and doesn’t understand anything, and that’s one signal to help people. I doubt many model providers will do this, but it is a surmountable problem, I think.
Quentin: You can also use the model’s logits. A logit is basically a score that becomes a probability distribution over all of the tokens within the model’s vocabulary that it could produce next. Ideally, you want that distribution to be as close to a delta function as possible: the entire vocabulary at zero and a spike on the exact token that should be produced next.
In addition to the attention scores I mentioned, which show where the attention blocks are actually looking within your context, you could also look at the logit distribution: if it’s flat, the model thinks the next token could be a lot of different things. This is another thing we’re not going to get, that model providers are not going to expose, because if you return full logits to everyone, people could distill your models much more easily, and model providers want to increase their moat as much as possible.
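As a small illustration of the delta-function-versus-flat-distribution idea (purely illustrative; it assumes access to raw logits, for example from a local open-weights model, since hosted APIs typically expose at most top-k log probabilities):

```python
# Given raw next-token logits, the entropy of the softmax distribution is one
# rough uncertainty signal: near zero for a near-delta distribution, close to
# log(vocab size) for a flat one.
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_token_entropy(logits):
    p = softmax(np.asarray(logits, dtype=float))
    return float(-(p * np.log(p + 1e-12)).sum())

confident = np.array([12.0, 0.5, 0.3, 0.1])   # near-delta: one dominant token
uncertain = np.array([1.0, 0.9, 1.1, 1.0])    # flat: many plausible tokens

print(next_token_entropy(confident))  # close to 0 nats
print(next_token_entropy(uncertain))  # close to log(4), about 1.39 nats
```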
So, again, there are slight improvements you can make on the modeling side, but one, they’re not really going to help developers that much; you’d have to explain in detail what a logit is and what to look for. And two, there are very strong commercial reasons why model providers won’t let you look under the hood enough to see this. So unfortunately, you need to watch for hallucinations yourself. This also differs per model, which is why I don’t want to use Cursor. I don’t want to learn Claude on its own and then relearn Claude inside Cursor, because Cursor changes the context that I provide to Claude and talks to it in different ways than I am used to.
Keep your model horizon small. I only use these few models, I don’t change very often, and when I do, I change very little at first and then gradually shift my workflow. These developer hygiene suggestions are, I think, the most effective, unfortunately. It’s just another form of digital hygiene, in the same way people say, “Be careful about social media,” and those sorts of things. Developers are now in a world where we need to be careful about AI tools and the hygiene with which we interact with them.
Abi: How do you reconcile your recommendation to pick your tools and stick with them with the reality that these tools seem to be evolving so quickly? What should developers be thinking about? Should we be sticking with past-generation models longer? What really is the level of improvement in each iteration? And what’s coming in the future?
Quentin: Well, let me clarify that. I’m not trying to claim that we should stick to old things and never switch, because there are exciting new models and I use them too. Kimi just came out with a really nice model that I want to try. I’m more saying that when I move to a new model, I do it with tasks that I know very well.
So, again, I’ll fall back on the unit test example. If I have a unit test to write and I want to try a new model, I’ll give it to the new model. Unit tests are always within distribution; there are a lot of unit tests out there. And I have my guard up, thinking, how do I talk to this model? I’m hypersensitive to failure modes during that time. Then over time, I gradually move more and more to the new model if I like it or if it does a task very well.
For example, Gemini 2.5 Pro is amazing at summarizing and explaining code. Part of that is also because it has super long context. I didn’t start using Gemini thinking, I’m going to use this as a summarization tool. I started using it for more and more things, found that it has very specific failure modes, and tried to use it to its strengths.
Another one is that Claude is good at writing very short code without a lot of comments that’s more human-readable, because their RLHF pipeline is very strong. They’re very good at producing code that reads like code written by humans. So, I use it for documentation or writing comments on code that I’ve already written. It’s very good at that.
So, it’s this tiptoeing into models: being hyper-aware of when they’re failing, whether there are commonalities in how they’re failing compared with what I’ve seen in the past with a given model, and easing into new models instead of diving in with everything, because then there’s too much moving at once. You don’t know what the model’s failure modes are, you don’t know what else changed, and it becomes a mess. That’s what I suggest.
Abi: When I speak to larger organizations who are trying to deploy AI across their developers, I often hear them lament about some of the challenges, the hidden challenges of applying AI to these larger, messier legacy code bases.
Quentin: Oh, for both legacy companies and newer companies, I think the problem is the same. It’s obvious that people should be using models for refactoring, test cases, documentation, and commenting existing code so it’s more human-readable. Any model can do these tasks today; it’s very straightforward, and any company should be using this. That’s what I think. It’s code generation, producing novel functions, where it gets harder. Within research, there’s a reason METR is focusing on research: it’s the frontier of what models cannot yet quite do, and a common failure mode of a lot of models.
In these R&D companies, you have to think much more deeply about whether AI is applicable, whether you want to pull it into your organization. There are legacy R&D companies for which using models to generate code does not make sense and will lead to an average slowdown, like in this study. So, it’s not necessarily legacy versus new. It’s more so about the tasks that you want your model to do and how to speed up your organization.
Abi: There does seem to be variance in terms of code gen, generation of code for new functionality. There seems to be a spectrum in which companies fall in terms of the level of success they’re having with that across different code bases. I’m curious, and you’ve already talked about that being a class of tasks LLMs maybe aren’t quite suited for yet, but is there anything an organization can do to maximize its chances? Or does it just come down to the models and the training data distribution, as you said?
Quentin: So, it’s not necessarily code gen, because models can generate great code for some tasks. I’m really trying to drill down on the point that tasks are how people should be approaching this problem. It’s not about a code base or a team; it’s about what tasks they’re using the model for.
Unfortunately, if a team wants to use a model for very low-level AI R&D code, your only option is to improve the model, meaning you use a fine-tuning service like OpenAI’s or produce a model in-house. Nvidia has very good hardware models in-house because, one, they didn’t want to give their hardware training code to anyone else, and two, because that’s a common failure mode for models. Those are your only options if your task is out of distribution. But again, organizations should focus on tasks, I think.
Abi: I want to transition into a question that has come up a lot these past couple of months when I talk to leaders and companies. There’s generally a sentiment right now, especially among executives, that everything we know about software development is about to be flipped upside down. Everything is changing, everything is different, and the knowledge we have about the SDLC from the past is no longer relevant. Where do you fall in terms of your point of view?
Quentin: I think that org leaders need to, again, decompose the SDLC into tasks and then determine, per task, which of those can be handled by AI easily today, because the tasks differ per company as well. And then once they find which ones: okay, the planning phase should be done by humans, but later on, when you’re actually doing the implementation and doing PR review or acceptance tests, those are sub-tasks that can be routed to models. For all of the developers on your teams, that should be the first thing they do: try to route the task to a model, try to get the model to produce an output. Then you’re going to have a bunch of tasks that aren’t mapped to models today, and you just reevaluate over time. In the next 18 to 36 months, when a new model comes out, is it able to pick up this task that’s on my big list? The coverage across the tasks that are achievable by a model will increase over time, is what I expect.
But this is something everyone can do today, I think, to map the SDLC to tasks, map the tasks to AI ability, and then readjust as new models come out.
Abi: What’s the best way to do the latter part there, of mapping task to model ability? So, let’s just take unit tests for Ruby on Rails. How do you go about saying, “What is the ideal model for that right now?” Is this just experimentation?
Quentin: You could try to distill from past human-created Ruby on Rails unit tests within your org that you like, and then do A/B testing. You could do this very quickly within your team: spend 30 minutes, look at the model output, look at the unit tests that you wrote yourselves. You would just feed the model a function and say, “Write me a unit test,” and see if it was able to pick up the real-world cases that your existing tests already cover. So, you can try to make it quantitative on the fly, but it’s really task-dependent. It’s hard for me to give sweeping answers.
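As a rough sketch of what that quick A/B check could look like (assuming an OpenAI-compatible chat client; the model name, prompt, and file layout are illustrative choices rather than anything prescribed here):

```python
# Sketch of the quick A/B check described above: feed each function to a model,
# ask for a unit test, and save the output next to the human-written test for
# side-by-side comparison. Model name, prompt, and paths are illustrative.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # hypothetical choice

PROMPT = ("Given this function, write a unit test that covers its "
          "real-world edge cases:\n\n{source}")

def generate_test(function_source: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT.format(source=function_source)}],
    )
    return resp.choices[0].message.content

def run_ab(functions_dir: str, out_dir: str) -> None:
    """For each function file, write the model-generated test out for manual
    review against the existing human-written test."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(functions_dir).glob("*.rb"):
        generated = generate_test(path.read_text())
        (out / f"{path.stem}_model_test.rb").write_text(generated)

# Usage: run_ab("app/services", "tmp/model_tests"), then compare with your
# existing specs and note which real-world cases the model missed.
```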
Abi: And today, how much variance do you think there is in the different popular models? If we’re just looking at this OpenAI model, this Google model, this Anthropic model, does it really matter that much for the Ruby on Rails tests?
Quentin: Yeah, unfortunately, what’s within and out of distribution is very model-dependent, because it depends on what they trained on. Every company has different code bases within their data sets, and those code bases have more or less of whatever task the user finally wants. Teasing out how much coverage is in the data set is what I mean when I say that when I go to a new model, I give it more and more tasks to see what’s within and outside of its distribution. And this also changes between models within the same company. So, Claude 3.0 versus 3.5 versus 3.7, there are differences in their post-training data sets, and across generations in their pre-training data sets. Grok 4 and Grok 3, for example, are very different in what they can cover.
So, unfortunately you have to evaluate everything. But this ties back to what I said earlier about finding quick, verifiable tests within your organization, like comparing the Ruby on Rails unit tests with the old ones. Have an A/B test set up that you already trust so you can quickly evaluate new models and models from competing AI companies. I think that gives the most bang for your buck in terms of time.
Abi: Last question I have for you today. We’ve talked about a really practical approach to deploying models at companies: as you said, first map the work to tasks, then map the tasks to models. What are your thoughts on, and this is a little separate, the multi-parallel agent paradigm of development?
Quentin: My take is that I think when navigating a huge space like this, it helps people to be very grounded and to not follow extremes on either side. So, agents will not take over everyone’s jobs, nor are agents some totally spurious, silly thing that everyone’s jumping on.
In terms of agents themselves, I think they will be very important going forward. I’m working on them. Being able to actually offload an entire process, instead of just prompting a model, is super important. I think the technology is not quite there; getting tool use and computer use figured out for models is a really hard problem, and I have had pretty negative results myself when trying to use them today. This will change, but I think we’re in an early-ChatGPT-moment kind of thing, where the distribution of tasks these things are useful for is still pretty small.
Abi: Quentin, thanks so much for your time today, talking through your experience in the METR study as well as your perspectives on AI and engineering. Great conversation, and I think listeners will find this really valuable.