Brian Scanlan:
Thank you. Cool. So great to be here. I’ve had a great day and this is the coolest room I’ve ever given a talk in. Such an amazing view. Yeah, so I’m from Intercom. I’ve traveled over from Dublin in Ireland to talk at this conference, but also I kind of feel like I’ve traveled from the future. Looking at many of the talks and from talking to other folks over here on my visit, I think Intercom’s a little bit advanced compared to where most people are at. Although everyone is pretty much working on the same stuff as well, which is really interesting. So I’m going to be talking through what we’ve done, how we’ve achieved doubling the throughput of our engineering team. And I’ll be pretty open and honest with some of the information, including things that didn’t work. I will go through some thought leadership-y, engineering-leader sort of content, but I’m also going to go into some detail on the actual Claude Code skills and things like that that we use.
So Intercom is a 15-year-old Irish American B2B SaaS startup and we’ve pivoted kind of like everyone else, but we’ve done it pretty successfully and aggressively towards being an AI company. So on this graph here, this is like the growth rate of SaaS businesses kind of in our cohort. And you can see like SaaS not doing great generally, but Intercom is doing great. We’ve done like turned around growth rates and we’re certainly bucking the trend that SaaS businesses have been suffering from recently in terms of growth and valuation and things like that. Additionally, we’ve been a bit of a poster child for companies redefining themselves in the age of AI. New York Times recently wrote an article about SaaS companies reinventing themselves that prominently featured Intercom.
And being an AI company means a lot more than just like slapping together some wrappers around your existing product. We built an AI agent for customer support, effectively completely displacing our previous product, which is a help desk. We’ve got 8,000 customers, 100 million revenue. Growth rate is like extremely positive, as you can see in the previous growth metrics. And companies like Anthropic, Snowflake, Linear, LaunchDarkly, Glean, they all use Fin, our customer support agent, to support their customers. And we recently announced that we have our own models serving 100% of Fin. These are displacing use of the frontier models. So we’ve done benchmarks, comparisons, and our own trained models are working cheaper, faster, and better than the likes of Sonnet or whatever. And we’re also happy to sell direct access to our models that are authorized for customer support.
But I’m not talking about any of this today. I’m talking about the work we have done to push for AI adoption and improvements in how much we ship and the quality we ship and basically everything. So I’m at Intercom 12 years. I’m in our platform group effectively, and we take care of Intercom’s uptime, performance, security, cost management, observability. We love these monolithic applications. So we have these giant Ruby on Rails apps and JavaScript apps. And my team, my groups, also take care of internal developer productivity. And we are obsessed with shipping. This is a nice Honeycomb sticker of a blog post that we wrote many years ago. And developer productivity is something we invest a lot in. And so obviously for the last few years, I’ve been spending a lot of time working on enabling use of AI in our software development life cycle and beyond.
So given that we’ve been so AI positive and pivoting our company, unsurprisingly, we’ve been very excited and impatient about AI changing work and adopting AI across specifically like building product, building features. And we kind of went on a normal journey, I think, of adopting the likes of GitHub Copilot and bottoms-up kind of people using Cursor. And we got some results, some improvements, throughput metrics, different things would kind of go up, but we were ultimately dissatisfied with the results. We were aware that the models are only going in one direction, and also the harnesses, and we’ve strong conviction that there’s huge potential; we have no doubt that AI is going to transform how work is going to be done. And so we wanted to do something about it and not just wait around.
So we set a simple goal. About nine months ago we set a goal that we would double the throughput as measured by pull requests, like merged pull requests per head of R&D. So if we grow the number of people in R&D, we should expect that number will kind of go up as well. And so we could argue about whether it’s a good metric. I guess every measure is a bad measure once it’s a target. And pure throughput isn’t perfect by any means, but it’s also completely reasonable. And I think you should not be ashamed of picking a throughput measure because just adopting AI and being able to get a lot more done, this should just naturally result in greater throughput through the system. And 2X actually as well, it’s also kind of not very ambitious. We think that doubling throughput is like table stakes. Really we’re going to be looking at a 10X or 100X throughput increase.
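The goal described here normalizes merged pull requests by R&D headcount, so growing the team does not count as progress on its own. A minimal sketch of that calculation, assuming made-up figures (the talk gives no absolute PR or headcount numbers, and the `Period` type and function names are illustrative only):

```python
from dataclasses import dataclass


@dataclass
class Period:
    merged_prs: int  # merged pull requests in the period
    headcount: int   # people in R&D during the same period


def prs_per_head(period: Period) -> float:
    """Merged pull requests per head of R&D for one period."""
    if period.headcount <= 0:
        raise ValueError("headcount must be positive")
    return period.merged_prs / period.headcount


def throughput_multiple(baseline: Period, current: Period) -> float:
    """How many times higher per-head throughput is now versus the baseline."""
    return prs_per_head(current) / prs_per_head(baseline)


# Hypothetical numbers, chosen only to illustrate a 2x result.
baseline = Period(merged_prs=3000, headcount=300)  # 10 PRs per head
current = Period(merged_prs=7000, headcount=350)   # 20 PRs per head

print(f"{throughput_multiple(baseline, current):.2f}x")  # prints 2.00x
```

In this illustration raw PR volume grew about 2.3x, but the per-head figure grew exactly 2x, which is the number the goal tracks.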
When you connect the dots, whatever, it’s like this 2x goal, maybe we under kind of sold it or we could have gone something a bit more aggressive. But anyway, we picked the goal. We also were fortunate enough that there was a big inflection point. It’s been referenced in a few talks here today. Around the time, like December or so, with the new models and this noticeable shift in model capability and tooling just overlapped with some of our 2x push here, which was pretty nice. But change is hard and it’s not just a case of setting a goal or picking the one tool or there’s no one thing that will kind of get us to success. And so we changed a lot of things. We took a lot of action to make sure it was very clear to people what we were trying to do. And so we updated job descriptions and expectations of engineers and designers and product managers.
If you aren’t using agents in your work, you are not meeting expectations. So that’s like the stick. But we have plenty of carrot as well. We reward and promote through spot bonuses, through like social kind of proof. We praise people who are acting as true flywheel creators and helping others be successful. We make good space for people to grow and to try things out. We did hackathons and enablement days. We staffed this stuff full time as well. You need to be able to support people so that they can do great work and get over all of the kind of initial barriers that people might run into, or just like day to day kind of use of the tools. They’re unstable. You need to do a lot of work to get these things working well. And leadership is basically talking… Or staying on message, saying the same thing over and over and over in every single forum.
And we did that and we were very clear about what we’re trying to achieve and that people couldn’t hide from it or only kind of half-ass their adoption of AI. We also picked a platform. So we standardized on a tool. In our case, it was Claude Code. I don’t think that was controversial and it doesn’t sound very controversial, but the main thing is that you pick one tool. It doesn’t matter what you choose. We happened to pick Anthropic’s Claude Code. We think of it as not a tool, but a platform. And by that I mean, to get the most out of the tools, you have to let go of model anxiety and comparing which one’s better and all that. That’s sort of interesting, and you want a few people who are doing that in your org, but most of the benefits are not from the actual models or the harnesses themselves.
They’re from all of the context and information and domain-specific knowledge and guides and skills and all of that stuff, all of the glue work, which is specific to your environment. That is the stuff that unlocks huge levels of performance, accuracy and productivity with these tools. Pick one. And like we still have a few people using Cursor and that, but everybody knows that we are investing in Claude Code being the platform and building out skills and things like that. And our vision is that we want to treat Claude, or build Claude, with all of the appropriate skills and everything such that it can take on any work as a senior engineer. So this means connecting it up to absolutely everything, onboarding it, teaching it about everything that’s specific about our environment, Rails conventions, React patterns, testing standards, whatever, and train it. Train it as if you were onboarding a new engineer into the environment.
And so we think that adding like all of these skills and all these capabilities when combined together, Claude Code will be able to take on any bit of technical work that engineers are doing in Intercom today. And you know, we want this platform to be self-updating. We add this into our guidance and skills and stuff to make sure that if new knowledge is discovered or something novel kind of happens when doing a task, update the skill, update the knowledge and try and get these things automatically. So yeah, we strongly believe that all technical work is going to become agent first as quickly as possible. All of the stuff, the gunge, all of the messy kind of things that we do as part of the like shipping product should now be condensed down to like this nice, simple… Claude is basically doing everything and we’re kind of interacting with it.
And we think as well, like if the models and harnesses don’t improve, like they just stay exactly where they’re at today, something that’s absolutely not happening, but we have the building blocks that are good enough today to get through basically all technical work and make it such that they’re agent first. And so I wrote down some principles. These are like a sneak preview of some principles. We’re going to publish them publicly pretty soon because like again, you’re trying to convince hundreds of people to change how they work and how to think about how they work and where they should be applying their attention and focus and stuff. And so these principles, these help people kind of decide and understand like what we’re trying to achieve. In this case, like all technical work is becoming agent first. We repeat this a lot. It means that like everything that you can do in your laptop, like absolutely every single bit of work, an agent must be able to do those things.
And so having these kind of clear guidance and expectations about how we go about the work unlocks a lot of like capabilities or people like know that we want them to implement APIs or MCPs or CLIs or whatever to allow the agents to do the work that we do. We also want people to run less software and that our focus needs to be on the kind of evergreen specific capabilities. I think the Airbnb talk made a kind of similar point on this. The space is moving fast. There’s many, many new features and you kind of do need to plug in things like glue work… The glue things together or whatever. But these technologies are moving so fast, you need to like aggressively deprecate kind of custom things you do and we want people to be focused on evergreen, durable things, which is largely unblocking the agents from getting access to things or knowing what to do then.
So providing skills, guidance, that kind of stuff. That stuff is way more valuable and durable, has lifetime value far beyond say vibe coding your own multi-agent orchestrator or whatever kind of fancy things that you can do out there. It’s like, “We want people to be focused on the core like knowledge and tools and access and skills that senior engineers need to do,” not on the kind of stuff around that. We consider ourselves to be leading AI and engineering and we don’t want to get stuck with a lot of custom stack, a load of our own kind of monolith style workflows or capabilities that would cause us to kind of get stuck and left behind. We also want people to like work with agents, not in the kind of telling the agents what to do. It’s more like, “Yeah, tell them your problems. Share your problems with them.” And often a lot of the time at the moment, it’s like… We’ve written hundreds of skills and a lot of people just tell… Like show up in Claude and go like, “Hey, run this skill to do the thing.”
And it’s like mostly fine. I still do it myself, but really, you want to be able to just describe the problem to the agent, so that the agent figures out from the list of skills available and stuff what to do, and it plans it out. And I had a nice example of this recently where I was paged into a security incident. Somebody had accidentally published a bunch of Snowflake table metadata into a GitHub repository that was public and I just habitually joined a Slack channel, opened up Claude, said, “Hey Claude, take a look at the Slack channel.” And I went back and I was looking at the details of the incident and chatting on Slack. And then like two minutes later, Claude came back to me and said, “Hey, I’ve figured it out.” And I didn’t know that we had a skill that had been defined by one of our engineers which used our kind of breach criteria and runbooks and policies and all of this to analyze all of the files that were part of the breach.
It used all of the information that was encoded in the skills from all of our policies and… Like classifications of breaches and that. And it basically just listed like, “Here’s exactly what the story is, here’s next steps.” Turns out it was a bit of a no-op. And that removed about 20, 30 minutes of kind of boring work to be honest, just kind of checking this stuff manually. It’s not exactly rocket science or that interesting, but completely necessary. But the fun part here was like, I didn’t know that that skill existed. I didn’t tell it what to do. I just told it that there was a bit of an incident going on and it figured out itself what to do. And just getting these aha moments really justifies our approach here. I think by writing many, many of these domain-specific skills or task-specific skills, we’re getting a lot of value. In this case, like I said, I didn’t even know that this skill existed, but it did the right thing and figured it out.
And that’s what we want to see across all work. Even in Intercom, AI adoption is unevenly distributed. We’ve got teams, people, all at different levels of maturity and we do a lot of work, enablement work and that, to help them understand or give them space to do things. But I think Steve Yegge recently released or talked about a maturity rating for engineers, about like how AI-pilled they are. And this is how I think about this internally inside of Intercom. It’s similar enough, but kind of goes in a little bit of a different direction. And depending on the task and stuff, sometimes I would like regress. Sometimes I’m at step three or four or something. But the first few steps are definitely like people just adopting, getting used to these things and then starting to only use the tools to produce code.
But then later, the way we want people to work with this stuff is use Claude Code for everything, automate your work, write skills, write really good skills, update the skills, and then you start to optimize the environment for the agents. And we’re starting to see this in our software architecture, in different decisions that we’re making: we’re starting to bend or kind of shape our environment so that agents just have an easier time and you can get more done. And so an example of this would be, we’ve dealt with areas where you might be plugging in a load of different, say, third parties, like say messaging providers, and they all have kind of similar APIs, but they can differ in implementation and details. And so we’ve had a lot of success by building these really great solid cores and then being able to rapidly let agents do all of the annoying work of looking up docs and integrating with SDKs and all of that stuff at scale and plug into this solid core.
And it’s probably good software architecture anyway, but it allows the agents just to move superfast here and get like really, really, really consistent output. So here’s where we’re at. So actually, we met this goal in the last few weeks. So we doubled PR throughput in nine months and we actually tripled PR throughput in 16 months, which was… We only kind of recently realized this. You can see a wild inflection point as well around December 2025. Again, we had the agents getting better, harnesses getting better, but that was also the same time that we decided, “We’re going all in on one tool.”
We had built out the full-time team to support it, setting up skills and setting up all of the things to make things really, really easy for people to adopt. So today, yeah, PR throughput is well over double what it was nine months ago, and we see no sign of this stopping as well. This is continuing to go in this direction.
Some data I pulled out last week, so we’re at 95.9% of pull requests being authored by Claude. We also have the bottleneck of approvals, and so we’ve been doing something about that. We now have Claude-approved, like fully agentically approved pull requests. At the moment, it’s around 17%. We basically work with Claude to define what’s extremely safe and we’re confident that these pull requests can go to production without any additional oversight or without having to force a human to be in the loop on it. And we’re continuing to work on this. We want over 50% of our pull requests to be fully approved automatically. And we’ve been doing work, working with our auditors, making sure that there’s no risk here to SOC 2, ISO 27001, et cetera.
But the thing about these approvals, I think they’re actually being done better, like a higher standard than humans would have done. They’re consistent, they never forget anything. And I think as well, just like water flowing down a hill or something, we think the work will bend towards the path of least resistance. And so, when people see the shape of changes that they can make and get automatic approvals, they’ll shape all of their work towards it.
The pull requests have to be small in terms of code changes, feature flags used whenever possible, metrics available, observability things. Basically, they just have to follow all of our best practice that we would consider to be good. And the description of the change actually matches what the code does. And so, we expect this number to increase a lot over the next while.
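The criteria listed here (small change, feature flag, metrics available, description matching the code) can be read as a checklist that a pull request must pass before it is approved without a human. A minimal sketch, assuming a hypothetical `PullRequest` record and a made-up size threshold. None of these names or values come from Intercom’s actual tooling, and for simplicity the sketch treats the feature flag as required rather than “whenever possible”:

```python
from dataclasses import dataclass

# Hypothetical threshold; the talk only says changes must be "small".
MAX_CHANGED_LINES = 200


@dataclass
class PullRequest:
    changed_lines: int
    behind_feature_flag: bool
    has_metrics: bool
    # Assumed to be judged upstream, for example by a reviewing agent.
    description_matches_diff: bool


def auto_approval_blockers(pr: PullRequest) -> list[str]:
    """Return the reasons this PR still needs a human reviewer."""
    blockers = []
    if pr.changed_lines > MAX_CHANGED_LINES:
        blockers.append("change is too large")
    if not pr.behind_feature_flag:
        blockers.append("not behind a feature flag")
    if not pr.has_metrics:
        blockers.append("no metrics or observability for the change")
    if not pr.description_matches_diff:
        blockers.append("description does not match the code change")
    return blockers


def can_auto_approve(pr: PullRequest) -> bool:
    """A PR is eligible only when no blocker applies."""
    return not auto_approval_blockers(pr)
```

Returning the list of blockers, rather than a bare yes or no, tells the author how to reshape a change so that it qualifies, which fits the point about work bending towards the path of least resistance.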
On the skill invocations, so we’ve been writing hundreds of skills, lots of people using them. We kind of measure these things a few different ways. We send all metadata about sessions to Honeycomb, so we can see exactly who’s calling what, different bits of information about individual Claude Code sessions. And this is internally available and we can go into these dashboards and look at them. But we also pull out session transcripts. So, we hook into Claude Code, using hooks and things like that, to copy the session transcripts for every single session into an S3 bucket. We anonymize them and we then data mine these for insights, or they’re just very handy for tech support as well. We have hundreds of people using these tools. When something goes wrong or something goes awry, we want to know about it and it’s really, really useful to have the session information there to hand so that we can proactively go after or improve things by looking directly at the session data and not just relying on the human to tell us what went wrong.
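The transcript pipeline described here (hook, anonymize, copy to a bucket) might look roughly like the sketch below. It assumes the hook is handed a JSON payload containing `session_id` and `transcript_path` fields. The email-only redaction rule and the local staging directory are illustrative stand-ins, and the actual S3 upload is deliberately left out:

```python
import json
import re
from pathlib import Path

# Illustrative redaction rule: replace anything that looks like an email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text: str) -> str:
    """Redact email addresses from one line of transcript text."""
    return EMAIL_RE.sub("<email>", text)


def stage_transcript(payload_json: str, out_dir: Path) -> Path:
    """Write an anonymized copy of a session transcript, named by session id.

    payload_json is the JSON the hook receives (assumed to carry
    'session_id' and 'transcript_path'). A real pipeline would then
    upload the returned file to a bucket with an S3 client.
    """
    payload = json.loads(payload_json)
    source = Path(payload["transcript_path"])
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{payload['session_id']}.jsonl"
    with source.open(encoding="utf-8") as src, target.open("w", encoding="utf-8") as dst:
        for line in src:
            dst.write(anonymize(line))
    return target
```

A hook command could call `stage_transcript(sys.stdin.read(), Path("/tmp/staged-transcripts"))` when a session ends. Anonymizing before the file leaves the laptop keeps raw identifiers out of the shared bucket.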
This graph is maybe one of the most interesting ones. This is our defect list. We’re not particularly proud of this ever-growing defect list. This is over the course of a year or so. And we haven’t put a huge amount of focus on this. And this hasn’t been a goal or target, but teams are getting through more work and they’re getting through more defects. And some teams have been pushing for things like defect-zero. And so, there was a question asked at another talk like, what do people do with the time that they’ve gotten back by using these tools? And well, one thing that they’re doing at Intercom is closing defects. And so, we’re actually down now like over 50%. I think this is a little out of date. This is like a month old, which is wildly old. And we’ve removed more than 50% of defects from the peak whenever it was, in January or something like that.
We’ve also been looking at some other data. We were worrying about, say, code quality and we’ve been partnering with a research group in Stanford. We ship them all our code and they give us some metrics. And over the course of this year, code quality did start to go down a bit based on rework and complexity and a few different metrics that they’re looking at, but then over the last few months or weeks, it’s gone in a completely opposite direction. The average additions that we’re adding to our code bases, they’re improving the overall quality of the code base.
Again, it wasn’t a specific target. It’s definitely something we’re interested in, but it was amazing that this just got turned around by us working with the agents, giving better guidance, linting, all of the guardrails. And we’re seeing this like objectively, we have data from researchers that shows that things are going in the right direction, which is really interesting.
Other stuff that we’ve been looking at as well has been time from initial code being written to the time when a feature is announced, and that’s been compressing as well, which is interesting. And yeah, these are trends, these are not targets, these are just things that are happening in our environment. And I’m very interested in maybe pushing more teams to go after defect-zero.
So, going through like our Claude Code setup, we have dozens of plugins, there’s like 42 plugins and hundreds of skills. And we’ve been breaking down access and giving access to as many things as we can. So, anything I can access on my laptop, the agents must be able to access as well. And so, that means like fully production, full data, everything. So, we have core plugins, plugins that are there to like make sure Claude Code is working exactly the way we want it to. And we have extremely high-quality skills that are used often by like say every single software engineer. And we take time to make sure that these are high-quality and running evals against them and really push them on the quality side.
But we also want it to be easy for people to distribute skills, especially on teams and things like that. A lot of the work is just like local to teams, and we can have a more relaxed quality bar when it comes to non-core things as well. We don’t want to be like gatekeeping too aggressively here. We want people to try things out. And then, maybe if skills get adoption or we can use the session data or telemetry data to see what’s being used and then where to invest and find gaps and issues with them.
Yeah, we have hundreds of people contributing to these things, thousands of changes going through, loads and loads of eval test files. But the main thing is we have an extremely high-quality bar with any individual skills. Skills need to be small, composable, testable, and not just trying to do a bunch of open-ended things, and we really, really push for that quality bar on, especially our core skills. You can’t build a senior engineer out of a bunch of skills and tasks that are only, say, good 50% of the time. That would not be a high-performer on your team. And so, really sweating on the details and quality is absolutely critical for widespread successful adoption of these things.
So, just looking through some of our core skills, this is these base plugins. So, everybody gets these on their laptops and we force install these things. We actually bypass the Claude Code update mechanisms and stuff for getting these skills and things onto people’s laptops. We use our IT systems to publish all of the plugins directly down to people’s laptops, as well as Claude Code configuration or things. This bypasses all sorts of issues with Claude Code and makes sure that we know exactly what people are running.
And then, yeah, the base plugins we run absolutely everywhere do all the telemetry work and safety things, just generally making sure that everything is well set up and the configuration is in place. And then, we just have dozens and dozens of skills that do individual tasks and do them well. Fixing flaky specs, that is, fixing flaky tests. It’s a skill I wrote, which does a really, really world-class job on fixing tests that are intermittently failing. We have hundreds of thousands of tests. Our tests run thousands of times a day. And so, you just end up with a lot of these flaky tests. And we are shipping so often, it doesn’t like block shipping, but it slows things down. It’s a bit annoying.
And we care about these things. We open issues for them, but like, we don’t… And we aggressively skip them as well. The value of any individual test is actually pretty low. But this is a skill that I pretty much iterated with Claude Code to generate in a feedback loop, just got it to fix dozens and then hundreds of flaky specs. And so, I didn’t write all this down upfront. I didn’t design this whole skill or whatever, I just iterated with the agent and gave it a goal and gave it lots and lots of data. And we have thousands of these issues from historical data. And then, you end up with this extremely detailed step-by-step, play-by-play, here’s all of the different individual steps you have to do to like do a world-class job at like recreating an issue, finding what it is.
And then, also, you end up with like classifications. So, these classifications, these are total cheat codes and it’s like how we work as well. If I join an incident, the first thing I’m thinking is not like, “Okay, I’m going to start from first principles and work out what’s going on.” I kind of just go, “Does this look like a database issue? It kind of smells like it is.” And you get to faster outcomes with these cheat codes. And again, this is Intercom-specific context. These classifications, most of them are universal for any Rails app, but they’re specific to our environment, and the skill as well has got it in it. It’s like, “If you learn a new classification, something new comes up, update the skill.” And so, the skill is self-updating in that way. And now, it can resolve basically any flaky test really, really quickly, really, really accurately and at the standard of like… Way better than I could do, like the standard of our best Rails engineers.
Other stuff that we’re doing. So, I’ve been talking mostly about the R&D org here. Claude Code’s gone viral across all of Intercom. We have over 1,000 weekly users. Something like 80% of people at Intercom are using Claude Code every week. And yeah, just democratizing access to the data, and people are just doing way more analysis, getting way more information about customers or things using Claude Code. It’s been just absolutely going wild and viral beyond engineering.
Other things we’re doing, say right now, replacing all runbooks. We don’t want humans troubleshooting outages or dealing with alarms; it must be agent-first. We’re working on remote agents, similar to, again, what Airbnb are building. We want to move this work away from people’s laptops and get next levels of throughput and scalability through that.
Yeah, like everyone else, we’re also worrying or thinking about, “What does this mean for our jobs and roles and all of this?” Things are just merging, but I think it’s too early to make any big decisions, but we’ve been running experiments and trying things out to see what this new world looks like. It’s certainly, I think the planning, team organization, everything’s all going to get changed over the next year.
Anything else? Oh, yeah, I’ve been working as well on actually shipping products, like shipping features. And it’s amazing. I’ve just been using Claude Code skills to do all of the product management and all of that stuff. I wouldn’t have gone near this work ever in the past, but now I’m able to ship stuff to the internet. So this is not directly related to the AI productivity stuff, but it’s just me personally, I was able to move my role to be more doing product management stuff and this, that and the other, and released a cool CLI for Intercom. So that’s the talk. I wish you all the best of luck. If you aren’t doing pretty much all of these things today, you’re going to be doing so in the near future. We published an update, like the very first part of a blog post series on our 2X project.
So there’s more detail about the stuff here. It’s on ideas.fin.ai. Fin.AI/CLI is where my little CLI is. And I’ve got links to my talks and stuff like that on brian.scanlan.ai. Yeah. I’m around for the rest of the day as well. More than happy to have chats. And yeah, more than happy to answer questions as well for the next few minutes.
Speaker 1:
Okay. So good, Brian. The chat is blowing up. People have lots of questions for you. When I was off-stage, I had asked you, were there questions that you were hoping got asked? And you offered to tell us about your costs. So I know people in this audience want to know how much this is costing you and token spend and everything. So what does that look like?
Brian Scanlan:
Yeah. We had, I think two weeks ago was our highest week so far. So like everyone else, our costs are like that. And we had a nice round number. It was like $128,000 a week. And yeah, we had like two quiet weeks because of the Easter break over the last while. So it’s kind of sort of stabilized. But like I’m pretty sure like next week it’ll be 150K and then next month it’ll be 200K. This is going in one direction.
Speaker 1:
150K per week. And you said you have roughly a thousand developers?
Brian Scanlan:
No, we have a thousand people using Claude Code. We’ve got about three, 400 developers.
Speaker 1:
Got it. Got it.
Brian Scanlan:
And like some developers are spending 20K a week or a month. Yeah, it’s adding up.
Speaker 1:
And you don’t have any caps in place of like, hey, you can’t… Once you get to this spend, you’re cut off. It’s just use it, do what you can.
Brian Scanlan:
Yeah. We have like infinite spend. It helps that we have our finance team using Claude Code, so they’re like, “Oh wow, this is cool.” And one of the interesting things we published today in our blog was like the cost per pull request has gone down. And that’s like the way to kind of think about this stuff. It’s not just like, oh, this cost and this cost is annoying. I think you should be thinking about it as like a cost that’s additional to human salary. But like, yeah, we’re getting through more work cheaper than before. Of course it’s a big bill and we do want to optimize it and we are going to be doing some stuff to like go deeper into optimizing our spend. But right now we just want… I think the opportunity cost is greater that like just get out of people’s ways, let them do whatever and pay the bill.
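The point that the total bill can grow while cost per pull request falls is simple arithmetic. A minimal sketch: the $128,000 weekly figure is from the talk, while the PR counts and the earlier week’s spend are hypothetical, since no such numbers are given:

```python
def cost_per_pr(weekly_spend_usd: float, merged_prs: int) -> float:
    """Token spend divided by merged pull requests for the same week."""
    if merged_prs <= 0:
        raise ValueError("merged_prs must be positive")
    return weekly_spend_usd / merged_prs


# Hypothetical earlier week: lower spend, far fewer merged PRs.
earlier = cost_per_pr(60_000, 2_000)   # 30.0 dollars per PR
# Peak week from the talk ($128,000) with a made-up PR count.
recent = cost_per_pr(128_000, 6_400)   # 20.0 dollars per PR

print(f"spend up {128_000 / 60_000:.1f}x, cost per PR ${earlier:.0f} -> ${recent:.0f}")
```

With these illustrative numbers spend roughly doubles while the unit cost drops by a third, which is the framing suggested here for judging the bill.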
Speaker 1:
Pay the bill. You heard it here first. Okay. Well, some of the other questions that are coming through are: what happens if some of these tools change their pricing model? What happens when the subsidies go away? What happens if they 10X their cost next week?
Brian Scanlan:
Yeah, I think there’s real business continuity risk in how aggressively we’ve been adopting the tooling. So even uptime is an issue. I think no one who has used these tools recently hasn’t run into some outage over the last while. And like we’re doing basic stuff there. We can fail over to another provider, we can move to Bedrock, AWS hosted models. And at some stage I think going multi-provider is reasonable to think about from a business continuity perspective and the kind of like, yeah, what happens if Anthropic falls off the internet or what would it take to kind of move over? And then I think of course as well, the open source models, they’re probably okay. I got to assume they’re probably going to be okay enough in the next while to like be at the standard where Claude Code is today. So if, yeah, if things go down like a not very nice path in terms of costs or availability, I’m pretty convinced that like the open source models will be good enough to at least do the basic functions. So that’s like an insurance policy, I guess.
Speaker 1:
Always good to have an insurance policy for sure, especially in today’s age. Tell us a little bit more about how you are stress testing these skills. How do you audit them? How do you make sure that they’re up to date? How do you make sure that a junior developer who just started at the company doesn’t create a skill that everyone starts using and is completely wrong?
Brian Scanlan:
Yeah. Paying a lot of attention to the quality and outputs is important. We’ve had that exact situation where, say, less experienced people are trying to automate some work. And like my favorite story is we had some, I think it was some interns that were looking at like resolving some exceptions, like errors in our application, and Claude Code is very literal. It’ll do what you tell it to do and they kind of told it like, “Hey, fix this exception.” And the exception was coming out of a custom emoji in a Discord message and it was just kind of blowing up on the custom emoji and Claude Code like fixed the shit out of it by like dropping the message if it had a custom emoji in it. And I was like, “That’s not really what you should…” And yeah, so it’s a bit of a silly story, but the point is that like you really have to understand like the… Like be able to judge the output of it.
And so we’ve socialized that as an issue like internally and we talk about the quality aspect a lot, but also we don’t want to gatekeep. We want people to do, like we do want our interns or whoever to be knocking out some skills, getting some usage out of it because like we’re all on a journey and you have to learn, you have to get some stuff in use. And so that’s why we have things like our plugin setup, where people can publish these things, say, locally to their team or whatever, and that kind of minimizes the potential damage. But then when people want to do a better job, or they notice maybe it’s not working right, like we’re there to help. So we can help guide them around what good skills look like. And we have skills that grade skills and probably a skill grading, not skill grader.
And, you know, you can get from a half-working skill to something that works really well, like very fast. And like we aggressively use evals and tests to make sure that they’re doing the right thing.
Speaker 1:
We have skills for skills, we have AI for AI. We’ve got all the things. Okay, maybe one last question before we wrap. Something that we’ve heard in a couple of talks today is this concept of being able to use agents and AI to help auto approve PRs. You mentioned that you all have about a 50% response rate and that’s still a work in progress to make that even higher. What are the criteria of what gets to be auto approved versus what doesn’t?
Brian Scanlan:
Yeah, so the pull requests have to be small, so I think it’s… And it’s like pretty strict at the moment. It might even be as low as like 50 lines or 20 lines or something like that. So we just simply refuse to automatically approve anything that’s like particularly large. And if something doesn’t have a test, we’re not going to approve it. We do have like sensitive code areas, there’s definitely different parts and we kind of give like high level guidance to Claude Code, say like, “Yeah, if it’s in a high volume code path or it’s a critical database transaction or something like that, just don’t approve it.” And that’s kind of it. It’s not that sophisticated, but we built the process to develop the criteria by analyzing like hundreds of thousands of actual pull requests and picking out what like good looks like.
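[Editor’s sketch, not Intercom’s implementation: the three criteria just described (small size, has a test, not in a sensitive area) can be pictured as a simple gate. The names `PullRequest` and `auto_approve_decision` and the 50-line default are illustrative assumptions; in the talk the sensitive-area judgment is made by Claude Code from high-level guidance, whereas here it is reduced to a precomputed flag.]

```python
from dataclasses import dataclass

# Assumption: the talk only says the limit "might even be as low as
# 50 lines or 20 lines", so 50 is a placeholder, not a confirmed value.
MAX_CHANGED_LINES = 50


@dataclass
class PullRequest:
    """Hypothetical summary of a pull request, for illustration only."""
    changed_lines: int
    has_tests: bool
    touches_sensitive_path: bool  # e.g. high-volume path, critical DB transaction


def auto_approve_decision(pr, max_changed_lines=MAX_CHANGED_LINES):
    """Return (approved, reasons); reasons lists every failed criterion."""
    reasons = []
    if pr.changed_lines > max_changed_lines:
        reasons.append("too large")
    if not pr.has_tests:
        reasons.append("no tests")
    if pr.touches_sensitive_path:
        reasons.append("sensitive code path")
    return (not reasons, reasons)


if __name__ == "__main__":
    small_pr = PullRequest(changed_lines=12, has_tests=True, touches_sensitive_path=False)
    big_pr = PullRequest(changed_lines=400, has_tests=False, touches_sensitive_path=False)
    print(auto_approve_decision(small_pr))
    print(auto_approve_decision(big_pr))
```

[Anything that fails a criterion simply falls back to human review, which matches the “just don’t approve it” guidance in the answer above.]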
And then we got humans to grade it. So we got it to go through some sample PRs, it would get the result, and then we had like a human labeling these results to see what the percentage rate is like. Is it making the same decision that the human does like 90% of the time, or 50%, or whatever? And once we got into like the upper 90s, which was pretty quick, we’re like, “Okay, this is ready to go.” And we’re looking at like increasing the criteria a little bit, like loosening up, but really it’s like, we just have an opinion now about what a great pull request looks like and we want to shape all work towards that. And so rather than being expansive on like what we allow, we want the work, the inputs to be changed. And so I think that will get us the biggest results over time.
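[Editor’s sketch, under stated assumptions: the human-grading step described above amounts to measuring how often the automated decision matches the human label on the same sample of PRs. The function name `agreement_rate`, the sample data, and the 0.95 cutoff standing in for “the upper 90s” are all illustrative, not figures from the talk.]

```python
def agreement_rate(model_decisions, human_labels):
    """Fraction of sample PRs where the automated decision matches the human label."""
    if len(model_decisions) != len(human_labels):
        raise ValueError("decision and label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled sample")
    matches = sum(m == h for m, h in zip(model_decisions, human_labels))
    return matches / len(human_labels)


# Assumption: "the upper 90s" is represented here as a 95% cutoff.
READY_THRESHOLD = 0.95

if __name__ == "__main__":
    # Made-up sample: True means "approve", False means "needs human review".
    model = [True, True, False, True]
    human = [True, False, False, True]
    rate = agreement_rate(model, human)
    print(f"agreement: {rate:.0%}, ready: {rate >= READY_THRESHOLD}")
```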