From AI experiments to organizational shift: Lessons from Mercari’s transformation

Snehal Shinde:

Yes. All right. So, AI adoption is happening everywhere, every single field, but adoption alone isn’t the finish line. So, in this room, how many people have struggled to draw a straight line between AI adoption and ROI? Quick raise of hands.

Michael Galloway:

Nice.

Snehal Shinde:

Almost everybody. We believe that the challenge isn’t just about building AI or AI adoption, it’s trying to find the value in a measurable, repeatable and visible way.

All right. So, we are in the middle of a major shift in the industry: we are augmenting software development very rapidly, and it’s starting to reshape the role even faster. So, how many think that the era of just being a software engineer will go away? Quick raise of hands. Wow, that’s interesting.

So, what’s interesting is not the answer. I feel the real struggle right now is that the role is evolving much faster than organizations can define what a good software engineer will look like.

So, a quick introduction about Mercari. Mercari is one of the largest C2C marketplaces in Japan. We have around 23 million monthly active users and we are growing at a very rapid pace. And more importantly, one in five people in Japan uses Mercari. So, at this scale, customer experience and infrastructure matter a lot to us. However, visibility is very, very important to maintain something like that. So, to build that kind of visibility, data has been in our DNA at Mercari. For customers, we do a lot of A/B experiments to check what is working for the user and what is not. And similarly, we do a lot of things around engineering to track KPIs and build the DX dashboards, because we want to measure, we want to learn from failure, and we want to improve our product and also the developer experience at Mercari.

So, a quick credibility break. The Japan CTO Association runs an annual Developer eXperience AWARD, and Mercari was first in 2023. It was first in 2024, it was second in 2025. So, when we say that we have been working on DX, that means measuring what matters, building the right platforms and defining clear golden paths. We are very serious about it. And we feel pretty good about all the stuff that we did in the past. But here is the interesting part. Even with all that, when AI hit us, none of the dashboards could answer the questions we needed them to answer.

So, about 10 months ago, our CEO made it very clear: Mercari is going to be AI-native. This means we have to push for 100% AI adoption, with no exceptions in any of the roles, and that was the mandate. Japan works in a very different way. I know some of the earlier speakers said that they are encouraging AI in their companies, but Japan is very different in terms of culture. So, we had to move fast on this. So, what we did is we pulled together around 40 engineers and formed a dedicated AI task force under our CTO. And the focus was just on building, integrating and unblocking.

At the same time… Something is wrong with the slides. At the same time, we identified roughly 100 people within the company and we called them AI champions. And these people came from every single department within Mercari: legal, finance, CS, engineering, product. And the goal was very simple, but very ambitious. The goal was that humans decide what matters and AI handles the execution. But we didn’t just want to add AI on top of our existing process. We wanted to rethink even our workflows, to make them much better and much faster.

So, we did all this stuff and we ran many, many steps, but we underestimated the effort required to do something like this, because at Mercari we have roughly 2,000 employees and everybody had a different level of technical skill or understanding of technology. A person from engineering might be able to install Claude Code in five minutes, but can a CS person do that, or can a finance person do that? Maybe not. They are very comfortable with Gemini or GPT.

So, for us, it wasn’t about a tool change, like enterprise software that you can roll out to everybody and they know what to do with it. For us, it required a fundamental shift in culture and mindset across the entire company. So, to make it easier for us to pivot in the right direction, the leadership set a new mandate, which was rethink everything. So, what we build: are we even building the right thing for our customers? Because with AI, now when you have 10 ideas, you can roll out those 10 ideas in one month. But in the past, we were only able to roll out maybe one idea in that time.

How we build: does the process really need to take weeks for feature development? Like you design something, from design to ticket, from ticket to code, from code to review, from review to release, does it really need to take one week? And even how people work: not who is building it, but how well they are operating in their day-to-day life?

So, now the challenge with saying “rethink everything” is that it can be so big that you end up doing nothing, because you’re chasing a perfect workflow or a perfect solution. So, what we did is we made one rule: if you don’t know how to measure, you don’t stop. So, within a few months of going all in, something we were observing was that our existing dashboards, the ones that won so many awards, where every single year we were in the top three, could not answer a single question that leadership kept asking us, which was, “Is AI actually working and improving our productivity?”

So, what we saw happening during that time was that local optimization wasn’t translating to system optimization. Engineers said they were fast, teams said they were fast, but when we looked at the company’s end-to-end cycle, nothing had moved. Our metrics were per team, they were per repo, maybe per pipeline, and they were not stitched together across the value chain. So, we couldn’t see where we were getting stuck in the entire pipeline, and we had no visibility into our AI tools. None of the AI tools that we use, like Cursor Dev and Claude Code, had data flowing into our DX dashboard. And the quote there was like a wake-up call for us, because we realized we didn’t know how to identify our gaps. And basically, this was our first big dip.

So, within a few months, my team built what we just call DX visibility right now. At the bottom layer, we had data sources like Cursor Dev and Claude Code, which we were getting from AI tools. And then, we had key SDLC systems, like GitHub, Datadog and Jira. Most companies stop when they see, “Oh, the number of lines of code generated by AI is increasing. This means we are more productive.” But as you know, Mercari loves data, so we wanted to go beyond that. So, at the middle layer, we built external system connectors and we stitched together certain metrics, like PR cycle time, cross-platform PRs and ticket-closure rates. This was the hard part, but it made a lot of things possible for us, because on the top layer we were able to build dashboards focused on AI adoption, like usage by team and by role.

And we were surprised by some of these numbers, which Michael will share later. Then flow and throughput: PR throughput, deployment frequency, lead time, the [inaudible 00:26:39] stuff. And quality and risk, like incident MTTR. One more thing we did was measure the gen AI code-acceptance rate: when code was generated, how many people were clicking accept and pushing it into the code base? So, why these? Because the question “Is AI working?” is very ambiguous and very hard to answer. So, you need multiple facets of information so you can answer those questions much more honestly.
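As a minimal sketch of what a code-acceptance rate could look like in practice, the snippet below counts accepted suggestions over shown suggestions per team. The event shape, team names and values are hypothetical and purely illustrative; the talk does not describe how Mercari's tooling records these events.

```python
from collections import defaultdict

# Hypothetical suggestion events: (team, was_accepted).
# Team names and values are illustrative, not Mercari data.
events = [
    ("payments", True),
    ("payments", False),
    ("payments", True),
    ("search", False),
    ("search", True),
]


def acceptance_rate_by_team(events):
    """Return {team: accepted / shown} for AI code suggestions."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for team, was_accepted in events:
        shown[team] += 1
        if was_accepted:
            accepted[team] += 1
    # Every team in `shown` has at least one event, so no division by zero.
    return {team: accepted[team] / shown[team] for team in shown}


rates = acceptance_rate_by_team(events)
```

Tracked next to flow metrics such as PR cycle time, a rate like this is one facet among several rather than an answer on its own.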

So, once we could see a little more clearly, some patterns kept showing up. The first one was the awareness gap, the single biggest finding. We were telling people, “You have access to GPT. You have access to Claude Code,” but they didn’t know how to use some of them. They were maybe using them for menial tasks, but they were not leveraging them enough to get a productivity gain. And some pockets of knowledge were not getting shared enough within the company.

The second one was team boundaries. AI didn’t magically dissolve org charts. It ran into the exact same problem. When you’re working in a bigger organization and you raise a PR that makes changes in three different code bases, you require approvals. And AI did the same. When it was inside a service, it was really, really awesome. But once it started making those cross-team changes, it ran into the same problems that humans do.

Third was system complexity: because AI gave us more entry points and more ways to generate code, it also had more ways to break things. The blast radius of, “An LLM did it for me,” was much bigger than, “An intern did it for me.” And the quotes at the bottom are real engineer feedback that we got during this adoption phase. We didn’t sanitize them because Michael promised you-

Michael Galloway:

Yeah.

Snehal Shinde:

So, the one that really haunts me is the last one, “Debugging AI outputs cost more time than we save,” and it was really, really true until last year, until they did the model upgrades, so I’m thankful for that. So, that means AI was costing more than we were benefiting from it at a certain stage. And we really cannot fix that, because if someone says, “This is not working,” you cannot just tell them, “Take a new license for another gen AI tool and make it work.”

So, this is the complexity, okay? I’m going to go through big numbers here because Michael promised truth. So, we are going to drill down into complexity because it’s really important. I want to show you the baseline where we started. And this is what the dashboards were telling us in late 2025 before pre-AI DLC, P75 timings, and the 90-day window. So, plan: we didn’t measure it. We were not measuring how much time we took for planning a feature. Code was within weeks. And one interesting fact was that our local dev experience was our lowest rank in the engineering survey.

We didn’t have a good local dev experience review. I think this was our first hard spot. So, some numbers here: P75 time to first approval was five hours. The mean was 20 hours on the big repos, like the legacy monolith. I think people who are working in big corps know what we mean when we say legacy monolith. You know, it’s the code base that nobody wanted to touch. So, here, our PR cycle time median was 3.7 days and P75 was 7.1 days. That means half of our legacy monolith’s PRs took a work week just to get deployed in dev.
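For readers less used to percentile metrics, here is a minimal sketch of how a median and a P75 can be computed and read. The cycle times are made-up sample values, not Mercari data, and the inclusive quantile method is just one reasonable choice. Read literally, a median of 3.7 days says half of the PRs took at least that long, and a P75 of 7.1 days says roughly a quarter took longer than a week.

```python
import statistics

# Hypothetical PR cycle times in days (illustrative values only).
cycle_days = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 7.0, 9.0]


def median_and_p75(values):
    """Return (median, P75) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3


median, p75 = median_and_p75(cycle_days)

# Share of PRs slower than the P75 value: about a quarter by construction.
share_over_p75 = sum(d > p75 for d in cycle_days) / len(cycle_days)
```

The gap between the two numbers is the useful signal: a P75 far above the median points to a long tail of slow PRs rather than uniformly slow reviews.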

CI was our second hard spot and here the problem wasn’t run time, it was the failure rate. iOS on pull request was 14 minutes, 55 seconds with a 35% failure rate. Android was 16 minutes with a 34% failure rate. And web took six minutes with a 26% failure rate. Our back-end CI was an average of six minutes, P95, and had a 12% failure rate. And every one of these failures is a context-switch cost for an engineer.

Deploy: infrastructure changes took an average of eight minutes 49 seconds with a 19% failure rate. Monitoring: our aggregated SLO for our legacy monolith was sitting at 19%. And there were invisible bottlenecks, and the biggest one was our review queue. This was the biggest drag on the company. It was not coding. It was not CI. It was waiting for another human to look at your code, look at the diff and approve it. CI fragility: it’s not the slowness of the CI, it’s the flakiness. And flakiness was worse than slowness, because the frequent context switching and the single-threaded behavior were really, really dangerous for an engineer.
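To see why failure rate can matter more than run time, here is a rough back-of-the-envelope sketch. It assumes every run fails independently with the quoted probability, that a failed run costs the full run time, and that the engineer simply retries until green. Those assumptions are simplifications for illustration, not something the talk states. Under them, the expected number of attempts is 1 / (1 - failure rate).

```python
def expected_seconds_to_green(run_seconds: float, failure_rate: float) -> float:
    """Expected CI time until the first passing run, retrying on failure.

    Assumes independent runs and that a failed run costs a full run.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return run_seconds / (1.0 - failure_rate)


# Run times and failure rates as quoted in the talk.
pipelines = {
    "ios": (14 * 60 + 55, 0.35),
    "android": (16 * 60, 0.34),
    "web": (6 * 60, 0.26),
}

expected = {
    name: expected_seconds_to_green(seconds, rate)
    for name, (seconds, rate) in pipelines.items()
}
```

Under these assumptions the iOS pipeline costs about 23 minutes of expected wall-clock time per green build rather than 15, before counting the context switches that each failure triggers.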

Support load. These are real numbers, okay guys. So platform absorbed like 766 requests in a quarter. And every single one of these requests was stopping someone from delivering something meaningful in one of our environments. So when we get to the second dip later, Michael is going to share some amazing stuff, because he’s going to thread together that AI didn’t create these problems. It didn’t create this problem. AI just made it impossible to ignore.

All right. And this was a brutal reality. AI as is doesn’t fix a broken system. If your foundation is brittle, your pipelines are fragile, and there are gates everywhere. Security gate, compliance gate, legal gate. Something that you added, you don’t know why you added it, but it’s there. And it’s like tribal knowledge and it’s sitting there. And you have some of these unowned services. What AI does is it will just go ahead and break it faster. And this leverage cuts both ways. Now I will pass it to Michael so he can share how we fixed some of these things.

Michael Galloway:

Some of them. All right. Thank you, Snehal. So first let’s talk about the awareness gap that Snehal brought up, because I think that this was a topic that was discussed earlier on the main stage. And I think it’s a really key one to focus on first. So what we did to go after this awareness gap was we set up several different parallel efforts. The first one was really focused around culture. We accepted that this was a community problem. It wasn’t a tooling problem. So we kicked off AI jamborees, which were essentially company-wide demo days where we had 200 people per session and anyone could show something that they built. Really showcasing what’s possible, right?

We set up pizza demo days. And if anybody wants a piece of advice that hasn’t done a pizza demo day, you should do pizza demo days. People show up when you give them free food. So it’s a very good idea. It was low stakes lunch and learns. And it was driven by ICs and for ICs. So it wasn’t something really formal. We did AI open doors, where there was drop in office hours with our AI task force. So you could just bring a problem and ask for some feedback and questions. And then we did, and this is actually something that Tim mentioned, and I want to highlight because I was very proud that we did it.

We did these role specific workshops. So we focused on different kinds of roles in the company that were non-engineering roles. So finance, legal, design, ops. This is actually something that we’ll talk about in a few minutes, but it was a very big win for us.

The second thing we did is we went after friction. Most of the time we found, at least within Mercari, that if we made it hard for people, or very difficult for them to understand how to even get access to the tools or what the process was, or they had to go through security reviews or get manager approvals, then you’re going to shut down the whole program. And so we actually focused on that directly, and we made an AI bot in Slack that you could basically make a request to and say, “I want Cursor.” It would ask you a few questions and you confirm that yes.

And not only did it provision Cursor for you, but it also wired up all the security and other configuration we needed so that it was ready to go for you. So that was actually a tiny little investment that produced an outsized outcome. And one of the reasons why it did that is it set that understanding. It communicated implicitly that AI was important to us, and that we wanted everybody to try this.

The task force also, and it’s marked as an early win, but I want to just highlight a few things around the efforts here. So the task force went through, and they mapped 33 domains inside of Mercari in 100 days. And what we mean by domains is these are functional areas where we could either automate with AI, or where we couldn’t, and what the leverage points were. This is really important because it helped us understand where the low-hanging-fruit opportunities were, and it was far beyond engineering.

As a result of this, other parts of the organization that went through those role-based workshops started kicking off efforts. And so our help desk, our internal help desk support, built support bots effectively and started to pull the information in, and it resulted in a 60% reduction in help desk workloads. Actually in platform engineering, we also started experimenting with this and were able to get about a 25% decrease in our support burden for our individual contributors inside platform. So this was a very effective and early way to start leveraging AI and expose people to the benefits of it.

We also saw this one here was particularly interesting, this 45.7% drop in accounting tasks. So our finance and accounting organization, again, built support tooling and everything like that so that engineers or other people in the company could ask accounting questions. Usually things like month end close questions, looking up policies and that kind of material. I mean, on the policies alone, again, as Snehal highlighted, we’re a Japanese company, so you can imagine how many policies there are. And they are in both … some in varying different languages and detail.

But this was a really big win for us as well. This was another thing that was not only just a toil reduction for our business, but it was also really creating that reinforcing mechanism where it really exhibits, “Hey, we’re all trying to leverage this new capability in different ways.” And so it added to that kind of culture.

And then this last one here is we built an internal … Yeah, it doesn’t really tell you very much. It’s a BI agent. We gather a lot of data, as you can imagine, and that data goes into BigQuery. And querying on that data can be very complicated and then even making sense of that information that you query can be very difficult. So internally, we built an agent that sits on top of that called Socrates. And there’s a lot more to it than I’m going to just detail right now. But effectively what this enabled people to do is ask Socrates for information, as it related to our internal customer usage. Or you could poke at different dimensions of information.

And what this did is it actually helped a lot of people that had business-intelligence-related questions. But I think what was more compelling is we saw a lot of people that never thought about asking those questions now starting to ask those questions, because we had suddenly made it much easier and democratized that information. This was a big win for us because it also caused that movement towards another term that I’m sure everybody is familiar with, that product engineer kind of mindset, right? If you want people to actually invest in building and evolving your product and not just solving technical problems, you need to give them access to the information that will help them ask interesting questions.

And this was one of the things that we did that actually unlocked that. So we actually have 500+ weekly users. It’s actually growing quite a lot. We’re not a gigantic company, so this is actually quite a lot of usage for us. About one in every four people in the company is actively using this.

Okay, so where’s the time going now? So over the last two quarters, we’ve invested a lot in the entire traditional SDLC. And I’ll talk a little bit more about that. We weren’t measuring planning before, as Snehal pointed out. We actually did an internal survey and started to really dig into our planning process and how long it takes. It’s marked as plus visibility because now we understand it better. But we haven’t done enough, I think, yet to improve on that. Although we are moving towards some ideas around that.

For the code, I mean, this is no surprise, like everybody knows. The coding time has shrunk quite a lot. So it’s moved from weeks to days, and we’ll talk a little bit more about that. But I think I want to highlight something here that everybody else is probably very aware of as well. This is largely gains you see from greenfield situations, not as much brownfield, and that’s something we’ll talk a little bit more about. Greenfield has skyrocketed. Everybody’s very happy. They show their demos of something that’s non-consequential and everybody gets excited. Oh my gosh, how fast it goes.

Review. So we’ve implemented code reviews with agents. It’s saying in flight because there’s varying levels of quality with code reviews with agents. One, I’ve been taking a look at a lot of statistics to try to reconcile some of the things that I’ve been trying to make sense of about code reviewing. Generally what I saw is there’s more reviews happening, but I didn’t see any kind of change really in the time to approval, which was interesting on its own. So we’re still trying to kind of dig into that. I have a lot of theories on why that is. One real quick side note, there was a friend of mine from another company … Can I mention the name? Is it okay to mention the name?

Snehal Shinde:

Louis.

Michael Galloway:

Oh no, no, no, no, not Louis I was going to mention … Okay, Airbnb. Okay. They’re here, so I’ll just mention Airbnb. So apparently their way of handling some of the reviews. Is anybody from Airbnb here? All right. Okay. So on some of their code reviews, the way that they have sped things up is they’ll put a comment because they require several approvals on the code review. And so people will put a comment, “TBR.” Any guess what TBR stands for?

Erin Schaal:

To be reviewed.

Michael Galloway:

Yeah. Okay. You guys do it too. And that’s their way of bypassing the time it will take to review, and they just move the code forward. So I think that they might be seeing improvements in their review time. So on the CI time, we did talk about flakiness. We actually have improved the flakiness quite a lot. So we’re down from a significant level of … It was 17% in our Android time. We’ve spent a lot of time trying to improve our flaky tests.

CI, we’ve been able to bring down for iOS from 16 minutes to eight minutes. And CI, I mean, I don’t need to tell you all this is a critical stage of feedback. So the time saved here improves the overall experience, and it allows teams, most importantly, to shift their behavior. And this is something I really want to highlight. You want to shift the behavior to, when you have shorter time, to shorter and smaller releases, smaller changes, right? And so the more we can shrink the CI time, the more we started to see that we would see smaller changes going through. Which was a really good outcome for us.

Deployment time, which is a big one for us. So our P95 went from 48 minutes to 22 minutes, which is about a 54% reduction. And so getting services into the development environment, everybody knows is a really important goal for us. However, I will say a deployment time that’s roughly the length of time of an episode of Rick and Morty is still not really acceptable to me. So I want us to push a little harder. One of the guys in my organization is sitting in the front rows. This is the part he owns, so I’m talking to him. We could do better. Now, but I’m really proud of the work that they’ve been able to do there. It is a good win.
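The quoted reduction is simple to check. A minimal sketch, using only the before and after figures mentioned in the talk:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# P95 deployment time: 48 minutes down to 22 minutes.
deploy_reduction = percent_reduction(48, 22)

# iOS CI time: 16 minutes down to 8 minutes.
ios_ci_reduction = percent_reduction(16, 8)
```

The deployment figure comes out at roughly 54%, matching the number given, and the iOS CI change is a 50% reduction.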

And then finally on the monitoring, how many of you have established … Well, first of all, how many of you are in platform engineering? Awesome.

Snehal Shinde:

Wow.

Michael Galloway:

These are the people. All right. So how many of you have actually established formal SLOs in your platform? Yeah. Okay, good. That’s good. That’s good. We did not until recently, it was very … Of course, we were aware of the availability of our systems and everything like that. But the amount of error budget was not always formally recorded. And the SLO itself, more than just being a tracking mechanism for us, was a mechanism for us to communicate our intention to the rest of the business. And that was a really important step for us.
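For anyone who has not formalized this yet, an error budget is just the complement of the SLO target over a time window. The sketch below uses a 99.9% availability target and a 30-day window purely as an example; these are not Mercari's actual targets, which the talk does not state.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


# Example target and window chosen for illustration only.
budget = error_budget_minutes(0.999, 30)
remaining = budget_remaining(0.999, 30, downtime_minutes=21.6)
```

With these example numbers the budget is about 43 minutes per 30 days, and 21.6 minutes of downtime spends half of it, which is the kind of statement that makes prioritization conversations with the business concrete.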

We started to establish error budgets, or sorry, SLOs for availability as well as performance, and we’re adding to those more and more. And so that was really important also to communicate our intention in that area, and to build some discipline into what we prioritize as a group. We also, down here, and I’m not going to go through all of this … Yeah, the formatting is a little funky, huh? It didn’t look that way before. All right, so one thing I will highlight. So how many of you have spent time trying to right size your clusters?

Good. Keep doing that, because it’s going to get real expensive, real fast. This is one of the things that we spent some time on. So we actually built something in house and then we later decided to switch to a vendor that has worked out pretty well for us. And we’ve been able to optimize our resource utilization and our development environment. We’re rolling it out to production. 43% is massive savings for us. It’s massive savings for anybody. This is critical because we’re expecting more and more workloads. And I want to highlight this because all of the points that we’ve been talking about with agents making lots of changes, it’s going to bump across this very heavily.

Man, I’m running out of time. I got to move forward. The high point here is that this is actually just KTLO work and general performance investments. But every bump that you have in your road, agents are going to hit it a thousand times more than your developers. And so it’s really important to invest in those. And now, even more so, it’s your license to explain to the business why you must invest in those things. Because if you don’t, you’re going to slow things down.

All right. So some of the wins that we have. 100% of Mercari employees are using AI tooling. This is up from 30% 12 months ago. It is helpful that it was a mandate from-

Erin Schaal:

Yes.

Michael Galloway:

70% of code generated is being used with AI. I want to highlight this doesn’t mean 70% of our lines of code. We don’t care about that. We do care about, are we utilizing the tooling and learning and benefiting from it? 64% year-over-year increase in the output of engineering, measured by some of our internal proxies. Now output is really, it’s a rough metric, right? These are things like, okay, we’ve been able to produce more, our PR sizes have gone up, but I won’t say that we have seen 64% of value necessarily being increased in delivering out to production. And that’s for another reason: we found code went faster, but release velocity did not keep up. PRs spiked, but merge throughput lagged. The queues at code review and CI grew really big. We realized that code wasn’t really the bottleneck; the issue was more in getting from the idea to the code, and that was where we saw some of the challenges. And as a very real example… Oh, and this one was nasty, our maintenance PRs. So all the things where, yeah, we got some changes out, we thought it was really great.

We saw that we had to keep making more changes than we were expecting. And a lot of that has to do with the fact that code quality was impacted without a doubt, because it’s not as easy for people to review something that they did not write. So when you have… We started seeing maintenance PRs go up. One good story around the throughput. We had this project called NACO. It’s an internal project. Before we were looking at utilizing AI, the estimate was that it would take six months to build it. A team of two engineers was able to build it using some of the AI tooling, including making architectural changes, and they built that in a month and a half. This is greenfield again, so they saw a massive improvement in time. However, they got stuck, not because of the code, but because they were able to move so fast that the PMs and designers did not have new things for them to work on.

And as a result, things slowed down. And this led us to this realization: we need to invest in the earlier stages. We started to invest in what we called Agent Spec Driven Development. We brought agents into the front part of the design of the solutions. In this first stage, the agent would read the code, go through the docs and read Slack history, and that would build up the agent’s context early on. And then, we would get into the spec generation stage, or you can think of it a little bit as, the term now is intent. But this would be the new artifact, where a human would work with the agent on figuring out what it is that we want it to build. And then you go into the implementation stage, where in this case humans or agents themselves would start executing on the work, and then it would go through that validation or eval suite: regression tests, smoke tests and whatnot.

The reason this matters is because the second dip wasn’t because AI couldn’t write code, it was because the upstream was the problem. And this was our attempt to move AI to the front of the life cycle, not the middle. Where are we investing now? Brownfield modernization. Most of the AI hype is around green-field, as you all know. Brownfield is actually where the money is. And we’re starting to use agents to refactor, document, test, and modernize services. Critical observability. We’re not just observing the traditional parts of the system. We’re also looking at instrumenting the entire PDLC so that we can understand basically everything from, like I mentioned, the spec generation all the way through to delivery into production, and then even what the results are after production. That can help us start to identify other parts of the problem. Guardrails for agents.

Obviously, you can’t have agents writing payment code with extremely long… Or payment codes, we also do have a FinTech component of the company. We need to have strong guardrails, and I have a lot to talk about on this one, but we won’t have time today. But risk management is critical here, and I actually want to challenge you all to think that the strategy should be that we change the way we think about risk mitigation. Heterogeneous small teams: we’re forming these small pods for our new projects and new initiatives. Two to three members in a mixed group, with a PM, an EM and maybe an engineer, or a PM, an engineer and maybe a designer, to kind of run after these small projects. And then we’re rethinking what the PDLC should look like, the Product Development Life Cycle. We’re looking at more of the human on the loop as opposed to in the loop: humans setting constraints and direction, and agents doing the execution.

And then of course, better dashboards. DX probably likes to hear that. But the better dashboards are for us, not for the agents, and it’s for us to better understand where the other parts of the problem are. If you think about it as a road, every one of those little potholes, you might have ignored because you were humans and you could get around them. The agents don’t know that and they slam into those and you don’t know that until they slam into them. And getting those dashboards up has helped us identify where are they hitting the bumps. This is what we are working towards is a multi-agent base… Well, it’s a multi-loop with basically a swarm of agents. We’re working on this now and it’s evolving the way we think about a development life cycle and thinking about it, like I said, as a series of loops.

Improvements made in the early… The very, very inner loops obviously have an outsized impact on the rest of the delivery. Basically, until you get through all of these loops, you have not delivered value for the business. I’ll quickly just mention, the inner loop is what you might think of. Here, it’s intent, code gen and validation. The faster and closer you can make these things to each other, the better you’re going to be. We’re trying to move as much validation as we can right next to the generation of the agent. If it’s running in a sandbox somewhere, we want an ephemeral environment, actually even more than the ephemeral environment, we want local testing, just like you can imagine somebody with their laptop doing that on their laptop. You want to bring it right next to the agent so that it doesn’t have much distance.

That means cutting things out, like not having to check stuff into Git necessarily, or not having to go through a PR and a code review cycle for it to validate what it built. The closer you can get to that inner loop stage, the faster gains you’re going to get, the better gains you’re going to get. The ephemeral loop, we are making… We already have preview environments. We’re spinning up… We’re still investing in this to make it faster. The preview environments take minutes to come up. They go through a PR cycle. It’s a bit of a GitOps flow. We think we can do better and we’re trying to make this a little bit faster so that it’s basically an isolated space in our cluster, so that the agent is able to validate something within an active running cluster, but that’s not exposed or deployed out to the environment yet.

Then finally, once you have that, you’re satisfied with where it is, or rather the agent’s satisfied with where it is, it’ll go into a development or it’ll deploy out to our dev environment where it can run more robust testing, look at the integration situation and validate that what it actually built is working. Now, my personal goal, and actually the goal of Snehal and Raphael, who’s in the front here, going to mention you, we want all these steps, everything out to the purple, to not require human approval. That’s actually what we’re trying to go for because we believe that if we can get this all the way through to the purple stage, actually past the purple stage, the development deployment, we will be able to unlock better, faster iteration for the agents. But more importantly, we’ll be able to focus on the right problems, which are around guardrails and how to basically make sure that the code that’s going through our systems meets our standards without requiring humans to determine that.

That’s a really key goal that we’re trying to get to. The outer loop does still require and would still require humans for the foreseeable future, but this still can be done with canary deployments and progressive roll-outs that can be automated and rolled back. And then the measure piece there is, of course, our way of knowing whether we built the right thing. This is essentially what we’re focused on.

Some hard lessons. Just as a quick summary, gates slow you down. You have more than you realize. Go through and audit your system. I’ve gone through this multiple times. What does it take to build a service on our platform? What does it take to get a change to go through? What does it take to add new infrastructure? Ask all of those questions and manually do it yourselves. When you go through the process from an experience standpoint, you suddenly discover all these little points of tribal knowledge, all these little gates and approval steps that need to happen.

All of those are going to slow down your agents. Behavioral change is by far the hardest thing. I mentioned eliminating human approval all the way through to the dev deployment to another individual earlier today or yesterday. And the first reaction they had is, “Oh, engineers are going to hate that.” And I was like, “Oh, why?” Because they don’t want accountability. Makes sense. They don’t want the agent to make decisions, roll something out, break something, and then they’re in trouble. And that’s a really good point. That’s a thing we have to combat: we’re going to have to figure out how to change our views around these things. So behavioral change is by far harder. You can change agents; changing humans is a little bit more difficult. The W-curve is real, so expect lots and lots of these dips. Every time you hit a dip, you’re probably going to have another dip.

We hit two of them. I expect we’re going to hit a third one. So it looks something like this, right? Our first dip was due to the fact that we didn’t have enough visibility. Our second dip was due to the fact that coding was fast, but there were other issues upstream and downstream. And basically what makes these deeper and not shallow has more to do with the fragility of your platform components, and heterogeneous architecture that is very difficult for agents to understand because they lack the context to know what to do. And then of course, legacy systems. Stuff that nobody wants to touch, agents will touch it confidently and will break it.

This is the sum-up slide and we’ll borrow a frame from Jim Collins, the Stockdale Paradox. So Stockdale was a prisoner of war. He survived because he held two ideas at once: total faith that he would prevail in the end and brutal honesty about how bad the current reality was. And the people who didn’t survive were the optimists who just set arbitrary deadlines and then got crushed when they passed, right? So that’s how we’re approaching AI at Mercari. We absolutely are going towards a multi-agent future, we have total conviction about that destination, but we are also completely willing to look at the dashboards every week and see that dip and then pivot until we can improve. So faith in the destination, brutal honesty about the road. Every company has their own journey, and without the right insights, you’re going to spend a lot of time on the wrong problems. Thank you. We have 18 seconds for questions.
