In this episode, Varun Achar (Director of Engineering at Razorpay) explains how the Platform org has grown from a 15-person team owning everything to three separate subteams. He also shares how they think about creating a culture of productivity, and some of the tactics they’ve used to increase service adoption.
Tell us about your current role and the team you lead.
I’m the director of engineering in the Platform team at Razorpay. Razorpay is India’s largest FinTech startup: we offer payment gateway, neobanking, and lending services to businesses in India.
In the Platform org, my team builds the tools and services that enable application development for both the frontend and backend engineering teams. Prior to this role, I was one of the first engineers at another startup, where I headed the consumer-facing product lines along with the engineering platform. I’ve also tried my hand at entrepreneurship three times in the past.
What did your team look like when you joined and how has it evolved since then?
The Platform org was fairly new when I joined back in 2019. However, the Platform and DevOps philosophy inside the company was already in place in some ways. We were one of the first companies to use Kubernetes, so some of the fundamentals around deployment, CI/CD, infrastructure as code, et cetera, were already sorted.
When I started, Platform was a 15-member team where everybody did everything. In the initial years, Razorpay also focused a lot on getting to market very quickly because of the growth of the cashless economy in India. Therefore a lot of the capabilities that we built were part of a monolithic codebase.
Around the time I joined, Razorpay had started moving into a microservices environment, and that was where the need for a platform team originally came from.
In the early days, we started moving some of the shared services into the platform team: things like workflow services, batch processing systems, etc. And we started investing in things that were the need of the hour, like feature ramping, experimentation platforms, and canary testing. A lot of these were driven primarily by the need for a mechanism to migrate into microservices safely and to prevent severe outages.
So as the business grew, the push toward microservices and the engineering team both grew too. We exploded, I think, around 2020. The pandemic was actually good for Razorpay because a lot of businesses were digitizing. As the organization grew, the need to invest in platform grew with it.
At that time, we realized we needed dedicated teams focused on specific problem statements. That’s where focused teams solving for reliability, security, and developer productivity came from. So today, the Platform org at Razorpay has several teams organized around these three high-level themes.
Can you give a quick breakdown of those three sub-teams?
So we have security. This is different from application security, which is handled by a separate security group; the Platform security team is more focused on things like perimeter security, DDoS protection, authentication, and authorization.
Then there’s reliability. Here, we’re building the things that enable a safe microservices environment. The reliability team focuses on core components like our async message queue, and on questions like how you enable reliable message delivery across microservices.
We do some amount of chaos engineering. We did not create a full-blown team around it, but we do have the ability to do chaos engineering.
Then we have teams focused on incident recovery; that’s where our observability team sits. We also have teams to enable safe releases, so we have somebody focused on building the release train.
On the dev productivity side, we have teams focused on both frontend and backend developer productivity. They build the toolkits needed to build and test microservices, and they create a sort of integrated development environment that ties together a bunch of the tools we’ve built around logging, monitoring, deployments, etc. Similar things are happening on the frontend side.
How does the platform team, and the different subteams, think about measuring success?
Actually we’ve gone through a bit of a journey in how we think about measuring the success of the products we’ve built.
Initially, we would look at metrics like how many services were integrated into the tools we built. But that wasn’t really giving us the full picture: metrics like “how many services have integrated” don’t tell you anything about the actual usage of the platform. So around a year or so ago, we switched our mindset to treat our internal engineering teams (the product engineering groups) as customers, and we started adopting metrics closer to what a typical product would have.
For example, for our observability platform, instead of looking at how many services were integrated, we started looking at monthly active users, engagement metrics like session length, and what our funnels look like, along with engineering-focused metrics around SLAs, right? Besides that, we also started looking at metrics that matter to the customer. For example, for our testing platform, we look at how long a test waits before it starts executing, and how flaky the tests being executed are.
So we instrument our SDKs, inject things like Google Analytics, publish metrics onto analytics platforms, and get to see exactly what the customer is doing. That helps inform our strategy.
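To make that concrete, here is a minimal sketch of what this kind of usage instrumentation can look like. The endpoint, event names, and user IDs are hypothetical illustrations, not Razorpay’s actual setup:

```python
import time

import requests  # pip install requests

# Hypothetical ingestion endpoint for an internal analytics pipeline; the real
# setup (Google Analytics injection, internal analytics platforms) will differ.
ANALYTICS_ENDPOINT = "https://analytics.internal.example.com/events"

def track_event(user_id: str, event: str, properties: dict | None = None) -> None:
    """Publish one usage event from an internal SDK or console."""
    requests.post(
        ANALYTICS_ENDPOINT,
        json={
            "user_id": user_id,        # basis for monthly-active-user counts
            "event": event,            # e.g. "observability.dashboard_opened"
            "properties": properties or {},
            "timestamp": time.time(),  # session lengths and funnels are derived downstream
        },
        timeout=2,
    )

# Example: instrument a query action in an observability console.
track_event("eng_1042", "observability.query_executed", {"source": "logs", "duration_ms": 340})
```

Downstream, events like these are what let a platform team compute monthly active users, session lengths, and funnels for an internal product, just as they would for a customer-facing one.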
Each team identifies its own metrics as well, and metrics change depending on the maturity and adoption of the platform. For example, when we were trying to drive adoption of a service bus we were building internally, we looked not at the number of services migrated from the older message broker, but at the numbers of publishers and subscribers, and we chased growth there. The idea being that if we get even a single integration from a customer, the likelihood of that customer doing more integrations increases.
This strategy has served us well and has become a playbook internally, both to show impact to our customers and to give the team a sense of accomplishment. Platform teams often suffer from this problem of very high lead times before they start seeing meaningful impact. Adopting this strategy has kept the team motivated and kept the roadmap flexible.
Initially, with the launch of a product, we focus heavily on early adopters, our roadmap is shaped by their needs, and these adopters then become evangelists. We’ve seen that a product takes roughly four to six months to go through the typical adoption curve, from innovators to laggards. So this strategy has worked well for us and is now repeated across products.
Before this call, you had mentioned that the way your team thinks about productivity has changed over time. Can you talk a little bit about that story?
So in my previous org, I was handed the DevOps role, and one of the first things I did was read the DevOps Handbook and the State of DevOps reports, both by Jez Humble and his collaborators. What I read was quite eye-opening and fundamentally changed the way I thought about teams, systems, and processes.
When I joined Razorpay, I, along with a couple of interns, pulled out the dev productivity metrics for the org, using Pull Panda for some of them. We built a lot of other dashboards by hooking into GitHub’s data. We looked at different data points, built dashboards for the org, and looked at custom cuts, like the productivity metrics of devs contributing to the monolithic codebase versus microservices.
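As an illustration of what pulling such metrics from GitHub’s data can look like, here is a minimal sketch computing PR cycle time over the GitHub REST API. The org and repo names are placeholders, and a real setup would page through every repository with a proper token:

```python
from datetime import datetime
from statistics import median

import requests  # pip install requests

# Placeholder org/repo and token; a real setup would iterate over all repos.
URL = "https://api.github.com/repos/example-org/example-repo/pulls"
HEADERS = {"Authorization": "Bearer <GITHUB_TOKEN>"}

def pr_cycle_times_hours() -> list[float]:
    """Hours from PR creation to merge, for the last 100 closed PRs."""
    resp = requests.get(URL, params={"state": "closed", "per_page": 100}, headers=HEADERS)
    resp.raise_for_status()
    hours = []
    for pr in resp.json():
        if not pr.get("merged_at"):
            continue  # closed without merging
        created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
        merged = datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00"))
        hours.append((merged - created).total_seconds() / 3600)
    return hours

if __name__ == "__main__":
    times = pr_cycle_times_hours()
    print(f"merged PRs: {len(times)}, median cycle time: {median(times):.1f}h")
```

Cuts like monolith-versus-microservices productivity come from running the same computation per repository and comparing the distributions.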
As part of the study, I created a doc on how to interpret the results and gave recommendations to teams on what practices to follow. We’d get devs to dedicate time to PR reviews, look at the PR workload of engineers, and assign PR buddies to reduce the load on any one engineer. A lot of these manual process interventions were made with the idea that, eventually, teams would practice these behaviors and we could then move towards trunk-based development, continuous delivery, etc.
What we saw was that you can’t go halfway with this approach. You really see productivity gains if you’ve gone whole hog and adopted everything, and created a culture within the team. Teams which were able to do so saw huge gains, but most teams couldn’t pull it off.
We realized what we were trying to do was change the engineer instead of fixing the systems around the engineer, and changing behavior is obviously one of the toughest problems to crack. Over the last year or so, we’ve invested in tools rather than processes. We decided to let engineers be who they are and instead use tools to speed up what they were already doing. For example, they needed testing environments: instead of asking them to collaborate with each other, merge PRs, and deploy into a shared environment, we gave them dedicated ephemeral infrastructure.
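The mechanics of such ephemeral infrastructure vary; since Razorpay runs on Kubernetes, one plausible sketch is an isolated namespace per PR, created via the Kubernetes Python client. The naming scheme and labels here are illustrative assumptions, not their actual implementation:

```python
from kubernetes import client, config  # pip install kubernetes

def create_ephemeral_env(pr_number: int) -> str:
    """Create an isolated namespace for one PR; CI deploys the service under test into it."""
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    name = f"pr-{pr_number}"
    namespace = client.V1Namespace(
        metadata=client.V1ObjectMeta(
            name=name,
            # Labels let a scheduled cleanup job garbage-collect envs for closed PRs.
            labels={"ephemeral": "true", "pr": str(pr_number)},
        )
    )
    client.CoreV1Api().create_namespace(namespace)
    return name

print("ephemeral environment ready:", create_ephemeral_env(1234))
```

A scheduled cleanup job can then delete namespaces whose PRs have closed, so the cluster doesn’t accumulate stale environments.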
Another example was with automated testing. We were already running thousands of automated tests, but breakages were being discovered late, in pre-production builds and regression runs. Instead, we ran automated tests on every PR and built infrastructure to shift testing left, closer to coding. This has helped reduce the overall cycle time for feature releases.
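Here is a minimal sketch of that shift-left idea, assuming a GitHub pull-request webhook and a pytest suite pointed at the PR’s ephemeral environment; in practice this would run as a CI job rather than inline in a webhook handler:

```python
import os
import subprocess

from flask import Flask, request  # pip install flask

app = Flask(__name__)

@app.post("/webhooks/github")
def on_pull_request():
    """Run the automated test suite as soon as a PR is opened or updated."""
    event = request.get_json()
    if event.get("action") not in {"opened", "synchronize"}:
        return "ignored", 200
    pr = event["pull_request"]["number"]
    # Hypothetical URL scheme: each PR's ephemeral environment gets its own host.
    env = {**os.environ, "TARGET_BASE_URL": f"https://pr-{pr}.test.internal.example.com"}
    result = subprocess.run(["pytest", "tests/", "-q"], env=env)
    return ("tests passed", 200) if result.returncode == 0 else ("tests failed", 200)

if __name__ == "__main__":
    app.run(port=8080)
```

The point is the trigger: the suite runs against each PR while the code is still being written, instead of after merge during pre-production builds.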
These strategies have started showing results and we’ve seen a big uptick in the usage of platform tools as well. And engineers love what we are trying to do. Therefore, we now have a dedicated dev productivity team, like I mentioned already, which is focused on building this integrated development environment. So engineers can be who they are and we can help them while not trying to change their behavior.
You mentioned this need or desire to build a culture of productivity. What does that mean and how are you tackling that?
One of the things I’ve seen in my past experience, and it’s also part of the culture within the organization, is a culture of transparency. We want to give data to people and allow them to make their own decisions. Therefore, what we would like to do is inculcate this culture of looking at dev productivity.
Usually, when you walk into engineering offices in India, you have these monitors everywhere showing whether there’s an ongoing production incident, what your metrics are, etc. So everybody has a sense of what’s going on, both on the product side and on the engineering side, and whenever there’s an incident, people are aware and jump in to solve it. But we don’t do that for dev productivity. And that’s one of the key problem areas: engineering teams are not focused on trying to improve themselves.
The idea is that, through this dev productivity team, we want to do a lot of surveys, push the data back to engineers, and empower them to get better at what they’re doing. And we’d work with their engineering managers to inculcate practices that have helped other teams become better.
We are thinking about dev productivity metrics as an analytics product. Say, for example, another team has adopted a particular tool: how has that tool benefited them? Give that data back to engineers, make it available for them to use, and let them see it every day.
One idea was, essentially, that maybe we could build a Google Chrome plugin and make it the landing or home screen engineers see when they open a new tab, so the data is available whenever you open a browser. We wanted to experiment with ideas like that and essentially turn productivity into a daily conversation, so that we, as the dev productivity team, also get a lot of feedback from our customers, because engineers are thinking about productivity on a day-to-day basis.
There can be a tension within productivity teams and product engineering teams because dev prod can’t fix everyone’s problems, right? How do you think about enabling teams to fix their own problems? What are maybe some of the challenges you run into there?
We’ve had a different strategy there, in terms of how the platform team thinks about building and driving adoption for the tools we’ve built. Internally, we have an open source culture, so a lot of engineers across teams participate in building internal tooling.
But apart from this, we use OKRs very effectively inside the organization. What that means is that product engineering teams have the right to choose whatever tools best fit their needs to chase their goals. So if a product engineering team finds an external product that serves their purpose better than what the internal platform team provides, they’re free to switch over.
This does two things. One is that it enables the product engineering teams to take ownership of their destiny, and at the same time provides a healthy pressure on the platform team to think about product engineering teams as customers and build something that they really find valuable.
So the internal teams are usually supportive of the platform team’s efforts, and therefore they contribute to our success. We regularly have engineers from other product teams come work on the platform team full-time for a quarter or more to help us ship things faster. That’s typically been our strategy. When we try to drive adoption internally, like I said, we look at the early adopters, so we ship a lot of our features based on what these early adopters want. And typically we find early adopters wherever the problem is the largest.
So, that’s how we run our strategy. What we’ve seen is that product engineering teams want to contribute to building these common utilities, platforms, tooling, etc., which then get used across the org. So far, that’s been working well for us, and the healthy pressure we get from giving product engineering teams the freedom to explore externally also helps us prioritize the right way, build dev-experience tooling into the products we make, think about documentation, and so on.
Earlier you mentioned how in the past you had used certain Git and other metrics, like the DORA metrics, to find intervention opportunities with different teams, share best practices, or attempt to change the way teams were working. When you did that, was it driven by your team identifying teams that might be having problems and reaching out to them? Or did all the individual product teams own their own SLAs around these metrics and drive their own improvements?
What we did after discovering these issues was to set up engineering OKRs at the leadership level. Most leaders signed up for improving productivity in their teams. Initially, there was no dev productivity team within the platform team to drive these things. It was just me and a couple of other people acting as the voice on the right things to do to move towards, say, trunk-based development and continuous delivery, and on what stepping stones we had to take to get there.
We internally had these monthly engineering reviews where I would present a high-level picture of what our metrics looked like: what our lead time to change was, how frequently we were deploying to production, the typical DORA metrics, along with some metrics that were important to us. For example, while migrating from the monolith to a microservices environment, the productivity of engineers working on the monolith versus microservices.
So I’d give them a high-level picture, and each engineering leader would then come in and present their own metrics. This review meeting became a mechanism to check the progress of these engineering teams, but it wasn’t very successful overall. People would put in a lot of effort and, like I said, there were fundamental problems we weren’t able to solve. For example, every time we introduced new engineers into the company, the whole training and retraining process would start over, and a lot depended on the engineering managers themselves. If an engineering manager had newly joined, the success of the team would depend on how well that manager adopted these metrics and pushed for them internally. So it was essentially left to the product teams: I would give them a high-level picture and the teams would come and present their progress. But it hasn’t really worked out too well for us.
So it sounds like just getting enough buy-in or commitment from the managers to focus on these metrics with their teams was a challenge.
I mean, it’s not about getting buy-in from these managers, but it is about the fact that it takes a lot of effort. It’s just some manual effort. You just have to keep telling people over and over and over again, before it starts sinking in.
Some teams that understood this really well switched over instantly and saw huge gains. But the fact is, you also have to focus a lot on getting your product delivered, thinking about what your customers want, and looking at all the engineering work going on: tech debt, incident management, etc.
There’s a lot going on especially in a startup, where everything is moving very, very rapidly. It’s easy to slip away and these are the sort of things that don’t turn into number one priority for teams. Therefore, we realize that it’s better to adopt a different strategy. But yes, you’re right, it is difficult to change these behaviors and practices.
Are these self-improvement efforts not always a top priority for all teams because they have other deadlines?
There’s also the fact that, like I mentioned, you’re not looking at this data on a daily basis, so you don’t know when you’ve slipped. One sprint you’d be fine; another sprint wouldn’t be that great. And we usually see these trends over a three-to-four-month period, and putting in continuous effort over that long a period is hard. It’s like building a new habit, right? Building new habits is always difficult.
You mentioned you had set these at the OKR level. How else did you engage the organization to champion this and get everyone involved in this process?
One was these forums where we would discuss problems; it was a conversation that all engineering managers would attend. We would use those forums to actually talk about the gravity of the problem, and as these engineering leaders surfaced their own particular problems, it created awareness.
But building awareness was not the only thing we needed to do, right? One big problem, like I mentioned earlier, was that people don’t have access to this data on a daily basis and don’t look at it on a daily basis. We don’t see change on a daily basis, so building that culture of productivity is a hard problem. That’s what we are attempting to do this year.
But the primary mechanism to champion this and bring focus to it was these review meetings, because it was the one forum in which everybody would come together: a monthly meeting attended by our head of engineering and our CTO.
We’d have an open conversation about these problems, and each engineering team would take away actions. Before the dev productivity team within the platform team became a full-time dedicated team, individual teams were trying to solve their own problems.
In fact, the automated testing per PR was not something that the platform team initiated. It was something that one of our product engineering teams initiated because they felt like testing late in the release cycle was hurting their productivity. So they built it and then eventually the platform team took over that effort and rolled it out across the organization.
Putting in that focus initially, and having those initial conversations with engineering leaders in an open forum, helped create awareness of the problem. This was a couple of years ago, so it’s been two years since most engineering teams took on some productivity goals. This is the first year in which we have a dev productivity engineering team supporting their needs.
So I would say awareness has been created inside the organization, but investment in the right tooling is something we had done only in parts and are now doing in a dedicated fashion.
You’ve written about the skillsets a platform engineer needs to be successful. One skillset you called out was that platform engineers need what you called the “product mindset.” Can you explain what that means, and how does a platform engineer apply this product mindset to their jobs?
So most teams in the Platform group at Razorpay were initially started by engineers who had been in the company for some time, and who therefore, by default, had a lot of context about the problems engineers faced. This is usually the approach most companies take.
If you look at the overall job profile of a PM, the primary responsibility is to represent the customer and understand what their issues are. So most teams would have one or two engineers who could represent the customer, because they themselves had faced these issues when they were part of a product engineering team. That strategy helped us in the initial phase: we would kick off a team and then expand it around these engineers.
Once these teams were formed, these engineers were encouraged to have focus group discussions with other engineers, interview them, collect feedback over Slack and surveys, etc. Every quarterly planning would be preceded by this effort of having these customer conversations. As part of these focus group discussions, we would identify the key problem areas our customers were facing. This is essentially what a product manager does, right? Before we would actually start investing in something, these same engineers would write a product concept note, using the same template our product managers use for actual customer-facing products. As part of that template, engineers had to do their market research, lay out a go-to-market strategy, define impact, etc.
This exposed these engineers as well to the whole concept of thinking like a product manager, and that has actually helped us a lot.
When a product is launched, the first few integrations are done by the engineers who built that platform product themselves, which also exposes them to the user experience of their own product. All of this has helped generate customer empathy, create context, and train them to think in a certain way.
As these engineers mature, they then take up the role of a senior engineer and then start representing the customer internally.
So that strategy is something that we’ve used. Essentially, get a senior engineer who’s been in the company for some time to start a team, build a team around them, and carry out the typical product manager responsibilities through these engineers.
In India, one of the big problems is that you don’t get product managers easily for platform teams. It’s hard to find, say, a product manager who’s worked on an observability platform. It’s hard to get a product manager who’s worked on dev productivity tooling. Therefore, these engineers are in some sense also forced to play the role of a product manager.
But it is part of the culture inside the organization that before we build anything, we invest in this effort of going and doing the research, putting out a product concept note before we do any tech spec-ing or start writing code.
All of these things have helped inculcate this mindset in the engineers. Then you tie it up with the OKR model we have, where we take on certain metrics to drive adoption; adoption is one of our key metrics each quarter whenever we discuss OKRs. Driving the overall success of the team through what you’re building, understanding customer pain points, and then actually delivering value and showing that impact is inculcated in most engineers inside the platform engineering team.
How does that product mindset idea bleed into hiring and recruiting? Do you look for a particular type of engineer when hiring?
We don’t really look for a product mindset when recruiting engineers, but we do when we’re hiring engineering managers for the Platform org.
As part of the engineering interview process, we have a product round, where a product manager will interview the engineering manager. The purpose is to try and understand first-principles thinking: Can this engineering manager actually think about the customer and identify what their pain points are? How do they come up with solutions to the customer’s problem?
So it’s not a conversation about the number of ideas you generate; it’s about the process you follow to identify who your customer is. What are the key problems? How do you define a strategy for solving those problems? How do you take the product you’ve built to market, etc.?
What’s one tactical win you’ve been able to deliver that other teams can learn from?
One of the big problems for most platform teams is actually getting internal adoption of their products. Like I mentioned earlier, we don’t have a top-down philosophy when it comes to adoption. The platform team actually has to go and convince our customers and demonstrate the value of the platform. And the big problem is how you do that at scale, and quickly.
We faced this problem, specifically when we were trying to drive adoption for the observability platform we’d built. We wanted to figure out how to expose engineers to the features of the observability platform in a real-world situation, without going through an expensive integration process. One of the ideas that came out of the team was to try the concept of GameDays.
So the team hosted an internal competition where we simulated a real world outage in a microservices environment. The objective was to find the root cause as quickly as possible using our observability platform.
We launched an internal marketing campaign about the competition and the prizes you could win, and we got around 40 teams to participate. Teams earned points based on how quickly they identified various events in the system and eventually found the root cause.
The event was also themed around a startup called Doge Cinemas, with a website and a ticket booking system. We made movie posters and a logo for the company to make it fun. In the background, the team had repurposed a few open source microservices to create this distributed system, into which we could inject failures and create that environment for teams to try out the observability platform.
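The interview doesn’t detail the injection mechanics; as a rough illustration of the general technique, a toy service with a fault-injection hook might look like this (everything here, from the knobs to the endpoint, is a hypothetical stand-in):

```python
import random
import time

from flask import Flask, abort  # pip install flask

app = Flask(__name__)

# GameDay knobs: organizers tune these at runtime to stage the "outage".
FAULTS = {"error_rate": 0.3, "extra_latency_s": 2.0}

@app.before_request
def inject_failures():
    """Randomly fail or slow down requests so participants can hunt the
    root cause through the observability platform, as in a real incident."""
    if random.random() < FAULTS["error_rate"]:
        abort(503)  # surfaces as a 5xx spike on dashboards
    time.sleep(random.random() * FAULTS["extra_latency_s"])  # surfaces as latency jitter

@app.get("/bookings")
def bookings():
    # Stand-in for a Doge Cinemas ticket-booking endpoint.
    return {"status": "ok", "seats_left": 42}

if __name__ == "__main__":
    app.run(port=8080)
```

Flipping the error-rate or latency knobs mid-event produces exactly the kinds of symptoms participants then have to trace back through metrics, logs, and traces.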
We built independent environments for each team. We had roughly around 250 engineers who participated in this competition and that really helped.
It was a fun event. People tried out the platform. They learned to use the features in a safe way while trying to debug a situation that they would really face when they have an outage in their service.
And actually, after this event, we saw a huge uptick in the users of the platform.
This is one of those things that we’re trying to turn into a repeatable event, done quarterly.
We actually got exposed to this idea by participating in other GameDays hosted by some of the partners we use within Razorpay, and we realized it was something we could do internally as well. So we did it, and it worked out really well for us.
And quick follow up to that, was something that happened in person or was this remote?
This was remote; it happened during the pandemic. But it was a well-organized event. We did it all over Zoom and tried to make it fun and engaging.