Podcast

Leading platform engineering at Trivago

Thomas Khalil, Head of Platform at Trivago, describes how the teams reporting into him are structured, the tactics they’re using to increase awareness of their work, and how they demonstrate their impact.

Timestamps

  • (1:17) The pillars of the Central Platform organization
  • (2:18) The organization’s focus on time to market and efficiency
  • (3:09) The differences in developer experience between teams
  • (4:37) Deciding whether to consolidate
  • (5:57) How platform, developer experience, observability, and SRE teams interact
  • (8:40) How these problems were being tackled previously
  • (10:09) A failed attempt at rolling out Backstage
  • (13:57) How SRE squads are organized
  • (15:48) How to motivate platform teams
  • (17:23) Demonstrating the impact of the organization
  • (18:51) How the data is collected
  • (22:44) How they’re increasing awareness for their work
  • (23:54) The DevEx pillar
  • (25:57) How their roadshow will work
  • (28:11) How DORA metrics fit into their measurement program

Listen to this episode on Spotify, Apple Podcasts, Google Podcasts, Overcast, or wherever you listen to podcasts.

Transcript

Abi: Thomas, thanks so much for sitting down with me, coming on the show. Really excited to chat with you.

Thomas: Thank you so much for having me.

Abi: Well, we were having such a fun conversation before we hit the record button here. So I think we’ll backtrack a little bit because I want to make sure listeners get the full scoop. But maybe just start by sharing a little bit about your current team, what its focuses are and how it came to be.

Thomas: So I lead the SRE function in the company as well as what I call the central platform, which is comprised of three pillars. So we have a platform team, which is responsible for the cloud landing zone, tenant management, governance, some DevSecOps, as well as our main deployment runtime, which for us is GKE, as well as the baseline that we deploy on top of that. In addition, we have a developer experience team, which is more focused on tooling as well as golden paths, documentation, and our tech radar. They are there to basically make the daily lives of the SREs a little bit easier and more predictable, and ultimately those of the rest of the tech community in the company. Last but not least, we also have an observability pillar, which takes care of the metrics, logging, and distributed tracing needs for our entire tech footprint as a cross-cutting concern.

Abi: Before the show, we were talking and you were telling me a little bit about your team and I said, oh, so this is all for reliability, right? Your company probably cares a lot about reliability, you said no, actually it’s more time to market and efficiency. So can you explain what you meant by that?

Thomas: I didn’t mean that the company doesn’t care about reliability. We care about reliability a lot, don’t get me wrong, but as usual, it’s not a mutually exclusive thing. Reliability is really important. We need to keep the lights on. The business needs to continue to operate, and at the same time we want to be really efficient and let’s say on the delivery side, more predictable. It shouldn’t be, okay, I’ve landed in this new team, let me spend a week figuring out which CI environment is used and how do I get a new build out. We want to make these things really team agnostic and predictable to the point that they’re boring and people don’t really think about them, they just do them.

Abi: Also before the show, you shared something I thought was really interesting. You mentioned that historically, if a developer switched from one team to another, it was as if they had switched companies in terms of the difference in tooling and delivery process. So how did things end up like that and can you share just concretely why that was?

Thomas: It’s a combination of things. There’s a long sort of history that comes with it. So there were various teams operating on various products in various geographies. Some of them tended to have particular preferences for doing things a certain way, and there were also some historical acquisitions that came along the way, and it just ended up with a huge fragmentation of tooling and platforms. I think at one point we had our data centers, we had stuff on AWS, stuff on Azure, stuff on Heroku, stuff on DigitalOcean, and I’m probably forgetting a couple of other ones as well. It really and truly was like joining another team. This team is using PHP and Apache and they’re deploying on AWS. This other team is using Ruby and they use CircleCI and they deploy to Heroku. At some point we decided to consolidate and move to sort of one provider, and off the back of that move on to other topics like CI. Okay, do we really need six different CI platforms? Let’s shrink that down and so on. That’s pretty much how it went.

Abi: This duplication problem is a common one. It sounds like you had a particularly troublesome case of it at your organization. How did you start to reason about… Consolidation isn’t always better. Your team is in service of helping developers be happy and productive. How did you reason about where it did and didn’t make sense to consolidate?

Thomas: It was sort of approached from a different angle. It was more that as you go further up the stack and use more, let’s say, serverless things, there tends to be an element of lock-in around them. Those solutions don’t tend to really be portable. So whenever a team ended up inheriting something from another team, the first reaction was, oh my God, what is this? I’m not really productive. It’s really difficult to iterate on this. We should do something about it. This sort of indirectly drove these rewrites and these moves. Google Cloud sort of emerged as the most friendly way for developers to do things with the least amount of complexity. I mean, it’s not perfect, don’t get me wrong, but it was much easier and there were enough people that could also provide help and support along the way. So that’s kind of how that emerged as where we ended up. So we tend to prefer more portable microservices rather than more serverless type architectures, event-based architectures.

Abi: So you lead this team and you threw out a lot of labels as far as the different things the team’s focused on and how it’s organized. So I want to drill in a little bit deeper and have you explain the structure and the focus. So you mentioned DevEx, you mentioned observability. Can you break it down one more time? How is your group organized? What are the different pillars or lanes that folks on your team are focused on?

Thomas: Let’s call them pillars, just to be consistent with the example. So we have a platform pillar, and that basically covers the responsibility for providing a common runtime. So what does that mean? That means Kubernetes and, to some extent, networking. We also collaborate with our networking team on essentially what needs to be provisioned such that the runtime is usable. So service mesh, all the shared services, et cetera that come along with that, as well as the governance function. So basically think about the cloud landing zone; that’s catered for by the platform team, tenant management, so on and so forth. Developer experience is there to sort of produce the golden pipelines, provide guidance to the organization, and provide clarity and instruction where there isn’t any on how to push forward with things. They’re also subject matter experts when it comes to CI, for example, how to write efficient Terraform modules, so on and so forth.

Observability covers the usual suspects: logs, metrics, and distributed tracing. Together they sort of form the central platform offering, bells and whistles, batteries included. Adjacent to that, we have a few SRE squads, which are responsible for maintaining the operations, uptime, and reliability of the business’ applications. They work very closely together and are able to also federate the communication, because when you end up with one group of people that is doing everything, it gets really messy really quickly, and we cater to I think about 40 different teams. So it also makes stakeholder management a lot easier. It also provides clarity on the customer end because they always know who to go and speak to.

I know Dennis from my SRE squad, he’s my guy. I’ll go to him and say, hey, I need you to help me out with this. Basically the squads interact a lot with each other and they’re able to bubble up a lot of the common things that are happening throughout the organization, and this is to the advantage of the overall platform team, because they can still have their ear to the ground and see what problematic things are coming up that they should probably jump on before they become an even bigger topic.

Abi: I definitely want to ask you more about how your group keeps an ear to the ground and focuses on the customer, the developers across the organization. But first of all, thank you so much for that breakdown of your group. I was taking notes. I liked the way you have it organized. I want to ask you, so before you came to this current structure where there’s a lot of clarity around the different lanes and the different pillars, what was the old way in which all these problems were being tackled or not tackled across the company?

Thomas: Well, I still like to think that we’re not there yet. It’s a journey with these things. I’m not quite even sure one can ever reach the final destination, so to speak. But we are certainly in a much better position than we were initially. Initially we were just one group of people. Actually worse, we were one official group of people. We had a litany of similarly functional groups dispersed all throughout the organization doing very similar things, but being called very different things, be it DevOps, be it release engineering, be it whatever.

Coordinating things was extremely difficult. It was a highly reactive environment, more of a fly by the seat of your pants kind of thing. One of the biggest problems we also had was that whenever someone wanted to get something done, they’d sort of pop up and it was like, okay, who feels responsible for X? Who do I go to for this? Depending on who you asked, you’d get a different answer. So the idea was, like I said, first to bring everyone under one roof, sort of bootstrap a community of SREs, get a lot of exchange going on, and also be in a position to communicate to the rest of the organization: this is how we’re structured, these are your contact points. Please go to them.

Abi: You were earlier talking about some of the prior struggles that you’ve had around the platform side, particularly a failed attempt previously to roll out Spotify’s Backstage platform internally. So Spotify Backstage is something that comes up a lot with people on this show and platform teams across the industry. I don’t know the ending to that story because I don’t know if you eventually got it to work, but I want to start with the failed story first. Tell us a little bit about what happened and why you think that didn’t get off the ground?

Thomas: My theory is that it was largely due to a couple of reasons, timing being probably the biggest one. So in terms of your readiness, if you don’t have good coverage of IaC, if you don’t have a certain common surface area, a certain standard around how do I maintain my docs, my documentation for my projects and so on, it becomes really difficult to launch such an initiative. I might be wrong here, but to my knowledge at the time, the way Backstage sort of worked was very API centric. So you deploy Backstage, you configure your plugins, and you basically say, here’s Google Cloud, here’s AWS, I need you to start creating stuff. That really clashes a bit with IaC and the GitOps way of doing things. So I think the ecosystem has matured a lot in the meantime, and now it’s probably in a much better state to adopt on the basis of infrastructure as code and GitOps.

Abi: So when you say that this initiative failed, what do you actually mean? Was it that someone came in and said, hey, stop working on this because we don’t see value? Or did people not care about it? How did it fizzle out?

Thomas: There was not a lot of customer uptake. So basically people didn’t see the value in it. They’re like, why would I look at this when I could just go to GitHub? I can see my pull request in GitHub. You’ve put in an iframe here for Kibana, cool, but I have that bookmarked. What’s the value that I get by going through this thing? So in some ways it was sort of a solution looking for a problem. When I also think back to that time, we still had a ways to go with regards to our migrations. We’re now much closer to the end, and once you have everything in one place, it becomes a bit easier to sell this kind of thing. Not only that, but in the meantime, we’ve also done a separate initiative to come up with a standard for how we curate technical documentation for all the services that we have, which is based on MkDocs, which just happens to leave the door open in the future if we want to revisit Backstage.

So that puts us in a much better place to experiment, maybe with a smaller group of users, and see what works well and what doesn’t. From a company’s perspective, one of the biggest things we aim to get out of something like this is being able to answer very basic questions like, how many applications do we have deployed? Where are they deployed? What tier of service is it? Is it a gold tier? Is it a silver tier? Is it a bronze tier? Which team is responsible for this application? Where do I find the runbooks? Where do I find the dashboards? All those sorts of things. So yeah, I’m really excited to see how it’s going to turn out.
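To make those questions concrete, here is a minimal sketch in Python, assuming a hypothetical in-memory catalog. The record fields, service names, team names, and URLs are all invented for illustration; this is not Trivago’s tooling or Backstage’s data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEntry:
    """One catalog record; every field name here is hypothetical."""
    name: str
    tier: str          # "gold", "silver", or "bronze"
    owner_team: str
    deployed_to: str   # e.g. a cluster or project identifier
    runbook_url: str
    dashboard_url: str


def count_by_tier(catalog):
    """How many applications exist at each service tier."""
    return Counter(entry.tier for entry in catalog)


def owner_of(catalog, service_name):
    """Which team is responsible for an application, or None if unknown."""
    for entry in catalog:
        if entry.name == service_name:
            return entry.owner_team
    return None


# Made-up example data, not real services.
CATALOG = [
    ServiceEntry("search-api", "gold", "team-search", "cluster-a",
                 "https://docs.example.com/search-api/runbook",
                 "https://dash.example.com/search-api"),
    ServiceEntry("batch-export", "silver", "team-data", "cluster-b",
                 "https://docs.example.com/batch-export/runbook",
                 "https://dash.example.com/batch-export"),
    ServiceEntry("internal-wiki", "bronze", "team-tools", "cluster-a",
                 "https://docs.example.com/internal-wiki/runbook",
                 "https://dash.example.com/internal-wiki"),
]

if __name__ == "__main__":
    print("applications deployed:", len(CATALOG))
    print("by tier:", dict(count_by_tier(CATALOG)))
    print("owner of search-api:", owner_of(CATALOG, "search-api"))
```

Even a flat list like this answers the "how many, where, which tier, who owns it" questions; a developer portal mostly adds discovery and upkeep on top of the same kind of record.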

Abi: Yeah. I’m curious, what’s your current solution to, especially the latter part there of, what are our applications and where are they deployed? How do you track that without a comprehensive system like a developer portal?

Thomas: At the moment, it’s more, what can I say, urban myth and legend and stories shared from one generation of SRE to the next. No, I’m kidding. It’s a number of things. We’re mainly on Google Cloud, so we rely a lot on the good stuff that people at Google provide through the console and the various GCP APIs, as well as Argo CD. So we can glue some calls together and we can get a pretty rough idea of what we have.

Abi: Earlier you were telling me about an interesting way you organize the squads, particularly the SRE squads. Would love to ask you to share more about that. You mentioned you had some outside inspiration that guided you toward that approach. So share with listeners your approach and how you found that to work?

Thomas: So as I mentioned initially, there was one sort of SRE team, the kitchen sink team as I like to call it, that was sort of doing everything. That was great because there was always lots of excitement and lots of things going on, but the problem was there was no focus. There was all kinds of stuff going on, and it was super reactive. Not only that, but you also had other teams that were doing similar things called by different names in other parts of the company. So we decided as a first step to bring everyone under one roof and basically bootstrap a bunch of squads, mainly around a particular workload type.

So the thinking here was that we could roughly split up all of the workloads that we have into three distinct buckets. We have a bucket that is big data workloads: Kafka, databases, stream processors, batch workloads. We thought it made sense to create a squad that would cater to those types of workloads and be able to work really closely with the teams that were mainly operating in that technical field. We also created an SRE squad that mainly looks after our backends and search engine. Last but not least, we created a squad that takes care of our APIs and sort of publicly facing workloads: main website, GraphQL, that sort of thing. The idea again is to create areas of focus where folks can gather expertise and also apply it to all of the workloads that are under their care.

Abi: Well, I love this thoughtful way that you organize these squads and how you organize your broader group. I want to ask you, how do you inspire your team? With a lot of these platform teams, you’re not shipping the thing that’s in the press release for your company at the end of the quarter. You’re not being mentioned on earnings calls and things like that. So what’s the currency? What’s the energy? Where does the inspiration come from? How do you drive your team to be excited about the work you’re doing that’s internal facing?

Thomas: Basically setting goals, achieving them, and celebrating them. Probably to my own detriment, I’m sometimes seen as a walking and talking motivational poster. Sometimes it’s really easy to get hung up on, oh, I underestimated this thing. It’s taking me so long, or oh my God, it feels like this is taking forever. People need a reminder occasionally, like, hey, you’re actually doing much better than you think you’re doing. It’s also been really great to see some of our more junior people thriving in this structure.

So we also took on, like I said, folks from other parts of the organization that were sort of isolated and didn’t have anyone to fall back on, especially when it came to technical advice or how to tackle a certain type of problem, and what a difference a year has made for some of these people. It kind of makes the whole thing worthwhile. It’s really setting people up to succeed, celebrating the wins with them and, yeah, the sense of accomplishment, let’s put it that way. So we have a good idea of what the end state looks like, and we sort of break it up, parceling things into more digestible blocks, so to speak. Otherwise, it’s just such a long tail. It can get demotivating pretty quickly. So it’s not perfect, but yeah, we’re still trying.

Abi: Makes sense. Well, sounds like you paint this future state that the team is aligned around and you celebrate the wins along the way with clear milestones. How do you show your boss that the stuff you’re doing is working? How do you actually show the impact of what you’re doing and how’s that shown back to your team as well?

Thomas: It reflects back in a few ways. One of them was the metrics around the cost of incidents that we’ve had year over year, which we saw a pretty significant drop in. It’s like a fivefold decrease, I think it was, year over year. Also happiness, retention, things like internal surveys and just the overall feedback from the customers, and the fact that we’re better able to absorb the incoming workloads, be it doing migrations to the cloud or working on new products or tackling longstanding technical debt. In our organization the work is really visible, both in terms of our planning and the delivery of it, and we do regular updates on the things that we’ve achieved. So the work gets a lot of visibility and everyone seems pretty happy with it. Everyone’s been pretty receptive to these efforts and very encouraging, which is awesome.

Abi: Got it. So you are measuring your impact based on a number of different things. In some cases it’s cost. In some cases it’s the retention and satisfaction of the developers across the company, or maybe just even qualitative feedback you’re hearing on the ground. Who collects all this data? Who is running the surveys or capturing the feedback from your users? How is that work being done?

Thomas: So at the moment, it’s mainly done by the squad leads. So each squad has a designated lead that is responsible for the stakeholder management of their customers. We meet regularly, we discuss what’s going on, and we’re able to also plan and position ourselves based on the organization’s needs. So that’s what we do on that front. On being able to measure it more accurately, we have multiple initiatives that are going on concurrently. We aim to formalize this measurement through a survey, which we haven’t done yet, and on the developer front, we’re in the very early days of doing qualitative surveying of our tech population.

Abi: How do you think about the timing of rolling out some type of wider survey or developer listening program like you’re alluding to?

Thomas: Yeah, like I said, one of the things I believe in a lot is that when you’re up against a pretty gnarly problem, timing is something people underestimate. You asked me earlier why that initiative failed, and this happens a lot, especially with people that have maybe stayed in one company too long. Whenever you bring something up it’s like, oh, we tried that four years ago. It just didn’t work. Are you crazy? Well, why did that initiative not work? Unless you have a good answer to that, I would argue you really haven’t learned anything from that failure and you’re probably doomed to repeat it. It comes down to timing, the readiness, the culture, the receptiveness of the organization, all the forces that align and bring things to an opportune state where you can make a move and be successful, be it that we’re further along in our migration, or that our people are more competent and comfortable working in the cloud or building and deploying their own services. Timing is very, very critical.

Abi: As far as your existing data collection efforts and feedback gathering, what’s the meta feedback, I suppose? What’s the feedback you get about the feedback you’re collecting? Are you collecting signals that seem credible? Are they providing enough for you to show that impact, not just internally with your team, but upward to leaders across the company, even non-technical stakeholders? Just curious, do you have enough right now to truly show your bosses and bosses’ bosses the impact that you’re having on the business?

Thomas: So in a quantitative manner, yes. Like I said, I can pull out the bill and we can go through that. We can refer to all the efforts we’ve put into reducing our tooling landscape and various other things. On the qualitative side, like I said, we started recently with surveying our developers because we believe that at the center of all the initiatives, it’s all about people in the end. Our main effort is to increase people’s productivity and make them feel more secure doing the things that lie outside of their comfort zone. How to make deploying on the cloud predictable. How not to accidentally get a $50,000 bill because I forgot something when I was provisioning a cloud resource. It’s still early days in terms of our experience with that, but some of the casual feedback, meta feedback, I’ve had about that is, oh my God, this is awesome.

Someone actually cares enough to ask me about the things that I run into, the obstacles that I run into frequently, and is interested in doing something about it, which is quite nice. Like I said, it’s a journey and we also need to maybe get a bit better at doing a bit of internal advertising and marketing about our efforts, because that’s also something that I think tends to fall to the side. You concentrate a lot on the technical aspects and you forget to tell people, hey, we’ve got all this cool stuff that actually makes your life easier. So it’s a balance, but we’re making progress.

Abi: To that last point, do you have conversations… You have the company summit, the all eng meet up and people are like, oh, what do you guys do? Is that kind of something that you hear a lot?

Thomas: Not really, but to sort of try and increase awareness for the bits of the company that we don’t really interact with a lot, we’re actually planning our first developer experience roadshow. So bring out all the kit, all the toys, and just create a bit of awareness and engagement with the wider technical community.

Abi: Why do you think that matters? Why are you doing this roadshow? Why not just not do it?

Thomas: Otherwise, we’ve just spent a lot of time and effort building things that will end up used, but not to their full potential. It’s sort of increasing the clientele, doing a bit of internal marketing. I tell my people, you need to bring out your inner used car salesman a little bit, you need to be able to create awareness and, what’s the word?

Abi: Evangelism.

Thomas: Yeah, not quite evangelism, it’s more like brand awareness. Yeah, that’s it. Creating brand awareness, I would say.

Abi: Yeah. Well, I really like that analogy and I think it’s a good bridge into something else we were talking about earlier. So you mentioned that your DevEx pillar within your group is very focused on guidance to the organization, providing subject matter expertise, providing instruction. Can you give a little more concrete picture of how that works and is that partly this brand awareness thing you’re talking about, or is this more really about enablement and just unblocking teams?

Thomas: It’s mainly about enablement and unblocking teams and like I said before, really pushing hard on the reusability and convergence levers.

Abi: What are the activities like? What are these people doing? When you say they’re providing guidance, are they just writing more docs that go into your MkDocs portal or are they, what are they doing?

Thomas: A variety of things. It could be responding to pull requests, it could be responding to personal support requests or seeing someone lost in a Slack channel saying, I’m completely stuck here, or I followed this piece of documentation, but the thing isn’t working as I expected. It could also be greenfield projects, folks looking to start a new initiative or build a new service who say, I have a rough idea what I’m doing. This is sort of what I’ve sketched together. Can you take a look at this and tell me what you think? Am I missing something? Is there something obvious that I could do? So a variety of things.

Abi: It sounds like then your DevEx team or pillar is a little bit like a support desk for all the internal work that your group is doing. Is that a fair analogy or would you characterize it differently?

Thomas: I wouldn’t say so. Like I said earlier, we were one group of people doing kind of everything, so there’s always that muscle memory there. The main focus at the moment is really working in close step with the SRE squads to get some uniformity and fix the more immediate things that we’re facing, with the long-term view to target the wider tech community, because you’ve got to start somewhere and we have a local-first philosophy. So the support stuff just comes, I guess, as muscle memory and an innate desire to help people.

Abi: How’s this DevEx roadshow actually going to work? Tell me the timeline, the structure. What are you going to present? Are you giving out free candy? Tell us how you guys are actually going to execute upon it?

Thomas: If I’m really honest, it’s quite surprising how well people respond to merch. That’s one of the things that surprised me in the last couple of years. Folks tend to respond really well to that, so I should probably organize something in that area. To be honest, we’re sort of still in the planning phases of it, but the general idea is mainly to provide an official sort of introduction to the pillars in the team, and explain from a strategic perspective why this team exists, what the challenge is, and what we’re actually setting out to achieve. Off the back of that, we have a lot of areas that we want to introduce people to and sort of do some live demos, be it, here’s this really awesome tool that helps you get your applications deployed, called Skaffold. Here’s how it helps you iterate locally, and it will package and deploy your application for you into a cluster.

Here’s how to use Argo CD, here’s how to use the Terraform registry, here’s how to automate your releases. Here’s this awesome thing called Renovate bot, which will update all of your packages and Terraform dependencies and container images and so on and so forth. So really, like I said, just going through the toolbox and putting all this stuff out on display. Some of it is not even stuff that we necessarily built ourselves, but it’s more about creating awareness, like, this is our preferred way to go about it.

We have a lot of stuff that can be reused to tackle this particular set of problems, rather than people having to do the exploration and research on their own, and it’s also sort of reminding people of where to go for help in terms of documentation sources, and where do I go to look for stuff when I don’t know what I’m looking for. That’s sort of the meta problem when it comes to documentation. I don’t know what I’m looking for. That’s the rough outline and idea of what we want to do. So overall introduction, strategic background and picture, and lots of interactive demos. Short and sweet, let the tools sell themselves.

Abi: Love it. I think before the show, you also mentioned the DORA metrics. I’m curious how that fits into this measurement and data kind of strategy that we’ve been talking about. Do you guys measure them and what’s been your experience as far as what those types of metrics have shown you?

Thomas: So we are still in the really early days of that. In the end, it’s an interesting data point to look at. The thing is, on a meta level, if you say, okay, why does it take us so long to build and deploy a new service? It’s kind of tough to answer that question if you don’t have a baseline that you can track over time, right? Having said that, as we all know, looking at quantitative metrics can be extremely dangerous and tends to backfire. There’s a significant body of research to back that. You would also be surprised, or maybe not surprised, at the number of SaaS vendors that operate in this space that are trying, especially these days, really actively to gain new customers. Just integrate us with your GitHub org and we will tell you, down to the person, who’s the most productive in terms of lines of code and pull requests.

Even these quantitative metrics are useful to a degree, but they’re not the be-all and end-all. I share the opinion that they serve as a good overall proxy to give you an idea: are your efforts bearing fruit, essentially? Are you on the path that you think you want to be going down? We sort of surveyed people about this and asked them, so what are your build times and what’s your perspective on that? Is that good for you or not? Some people are like, yeah, it takes me an hour to do a new release and to get that deployed, and for us it’s pretty acceptable. Interesting data point.

Let’s put a pin in that one and get back to it. Whereas other teams are like, yeah, it takes me eight minutes to cut a release and deploy, and I’m very unhappy with that. I need to make it faster. So at this point in time, it’s mainly about actually gathering those metrics and just putting them in the hands of the application owners just to give them something to look at over time, to see if there’s any deterioration or improvement, but not necessarily something that would be used for decision making or from a strategic sense, so to speak. It’s mainly about putting it in the hands of the actual application owners.
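One way to picture giving application owners a number they can watch over time is the sketch below. It assumes a hypothetical list of release durations in minutes for a single application, oldest first; the function name, the window size, and the 10% tolerance are arbitrary choices for illustration, not values from the conversation, and this is not an official DORA calculation.

```python
from statistics import median


def duration_trend(durations_minutes, window=5, tolerance=0.10):
    """Compare the median of the latest `window` releases with the
    median of everything before them (the baseline).

    Returns "improved", "deteriorated", "stable", or
    "insufficient data" when there are fewer than 2 * window samples.
    """
    if len(durations_minutes) < 2 * window:
        return "insufficient data"
    baseline = median(durations_minutes[:-window])
    recent = median(durations_minutes[-window:])
    if recent > baseline * (1 + tolerance):
        return "deteriorated"
    if recent < baseline * (1 - tolerance):
        return "improved"
    return "stable"


if __name__ == "__main__":
    # Made-up samples: an eight-minute release that drifts to twelve.
    history = [8, 8, 8, 8, 8, 12, 12, 12, 12, 12]
    print(duration_trend(history))
```

The sketch deliberately reports only a direction against the application’s own baseline. Whether eight minutes or an hour is acceptable stays a judgment for the owning team, which matches the point that the same number can satisfy one team and frustrate another.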

Abi: Thank you so much for listening to this week’s episode. As always, you can find detailed show notes and other content at our website, getdx.com. If you enjoyed this episode, please subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Please also consider rating our show since this helps more listeners discover our podcast. Thanks again and I’ll see you in the next episode.