Abi Noda is joined by Crystal Hirschorn, who leads Platform Infrastructure, SRE, and Developer Experience at Snyk. In their conversation, Crystal shares the story behind the recently founded Developer Experience group, including why they named the team Developer Experience, how she calculates the cost of the problems they solve, and how they partner with engineering teams.
Abi: I know you’ve been at Snyk now for almost two years and just recently formed a DX team. I think you mentioned three months ago. So I'd love to hear about the impetus or journey your group went through to start this team.
Crystal: Yep. As you said, I've been at Snyk for two years. It wasn't the team that I started off with. We were a very small group of just two engineers in the earliest days, and now we have three teams. DX or DevEx is our most recent team, as you said, only starting three months ago. We went through a journey to try and get to this point. It's not just about, is my part of the org ready, but is the wider R&D kind of organization ready for a developer experience team? And what does that mean? What's their remit, what's their charter, what's the vision for that team? It's kind of right time, right place, and the reason why it was the right time, right place is because Snyk has been growing really rapidly.
We started off... Well, when I started two years ago, I should say, there were about 100 engineers at that time. Today we have about 350, so it's been quite rapid growth in a couple of years. The other exciting piece underpinning that, as well, is that my cloud platforms team built an internal runtime and developer platform, which we call Polaris, which powers all of Snyk, and it means that we've gone from one production instance to many, many instances. Now we have a second multi-tenant region as of a couple of days ago, which my team launched, and lots of single-tenant instances for big enterprise customers. So it's like, okay, well now that we have all of that, we probably need to think about internal developer experience for this.
From your perspective, has the creation of this DX team been more of a rebranding of things you were already doing? Or was this a new investment?
I would challenge that a DevEx team is an entirely new investment. The reason why I say that is because I've been in engineering now 20 years, and what I have seen time and again in every organization I've been in is that, especially in the early days and when you grow rapidly, you kind of have this hero culture. There's a term for it, a bit derogatory, but I think it sums it up nicely: the janitorial engineer, who really cares about something in particular, like the CI pipeline or good standards or good testing. And then you'll find as you grow that you often don't have core teams around those practices. And then you really need them.
And so then you start to think organizationally, how should we define that, carve out that remit from things that aren't really properly being invested in in the wider R&D organization? So I'd say that I would challenge that it's ever a completely new investment, but for Snyk, it's never had a developer experience team. It's never had a productivity team. It's never had an engineering enablement team. So it's kind of forging a path for, hopefully, more of those kind of teams to come.
That’s fascinating. And you mentioned before this call that there was an intentional naming process for the team, to call it developer experience. What was that process?
Yeah, I mean, I considered a lot of names, and the reason why I chose developer experience in the end is because I felt like it encapsulates the spirit of what we want to do. It's to try and improve the lives of our engineers, make their lives better, help them develop faster, have more joy in their day-to-day work as well, and so it centers the developer in the mission of the team. There are other team names that could have happened, like engineering productivity or production operations. There are lots of different names for this.
The thing that I would say, though, is that developer experience is sufficiently vague that people will think it's all things to all people. And one of the things we did to counteract that, because that did start to happen a little bit, is that we gave a presentation to all of R&D just two days ago at our monthly session to say: what is DevEx? What's our charter? What have we done so far? What will we be doing for you, and what do we care about, which is you, the engineer, and please come talk to us. So that was a really nice way of framing what we're doing and what our remit is, early on.
I’m glad that’s top of mind for you because that was the next question I had. You mentioned the mission of the group is around making developer lives better. Can you elaborate on that charter?
Yep. Yeah, so a couple of things we've done so far. We've got our mission statement, which is always a good thing, and it's something like making Snyk a pleasure to develop within, plus intuitive and safe paved paths, so those are two statements that came together. And the cool thing we did there, as well, is that we said, "Okay, it's not just about what we perceive we should be. We should go out and ask a bunch of engineers about our mission statement once we think we have a draft." Yeah, we had lots of great feedback, even just on that, like, "Mm, this word doesn't quite make sense," and, "Hey, have you thought about this?" And, "Actually, this is what I'd quite like from a DevEx team." It's interesting and challenging coming up with a really short mission statement, but it was good to get that feedback and land on something that everybody felt good about as well.
The second thing that we did most recently is domain modeling, so domain-driven design. We have just finished our entire group's domain model, so all three teams did event storming, and that was really good because it really clarified not only which team owns what, but what the remit of DevEx is. It could be so many things, and what does it mean for this company? It's always going to be contextual, so for us, it was like centering ourselves around the developer platform that I mentioned earlier and making a great experience around that. We also want to bring something like Backstage, Spotify's Backstage, which is an open source project, into the mix as a developer portal, and that will be another big piece of the team's remit.
But that also extends to other things, which is why I bandied about those names earlier. Like CICD: having ownership around that, but also developing the tooling that makes CICD really intuitive and easy to use, with lots of observability and insight into it as well, because that's where we spend a large portion of our working day as engineers, and often it's fraught with problems in terms of understanding and debugging what's going on at each step in the process. And so, yeah, just trying to make that much more robust.
As you went around the company and shopped the mission or charter with developers, what kinds of questions did you hear? And what did you learn was impacting developers most?
So I guess, because it was a charter, we haven't done what I would like to do, which is internal surveys. That will come next. And that's something I think is a really important aspect, because you need the qualitative data and feedback from the engineers. But because this was positioned more as "what do you think about our mission statement specifically," we didn't have as much feedback yet. Having said that, we are pairing with lots of teams already, so we're trying to see exactly how the work happens on the ground. We also have what I call the front door, which is our internal Slack channel, an ask channel, so we can start to see what our engineers are bringing up in terms of frustration points, or things that are thematically coming up again and again where it's like, well, if we just put a bit of automation here, we could solve this problem and nobody would ever have to think about it again, because a huge part of DevEx is reducing cognitive load.
The other thing is that... We had talked about this before, but instrumenting some DORA metrics as well into the CICD pipeline. Actually all the way through, in fact, so that we can see what engineers perceive about the things we're trying to track through DORA, like change failure rate, or how long a commit takes to get to production, versus the actuals, and how often that fails. That's been really interesting and insightful, because we've found through doing that that the CI pipeline that runs through our monolith, which we're in the middle of decomposing, fails more often than people had perceived. And so it was good for us to distinctly break down, okay, well what does failure actually mean in a CI pipeline? What are those states? Is it flaky tests? Is it actually the build failing? Is it some sort of race condition where different pipelines are competing and commits are getting out of order? And try and break that down so the engineers can see it more specifically.
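As an illustration of the kind of failure-state breakdown Crystal describes, a minimal sketch might look like the following. The categories, field names, and data shape here are assumptions for illustration only, not Snyk's actual instrumentation:

```python
from collections import Counter

def classify_failure(run: dict) -> str:
    """Bucket a failed CI run so an overall 'failure rate' can be
    broken down into distinct failure states. The signals used here
    (retry outcome, pipeline stage, commit ordering) are hypothetical."""
    if run.get("passed_on_retry"):
        return "flaky_test"        # same test fails, then passes on retry
    if run.get("stage") == "build":
        return "build_failure"     # compile or packaging error
    if run.get("commits_out_of_order"):
        return "race_condition"    # competing pipelines, commits out of order
    return "other"

# A handful of hypothetical failed runs:
failed_runs = [
    {"passed_on_retry": True},
    {"stage": "build"},
    {"commits_out_of_order": True},
    {"passed_on_retry": True},
]
breakdown = Counter(classify_failure(r) for r in failed_runs)
print(breakdown)
# Counter({'flaky_test': 2, 'build_failure': 1, 'race_condition': 1})
```

Once failures are bucketed like this, "the pipeline fails most of the time" turns into a short list of specific, fixable problems.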
And then that's the kind of thing, gathering that data and then going back and validating with the engineers, like, "Okay, this quarter, we think these are the top three things that we should be solving," and then going out and validating whether or not engineers actually think these are areas that need the investment right now, or if it's something else that's actually much more painful.
That's a really fascinating story. When you said that the perception was actually better than the reality, what was the perception versus what the data showed?
Yes. I mean, monoliths are interesting because everybody has to flow through them in their development process. And there's always angst, I think, with monoliths regardless, and that always comes through a bit. But what's interesting, I think, is nobody had perceived just how often our pipelines were failing around that monolith. People kind of knew that, yes, sometimes it's annoying and sometimes it's bad, but I think it's also shone a light on some things that we probably should spend more effort and more investment on. Things like flaky tests, for instance. It's a broken window scenario, right? A lot of that happens inside organizations: people become adjusted to the way things are, and because nobody has a specific remit to fix the problem, which is exactly why we should have a team like DX, people start to get adjusted to the broken window.
They're like, "Yeah, that window over there is broken, but it's been like that forever." I think the same thing can be said about CICD pipelines often, and it just made me think, okay, well, we can shine the lights on where our practices could be better but also specifically say here we probably need to focus a little bit more on our testing practices. It's probably not okay that the same tests fail over and over in the pipeline because that affects everyone. Anytime there's a delay in that pipeline, you're actually causing a delay to everyone else so it's that cost of delay mindset.
And I know you have a story on that cost of delay model that I'd love to get to, but first, I remember you telling me that when you got started with DX at Snyk, builds were failing 70 to 80% of the time. Was that the number you shared, or is that kind of where you're at on the journey right now?
Yeah. It's one of those things where when we said... We were trying to be very specific about facts, right? We didn't want to put too much of our own judgment on top of that, because that can be really fraught for engineers, especially a DX team trying to establish itself. But it was like let's just put some insight on this. So we said, "Okay, well we can break these down into different types of failure states." And yeah, so what happened there? Yes. About 70% of the time, it was what we considered a failure state. And so you think, "Wow, that is significant. That's a lot of cost to us."
That means also that's a higher degree of pain for our engineers and they're going to be frustrated a lot so it's like what can we do about this. Okay, well we can take this data to the SVP of engineering and the chief product officer and say, "Hey, it's a problem over here." Probably need to invest in this somehow, whether that's through a working group or another mechanism, but we need to make that time investment and that tooling investment to make the lives of our engineers better. And I'd say our SVP of engineering has been a really great sponsor of a lot of this work. He's actually been pushing for this for quite a long time, so it's also good to have that advocate at that level and that sponsor who can then make those decisions around, okay, reallocation of resources to try and fix problems like this, because they're systemic problems essentially.
On the topic of going to executive sponsors and making the business case for investments in improvements, how have you approached that? I know you shared a story about this cost of delay model, so I'd love to hear more about that as well.
So I think what I was saying before is in my previous role, I was a VP of Engineering and so I sat several layers above the day-to-day work of engineers, but I still tried to be very accessible to the engineers anyway. So I sat on the floor with them and would go around and talk to them quite a bit, and I would have engineers come up to me and say, "It's frustrating how little time we're spending on technical debt. It feels like product just wants to ship features all the time." And I just thought this conversation never gets us anywhere. I've been in engineering a long time and there's no data here. It's a conversation that's fruitless and it often leaves engineers feeling like they're not heard. And so I thought there must be a way that we can try and measure this, quantify this, in some way and I need to try and help coach some of these engineers, because some of them were my lead engineers, as well, and so at this level we need to kind of work out how you can start advocating for this type of work and this class of problems and putting that alongside product work as well.
And so we came up with a framework of trying to measure what we could, and trying to say, okay, well let's write a one- or two-page business case. And one of the things that I shared with you before, which we'd done at Snyk... This wasn't at my previous role but at Snyk, which I thought was really cool, is we built a model called the Cost of Delay. One of the books that I reference when I talk about this is How to Measure Anything, and the subtitle is something like finding the value of intangibles in business. It's a great book, a really great book. I would recommend it. It might look like a dry read, but it has completely changed my perception about measuring at work, because often we'll throw up our hands and say, "Well, we can't measure that," and actually it turns out you can. This book is very good at making the argument for that, but also giving you the tools and the models to try and measure something, because it's all about reducing the amount of risk or the quantity of unknown, right? It's about getting closer and closer to the best estimate.
But anyway, we brought in this Cost of Delay model, and it basically quantified things like: how many engineers do we have? We use CircleCI, so how much do our runners cost every time we perform a CI run? What are the Docker images that we use there, and what's the cost of those? What's an engineer's average salary? Things like that. And then saying, "Okay, well, if we have 100 engineers," because that's how many we had back when we wrote it, "what happens if our build takes 20 minutes? What happens if it takes 30 minutes, 40 minutes?" And then we also looked at things like 10 minutes and five minutes as well. And then you start to see just how much money you're wasting just by waiting, just wait time, because that's dead time. Engineers often don't go and do something else, or if they do, then they've completely context switched and you're losing again. But it really was an amazing model to then take to management and say, "Listen, if we have 300 engineers," like we did, say, six months ago, "and your build takes 30 minutes, you're losing about a million dollars every..." I can't remember if it was every quarter or every year, but it's a lot of money.
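A back-of-the-envelope version of that kind of Cost of Delay calculation might be sketched as follows. Both the formula and every figure here are illustrative assumptions, not Snyk's actual model or numbers:

```python
def ci_wait_cost_per_year(
    engineers: int,
    build_minutes: float,
    builds_per_engineer_per_day: float,
    loaded_salary: float,
    working_days: int = 230,
    hours_per_day: float = 8.0,
) -> float:
    """Rough dollar cost per year of engineers waiting on CI builds.
    Ignores CI runner/compute spend and context-switching losses,
    so it understates the true cost."""
    hourly_rate = loaded_salary / (working_days * hours_per_day)
    wait_hours = engineers * builds_per_engineer_per_day * working_days * build_minutes / 60
    return wait_hours * hourly_rate

# 300 engineers, 30-minute builds, one build a day each, $150k loaded salary:
print(f"${ci_wait_cost_per_year(300, 30, 1, 150_000):,.0f} per year")
# → $2,812,500 per year
```

Even this crude version makes the trade-off legible to management: halving the build time halves a seven-figure annual cost.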
That is a lot of money. And you’re working on another model that’s similar, like a cost of incident model, right?
Yeah, so we're trying to see how we can apply this to different areas of practice. I also have the SRE team under my remit, and we're looking at, okay, when we have an incident, how do we quantify the cost? Because often we'll talk about customer impact and we might say, "Well, these customers were impacted, and this was the level of impact because the API was down for X amount of minutes," but that doesn't necessarily tell you a dollar amount in terms of what we're actually losing: the people's time spent responding to the incident, or what a reasonable measure of reputational damage would be, that sort of thing that happens during incidents. It's a harder thing to quantify, I'll be honest, but again, I'm determined that we'll find a model that gives us something, just like we did for our CICD pipeline.
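In the same spirit, a first cut at a cost-of-incident model might count only what is easy to measure. All inputs here are hypothetical, and reputational damage is deliberately left out because, as Crystal says, it is much harder to quantify:

```python
def incident_cost(
    responders: int,
    response_hours: float,
    hourly_rate: float,
    downtime_minutes: float,
    revenue_per_minute: float,
) -> float:
    """Lower-bound dollar cost of an incident: responder time plus
    revenue at risk during the outage. Omits reputational damage,
    which is far harder to put a number on."""
    people_cost = responders * response_hours * hourly_rate
    downtime_cost = downtime_minutes * revenue_per_minute
    return people_cost + downtime_cost

# Five responders for three hours at $80/hr, plus 45 minutes of
# downtime at $200/min of revenue at risk (all numbers hypothetical):
print(incident_cost(5, 3, 80, 45, 200))  # → 10200
```

A lower bound like this is still useful: if even the conservative number is large, the case for investing in reliability makes itself.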
I love the cost of delay and cost of incident models that you're working on. They're extremely useful for creating a business case around things that are visible but maybe not easily quantifiable. So I'm curious to know how you're approaching organizational problems in a similar way. Tooling issues are fairly straightforward to understand, measure, and quantify, whereas organizational issues can be a challenge. I actually have an example: when I worked at GitHub, we were really focused on our monolith, build times specifically. And as part of our DX efforts, we went across the company and started just talking to leaders and developers about what was slowing them down. And despite the fact that we were really interested in the build times problem, when we asked these teams what was really slowing down their lead time, they would share things like requirements quality, or the product management process and churn in their process. So I'm curious about your team's charter and if and how you're thinking about approaching some of those more organizational or cultural problems.
We have had a kind of organization pop up, not that much longer ago than DevEx, kind of ProdOps and Jobs. I don't know if they specifically call themselves that, but they definitely do function a lot like that. So I think there will probably be a tight relationship with them when we go out and collect information from teams, because we'll be doing a lot of qualitative research and working with teams. I'll also say, not just working with engineering, but also with product, because I think it's fundamental that we understand both sides of the argument for why things might be painful or slow, and what we can improve.
And so we'll be speaking to both sides of R&D on that front and then bringing that feedback into probably more like Eng and ProdOps, to look at areas of investment and where to spend that kind of effort, trying to fix process problems. For the DX team specifically right now, because it's so new, I don't know how far into that area it will go. I would love it to, and the thing I would say as well is that it starts off as one team, but I could really see DevEx becoming its own group at Snyk, comprised of more than one team. But for now, it's working closely with this Prod and Jobs team and trying to help them figure out what to put on the table next.
Building on what you shared about how you’re thinking about approaching some of these organizational problems, how do you see overall the evolution of the developer experience function over time?
So the way the team's remit is set up today is shaped by the fact that I look after the infrastructure group at Snyk, and some people might look at that org design and say, "That's a weird place to put developer experience," because it sits alongside cloud platforms and SRE. What I would like to see it evolve into, a bit like that analogy of giving your Legos away, is its own group, because it needs to look at the developer experience holistically. Right now the team's remit is focused a lot on the infrastructure layers, this internal developer platform that we've built, and CICD, but I think it could expand a lot more, because there are probably lots of needs on the front end, for instance, that we won't be serving immediately.
But also, having just done this domain modeling exercise, it's been interesting, because the team is being spread across quite a few different areas and systems in terms of ownership. There are also questions around GitHub ownership: how do we make sure we've got the right information architecture there, and that people are getting the right access? And there are other questions around... One of the most recent was our feature flagging capability at Snyk, and should DevEx be responsible for that, or at least be involved in it? These are all areas I'd like to see them expand into.
And like you said before, in that last question you asked me, that's absolutely an area I'd like to see them move into: the more structural issues of the day. It's not just about tooling. It's absolutely about people, process, and tooling, so we need to look at all of those holistically and, I would say, use systems thinking. Where are the bottlenecks? Where are the inefficiencies? To be frank, what makes their day feel like it sucks, and make it better.
What is the landscape of tooling at Snyk? Aside from the internal developer runtime and the monolith and the builds you mentioned, is there a long tail of other things that the DevEx team should potentially be owning or driving?
There's not so much of a long tail, at least that I'm aware of. These things are always hard to have a complete handle on when you have this many engineers, right? That's kind of the curse and the beauty. But I think my vision really is about paved roads, paved paths, and focusing more on that: providing the paved road that other engineers can follow, but also providing ways they can step off of it safely, in a way that doesn't result in a proliferation of shadow engineering, shadow IT, that we then can't get a handle on. But also even from a spend point of view, right? One of the things I saw when I first came into Snyk was that my SRE team was running an in-house Prometheus stack, which unfortunately fell over a lot. And I just said, "Let's outsource this. Let's move this to a hosted option, because this is not a good way to spend our time."
But because that platform was so unstable, other teams had started spinning up Datadog, and then you're like, "Wait, hold on. We're paying for the same capability twice," and they're just getting slightly different outcomes from it. And, you know, one team thinks it's more stable than another, so how do we reduce that kind of surface area in terms of all the different systems and capabilities we could have? I think infra drives a lot of that, actually, and I think developer experience can also say, "Okay, well here's the paved road. We're going to make this so easy for you and so good that you don't want to do anything else unless there's a really critical reason to." And that's a lot of what I think developer experience is about: providing that really lowered cognitive load, like I said, but also out-of-the-box defaults, so you can just get up and running.
So, kind of looping back to your original charter of making developers' lives easier: how much of that do you think your team can impact, as opposed to things that local teams need to do within their own area of the code base or their local environments? Where does that responsibility or potential lie?
So I would say developer experience exists everywhere. You don't necessarily need a core team; developer experience can never be owned by a single team. What I think we do by having these core teams is bring the practice in and say, "This is what good looks like," then get teams to start adopting those practices and advocating for them, almost becoming a set of champions across the organization who drive that same culture and practice within their own teams, because we can never be in all places at once. We're just too small compared to the rest of the organization.
And I would also say, if these types of teams are getting too big, it's probably an anti-pattern. It's probably an organizational smell that they're becoming the bottleneck to these kinds of standards and practices. But it's also about what they call nudge theory, nudge culture, which is: if you see an engineer doing a good thing over here, then other engineers are going to be like, "Hey, I want to do that good thing." So what I would like to see is that it naturally, organically turns into a culture where we want to drive better standards and practices through the work that we do, and I think DevEx can bring that.
So what's the state of Snyk today as far as whether local teams are aware of their own developer experience? Are they making sufficient efforts to improve it?
That's a good question. In some cases, yes, and in other cases, no, probably like every company. Sometimes you'll see, especially in an organization this size, where we have about 50 teams, some teams are very much caught up in the, "We need to ship stuff. We just need to ship, just need to ship, just need to ship." And then there are other teams taking a much more critical look, like, "Okay, it's not ship at all costs," because that also comes back to bite you quite quickly. So what have some teams done? There's this really cool thing happening in my area, the wider platform division: a process called Shape Up. I think it came from Basecamp. They started doing something called cool-down sprints, which I think is actually really cool and something I'd like my area to try and adopt.
That's one of the things that came out of Shape Up. There's also this book called Thinking, Fast and Slow, and that book really got me thinking that there's the doing time, and then there's the thinking time, and engineers can get caught in nothing but doing time. Then they don't pause to reflect on the constraints that they have, the trade-offs that they're making, and where those investments should really be made. So I do think we could see, even from a process standpoint, wider adoption of: okay, we purposely take a pause on whatever cadence, six to eight weeks, to consider the next tranche of work ahead. What should we do about it? How should we approach it? Should we be carving up our time investment in a different way? Yes, we still have to launch some product features, but perhaps some portion of that time is allocated to quote-unquote tech debt, right?
The other thing I've seen cropping up at Snyk is a lot of guilds, which has really been, for me, a good thing to see, because it's a place where we can say, okay, there's a class of work here that seems underinvested in, that we would like to be better at, and the engineers who opt into that guild can spend a portion of their time on it. We've also said that all guilds have an executive sponsor, which is good, so that we make sure there's buy-in to the time that they spend.
This cool down process, is this something you’re seeing as being implemented org-wide?
I think it might be a great thing to look at org-wide, and this is where nudge theory comes in again. We do these monthly R&D sessions with demos and such, and they're not always just about the tech we built or the products we launched. They can sometimes be about behavior and culture as well. This is where it would be great to see a presentation around stuff like this: why did we implement this inside our platform division, and how might it help you? Because I think that's how it starts. My area, for instance, has driven a lot of different norms and practices that got adopted across Snyk, like some processes around architectural design, ADRs, some practices around story mapping, et cetera. And I do think it starts with one team and can evolve to become org-wide.
You mentioned there are some teams at Snyk that are probably more aware of and focused on developer experience, and others that aren't. Do you see it as your team's job to evangelize the culture of caring about developer experience? Or does that need to come from higher up, from product leadership or even general executives?
So that's a really interesting question. I think it has to come from all of those areas you just mentioned, is the truth. Because the more the message is there, the more it's present, the more it resonates, the more likelihood there is for success. And there is already a lot of sponsorship, like I said, at the SVP and chief product officer level. Executive level, we're still working on that. But Snyk's an interesting organization because it's a technical one. We build tools for engineers, right? We're almost like a developer experience company ourselves. We build security tools for engineers, and we have a DevRel team in the wider open source community, so it's about taking those same evangelistic practices and applying them internally. We've also started working with that DevRel team to ask, "How do you work?" and, "What could we take and what could we try?"
And one of the things I would touch on, which I tell my own teams, and which I think stands an even greater chance of success: advocating for yourself is okay, but finding your early adopters, your zealots, to then go and shout about the great work that you did, that's so powerful. That's where you want to end up: getting those couple of internal teams to go, "I was an early adopter and I'm just going to say this is amazing." And we've had that happen for our platform a couple of times, where we've had teams say... There was a particular example, like, "There wasn't this particular cloud resource available to me, so I created the Terraform module. They had a playbook. Followed it. Did it. It's done. Took me like two hours."
And a lot of people thought that was an impossible feat, so it was good to have that person go to a demo session and actually demo something about our platform for us. It was a demo all about something he'd built on the platform, extending the ecosystem. And I think that's a much more powerful message than us going, "Hey, look at our great platform," or, "Hey, look at our great standards." Yeah.
Well, Crystal, it's been such an insightful discussion. I really appreciate you coming on the show today, and I really enjoyed this conversation.
Thank you so much for having me. It's been a real pleasure to share my experiences with you.