Podcast

The science behind DORA

In this week's episode, we welcome Derek DeBellis, lead researcher on Google's DORA team, for a deep dive into the science and methodology behind DORA's research. We explore Derek's background, his role at Google, and how DORA intersects with other research disciplines. Derek takes us through DORA's research process step by step, from defining outcomes and factors to survey design, analysis, and structural equation modeling.

Timestamps

  • Derek’s transition from Microsoft to the DORA team at Google
  • Derek talks about his connection to surveys
  • Derek’s journey to becoming a Quantitative User Experience Researcher
  • Derek simplifies DORA
  • DORA: philosophy vs. practice
  • Understanding desired outcomes
  • Self-reported outcomes vs. objective outcomes
  • Derek and Abi discuss the nuances of literature review
  • Derek details survey development
  • Pre-testing issues
  • Designing surveys for other companies
  • Derek simplifies model analysis and validation techniques
  • Benchmarks: Balancing data limitations with method sensitivity

Listen to this episode on Spotify, Apple Podcasts, Pocket Casts, Overcast, or wherever you listen to podcasts.

Transcript

Abi Noda: Derek, I’m so excited to finally have you on the show. Thanks for your time today.

Derek DeBellis: Yeah, well, super excited to be here.

Abi Noda: We’ve done two episodes with your colleague Nathen Harvey on the Google DORA team, where we’ve talked about what DORA is, the findings, how the findings are used. Today I’m really excited to talk to the man who is behind the scenes gathering the data, creating the reports, doing the analysis, so really excited to give this perspective to listeners. I want to start today by learning a little bit more about your background. I’ve had Nicole on the podcast before, who created DORA, so how did this fall on your plate?

Derek DeBellis: Yeah, I’m not entirely sure if the story changes in my mind or anything like that, but I was working adjacent to DORA, trying to ask similar questions using similar methodologies, and eventually they needed someone to sit in that seat, do that method, and I had experience way back in my career just really running with the same methodology that DORA has. It’s always made a lot of sense to me, and we could dive into what that method is later, of course, but I guess I was just right next to DORA doing really similar things, and when they needed a researcher to step into DORA, I guess I was the closest and most obvious choice at the time.

Abi Noda: Makes sense. Well, one thing we talked about that was funny was that you were at Microsoft and Nicole was at Google, and now you’ve swapped places, which is pretty funny. But share with listeners: you mentioned you were doing other similar research. What kind of similar research have you done?

Derek DeBellis: I was in Microsoft Office working with Microsoft Research, just so they don’t think I’m stealing Microsoft Research cred or anything like that. But some similar research that I was doing was a lot of survey work, a lot of survey methodology. Back then it was a lot more logs too, because I had access to internal user data to see how certain behavioral patterns played out. But structural equation modeling, which you’ll see is the engine that makes DORA churn for the most part, was easily my favorite class in graduate school. And then when you learn something and you really like it, you start just applying it to things. At least I do. Maybe it’s like when the only tool you have is a hammer, everything looks like a nail. So everything seemed at least approachable using that type of methodology.

So I always found myself in situations running large surveys and trying to make sense of how certain factors in that survey were impacting other factors, and trying not to trick myself too much about spurious relationships. And I think maybe just because I like the methodology so much, at Microsoft I ran a lot of large-scale surveys. When I got to Google, one of the first things I thought about doing was running a large survey to understand what the people using our tools were interested in accomplishing, and then that extended pretty nicely right over to DORA.

Abi Noda: What is it that you like so much about surveys?

Derek DeBellis: Well, we could probably get into all their limitations. I guess it lives in this nice middle place of being qualitative, where you can get some depth and you can get some sense of what someone is experiencing, but you can move that qualitative hue to scale. You could start saying, sure, I didn’t do a qualitative deep dive like a qualitative researcher might with 10 participants, but I can talk about this at a very large level kind of thing. And, I think this is my bias, when you have a product that millions of people are using, or you’re talking about a space where millions of people are involved, I just think there’s a lot more variance than 10 people can get at. That’s just my opinion. I think it’s super important to do the 10-people thing. There’s just something nice about being able to try to capture the enormous variability at that population level; it seems really important to me.

And then there are logs. Logs are great: behaviorally, you’re seeing exactly what’s happening, but there’s no story to it. You have to do a lot of work, and there are a lot of leaps you need to make to understand exactly what’s going on. And I love that connected with survey data, but I think by itself it’s tough to draw the conclusions I really want to make, which is really, are we providing value to people? It’s hard to look at someone click on a button and say, it looks like it’s valuable to them.

Abi Noda: Well, I’ve also had, not your direct colleagues, but Collin and Ciara from Google, who focus on developer productivity research internally, and they shared a lot about how they’ve triangulated the logs data and the survey data and interesting things they found. So I would point that out to listeners. You touched on graduate school. I want to ask you, there are a lot of different paths through which people can get into survey stuff. There’s industrial-organizational psychology, there’s psychology. What’s your background? What led you to this place?

Derek DeBellis: The short of the story is I was studying experimental psychology. I had a particular interest in existential psychology, which is what happens when we hit the boundaries around us, like death. When we think about death, what happens to us; when we think about feeling isolated from other people, what happens to us? I got really into that, and that led to how to do experimental design, and that led to how to ask people questions in surveys about this. And then, realizing there aren’t a lot of academic opportunities unless you’re pumping out 35 papers and you have someone who knows 75 different academics that are ready to give you a job, I directed myself towards more industry work. I didn’t know what UX research was. I’m technically a quantitative user experience researcher. I didn’t know what that was, but then I saw the job descriptions: surveys, experiments, some other methodologies that I wasn’t familiar with. And I thought, oh, I could do this. This is my jam. This is my bread and butter kind of thing.

Interestingly, I worked with Microsoft a little bit when I was at UserZoom, and just naturally floated over and worked there, and now I’m here at Google.

Abi Noda: What is DORA? In layman’s terms.

Derek DeBellis: I think the basic recipe for DORA is: figure out where people want to go, what people want to accomplish; figure out some things that might help them accomplish this, or might not; and then try to quantify that relationship. I think that’s the basics of it. That’s really all we’re trying to do. There’s a lot of nuance behind not doing that wrong, but yeah, that’s pretty much how I think about it.

Abi Noda: And something we were chatting about before the show, what are some of the lines of theory or existing research methods that DORA is connected to? How can people think about DORA in the context of prior research and methodology?

Derek DeBellis: We were actually just talking about this as a team, because we’re trying to understand: is DORA really just DevOps? And then people were saying, well, is DevOps a practice? Is it a philosophy? Because when it started, and I think until 2014, 2015, it was pretty scoped to what DevOps was, what it was interested in studying, and then DevOps became more of a philosophy than an actual set of practices in a lot of ways. And that expanded and expanded. Now it’s touching pretty much everything. There’s not a theory you wouldn’t be interested in applying here. And by that I mean organizational psychology, like you mentioned earlier; there are so many interesting trends about how people work together to be able to accomplish something. And my background, I said existential psychology, but that’s a subset of social psychology, really fits in with this too: how intergroup dynamics work, and how people convey and transact information with one another.

So then that gets you into the knowledge-sharing literature, which is what we found last year: teams that can share knowledge with each other are teams that work the best. When there are knowledge silos and people struggle to get the information they need to do their job, unsurprisingly, a bunch of downstream metrics look very bad. And also leadership. There’s a bunch of literature on transformational leadership and how leadership can support teams, and we’re finding that there are various ways that leaders can be involved that either enable a team or hinder a team. And then there’s a whole set of literature, maybe not even academic literature necessarily, but just a whole worldview, a set of philosophies about how engineering should work, that this is connected to.

And we also try to keep a pulse on emerging interests in the engineering space, so that we can say, everybody’s trying AI, let’s see if that’s actually leading to what we would hope it would lead to, which is better software delivery performance, increased well-being. So that’s all to say, as DevOps became more of a philosophy than a set of practices, it started touching on pretty much everything. Now we’re designing the survey for 2024, and it’s getting more and more general. We’re trying to make sure we keep an eye on some particular technical practices, but there are a lot of questions we want to ask about the work environment, for example.

Abi Noda: Let’s move into what your process looks like end to end, more tactically. You mentioned you’re in the midst of it right now, getting ready for the 2024 survey. I want to walk listeners through the whole process here, though, and you mentioned to me that step one is really figuring out the things people care about, the outcomes. So share with listeners how you approach that.

Derek DeBellis: Well, if we’re not trying to study what people are trying to accomplish, we’re going to have a relevance problem. We want to meet people where they’re trying to go, not where we think they ought to go, or something along those lines. So we do that through a mix of qualitative research, where we’ll just try to understand the outcomes that engineers, and people adjacent to engineers, at different levels and different experience levels are trying to accomplish. There’s a really thriving DORA community, and if you just sit in one of those talks, you can hear all the things people are trying, all the things people want to accomplish, all the things people don’t want to accomplish.

And then we also try to add things, even if people don’t say them, because sometimes they get left out. Things like well-being, we try to make sure that’s there as an end in and of itself. Some people think of burnout as a means: if you have burnout, it’s going to prevent you from getting to the other goals. But we try to make it a goal in and of itself. Well, burnout as a goal, something to avoid, when you’re doing all this.

Abi Noda: One of the things that’s interesting about the outcomes you measure is that they’re all self-reported. I’m curious, as someone who’s been doing this for a long time, what’s your view on self-reported outcomes versus objective outcomes? I’m not sure that’s the right terminology, but for example, looking at a company’s stock price, whereas in your research you have self-reported outcomes of organizational performance.

Derek DeBellis: I think there are definitely limitations to the survey methodology. I think one of the biggest challenges is, just imagine you’re asking somebody to talk about things that are a little bit more distant from where they are. So I feel pretty confident that if I ask somebody about how their team is doing, you’re probably going to get a pretty good answer, because they have access to that on a daily basis. Maybe if you ask somebody about how their product is doing, they’ll probably be able to give you a pretty good answer, because it’s relevant to them. It’s part of the conversations they have. Maybe they don’t have a lot of metrics for it right now, so it’s a little vague, but they’re close to it, they’re adjacent to it, they’re experiencing this kind of talk. But then once you push it towards something like organizational performance, there’s a little bit more of a gap, an abyss, a gulf between them and organizational performance.

Sure, some companies have meetings that you could show up to where they’ll tell you all about the stock performance and things like that. I don’t know. I try to go to those and be a good employee. I don’t remember much. A lot of it goes over my head. I usually come out with a sentiment, good or bad kind of thing. So I wouldn’t be a very reliable person to ask about org performance, or if I was, I’d be biased, maybe by what it’s like to be on my team, or what’s going on in the media, or something along those lines. And it wouldn’t be a very concrete measure. It’d be much more of a sentiment. You can’t escape that. So we have been, the last year or two, trying to move to more, for lack of a better word, proximal measures that someone besides maybe a leader or a CFO can answer pretty confidently.

We keep organizational performance in because people care a lot about that, but we’re trying to recognize that we’re better off staying closer to home for people so they can answer it with a higher degree of confidence and we have better data. The organizational performance might be just more of a sentiment of what it’s like to work there.

Abi Noda: To some degree there are advantages to that as well, because stock price isn’t everything. There are a lot of confounding factors that go into stock price, and so that self-reported view of organizational performance is needed regardless, I think, in this type of work. Go ahead.

Derek DeBellis: I was just going to say how much I completely agree with that, and also, to that point, we mentioned confounds. There’s the gap between what that person’s doing that you might want to measure, say, this team’s use of these technical capabilities, and organizational performance. If you’re in a huge org, there are so many mediating factors that have to carry your work to the overall organizational performance. So maybe if I’m at Google and I’m working on search, there’s a pretty close relationship. Maybe if I’m at Meta and I’m working on the news feed, there’s a pretty close relationship with the organization’s performance. But if I’m working on a set of tools for a small set of developers, my team could be awesome, but the connection with organizational performance, well, that’s going to be a tough line to draw without having a ton of confounds. And what’s the direction of causality? Because a well-performing organization obviously helps. It’s a lot easier to work in an organization that’s doing things well. So yeah, just wanted to add on.

Abi Noda: So you establish these outcomes and measurement approaches for them. Then the next step is hypothesizing on what the drivers are that would predict or affect these outcomes. And you do that through a variety of ways as you’ve mentioned, but one of them is through literature review. This is something I’ve done as well. It’d be fun to share experiences. What does that process feel like or look like for you?

Derek DeBellis: Yeah, I think one thing that strikes me is, if you talk to somebody that’s in the middle of trying to accomplish these things and then you look at an academic article, however rigorous and important that academic article is, it’s a very limited perception compared to what you’re hearing from people that are actually doing it. Maybe that’s because, to have a good academic article, it needs to be scoped and it needs to be specific, and it needs to dot all the I’s and cross the T’s to make sure it gets published through a peer review process. That’s just one thing I’ve noticed about the literature reviews: the academic articles tend to be so incredibly scoped that I almost feel as if I get more out of just talking to 10 people about what people are trying to accomplish.

That’s not to say the academic articles aren’t useful, because you can look at their analyses and see what they found, and you can use that to help develop your hypotheses and your priors in a much more rigorous way. But if you’re just trying to understand the broader sense of what people are doing right now, talking to people and just sort of listening to what’s going on in the community is almost just as useful, if not more useful. I’m curious what your experience with literature review is.

Abi Noda: Yeah, we’ve done several, and thinking back to the most recent one we did, grey literature was a big part of that. Talking to people was a big part of that, in addition to looking at the peer-reviewed literature. And the other thing that I realized in that process was just how early we are in terms of literature in this space. You look at other fields and you can do convergent validity studies because there are four different ways of measuring a concept or a construct. Whereas here, I think for both of us, when we’re trying to figure out how to measure something like code review experience or documentation quality, there isn’t even really a conceptual model defined for these constructs, and certainly not an established measurement model for these constructs. So that’s something that I found interesting.

Derek DeBellis: Yeah. No, me too. That’s a great… I’ve been amazed at some… I was studying, like I mentioned, social psychology, and this was in the middle of a huge replication crisis and, pretty much, a generalizability crisis. It wasn’t looking good for all these previous findings, and they tried a lot to remedy this, just because they ran into the problem. So I look at some journals a little closer to this space and I think, oh, they just haven’t run into this problem quite yet. And that’s not to say that they will, but it changes the method and it changes the peer review process, just the history of a field, I think.

Abi Noda: Absolutely. Yeah. I always tell people I’m really excited for the next 10, 20 years of research, especially as things like developer experience, that qualitative side of software development, become a more popular theme or area of interest commercially. I think, hopefully, we see research also follow and build up a much larger body of literature from which we can work. I want to move into the next step, and this is going to be one of my favorite parts of this conversation: the development of the survey items and pre-testing. Take us through how you do that.

Derek DeBellis: Yeah, so at this point we have a sense of our models, that is, how we think the world works in this particular space. We also do try to figure out a lot of the confounds. So what might be underlying these effects that makes it look like there’s a relationship when there really isn’t? An example might be organization size. A large organization might be more likely to do something, and maybe more likely to have, I don’t know, better software delivery performance. This isn’t true, this is just an example, but that could make it look like there’s a connection between the technical practice and that high software delivery performance when it’s not really there. And if we don’t have that confound, we report it, and people look at it, and it’s actually not really there kind of thing. It just so happens those two, the practice and the outcome, are listening to the same thing, organization size.
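A toy simulation makes that worry concrete. Below, a hypothetical confound (organization size) drives both a practice and an outcome that have no direct link, so the naive correlation looks real until you control for the confound. The variable names are made up, and this is purely illustrative, not DORA’s analysis code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 3000  # roughly the scale of a DORA survey sample

# Hypothetical data: organization size drives BOTH the practice and the
# outcome, but the practice has no direct effect on the outcome.
org_size = rng.normal(size=n)
practice = 0.6 * org_size + rng.normal(size=n)  # e.g., adoption of a technical practice
outcome = 0.6 * org_size + rng.normal(size=n)   # e.g., software delivery performance

# The naive correlation looks like a real practice -> outcome relationship.
r_naive, _ = stats.pearsonr(practice, outcome)

# Residualize both variables on the confound and correlate the residuals
# (a partial correlation): the apparent effect largely disappears.
resid_p = practice - np.polyval(np.polyfit(org_size, practice, 1), org_size)
resid_o = outcome - np.polyval(np.polyfit(org_size, outcome, 1), org_size)
r_partial, _ = stats.pearsonr(resid_p, resid_o)

print(f"naive r = {r_naive:.2f}, partial r controlling org size = {r_partial:.2f}")
# Typically prints something like: naive r = 0.26, partial r = 0.00
```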

So we try to capture these confounds before we do the survey, because if we don’t have them in our data, we can’t account for them, and then we’re likely to give biased results. And maybe sometimes my only goal is to prevent us from being very wrong and make us more likely to be right than wrong. But yeah, the operationalization part is what happens here too. So we have all these concepts that are really hazy, and like you mentioned, the literature might not have a set of survey items that you could just grab and run with for something like trunk-based development. There might not be a set of survey questions that you can learn from just reading a couple of articles. And this is the more artistic component of research, I think: when you take a concept and you say, how in the world would I figure out how to measure this concept?

What’s indicative of this concept being present in the world? What does it mean for someone to be burnt out? What would I ask somebody? If I had three questions I could ask you to figure out if you’re burnt out, what would I say? What would I ask that groups together and captures what it means to be burnt out, or a technical practice? If I really wanted to figure out what loosely coupled teams look like, what would I ask you to figure out if your team was contingent on a bunch of other teams? So we go to a bunch of subject matter experts in the particular constructs and pretty much ask them just that: what would you ask somebody to figure this out? I’ll give you three questions to diagnose whether, not like it’s a medical thing, but to figure out if this is happening on this person’s team.

And we go through that process. Inevitably they generate 700 questions, and then the pre-testing happens. That’s where we’re trying to figure out, with these extremely technical concepts, does anybody know what we’re talking about? Especially if we’re trying to have a broader audience than just the 75 people who are experts in this kind of thing. Everybody who takes the survey should at least be able to say, I don’t know, or something like that. So we go through the pre-testing where we’re really trying to make sure: is our survey way too long? We’re trying to make that better all the time, but that’s an uphill battle that we’re always fighting. Last year we got it under 15 minutes; this year we’re going for 10 minutes. No one likes me for doing that, but I would rather have a little bit of good data than a lot of data that’s all bad, because the person’s like, what are you making me answer 75 questions for?

So, is it a decent length? Can they comprehend these questions? What does the cognitive load look like? When we ask them a question, they might understand it, but are they able to, and this goes back to what we were talking about earlier, retrieve relevant memories about organizational performance? What are the memories they’re drawing on to answer this question? And then, the way they would naturally answer it: do our answer choices capture that? Because their natural answer might look very different from the seven answer choices we give them. And at the end of the day, we have a few questions at the end of the survey and pre-test just about how easy it was to take the survey, how much effort they had to put into it, and we try to get that below a certain bar, to make sure we can put it out there and not make people really dislike us for the survey and drop off, of course.

Abi Noda: Well, you definitely sound like a trained psychologist to me in terms of describing the pre-testing process and what you look for. I think that that’s so important to call out, like you mentioned, cognitive load and comprehension and the ability to recall and actually map to a response, and whether those responses are appropriate and whether that response actually meets the objectives of the item that you’re designing.

Derek DeBellis: Yes.

Abi Noda: I know our process for our research has been much like what you’ve described. We also do pretty structured cognitive interviews, where we’re similarly testing for the things you’ve brought up, coding the interviews, and refining over time. It’s a fascinating process. I really enjoy it.

Derek DeBellis: Yeah, it’s super enjoyable, and it gives you something when you’re actually analyzing the data, to be like, I think this is how people are thinking about this. And also, I don’t know about you, I’d be really curious: I catch a lot of questions that we should never ask, and I’m so happy we do that kind of work, because you get stuck in your own world sometimes. Obviously someone would know exactly what this means, and obviously they’d be able to answer it, and then you talk to someone and they’re like, what is any of this right here kind of thing. It’s humbling almost. Yeah.

Abi Noda: It is very humbling and it’s so interesting because it’s like you’re testing the human mind, you’re probing, and it’s incredible, the edge cases, the language. I was just in a conversation with an interview subject about, for example, the difference between the word adequate and sufficient, and how that interpretation was meaningful in their comprehension of the question.

Derek DeBellis: Interesting.

Abi Noda: Interestingly, after the interview, we looked up the definitions, and in law everything has a strict definition, so the legal definitions of sufficient and adequate are different. But what was funny was that this interview subject’s own interpretation was the opposite of the legal-

Derek DeBellis: Oh, interesting.

Abi Noda: So the depth of the problem is humbling.

Derek DeBellis: Yeah. Well, I think there’s a whole strand of philosophy that is language as usage: understanding how people actually use it, versus creating analytic definitions and then just saying, this is actually the word, this is what bird means kind of thing. Actually, let’s just go look and see how people use the word adequate or how people use the word sufficient. And that’s probably our best bet for having people take a survey. We want to understand how it’s used; maybe the legalese is irrelevant, because at the end of the day, if you have 10 people tell you this is how they understand it and this is how they’re going to answer it, you probably want to go in that direction instead of, I don’t know, defining exactly what adequate is relative to sufficient or something.

Abi Noda: And we haven’t even gotten into the challenge of cross-cultural or language translation, which is a beast of a process. I don’t know if we even need to get into the technicalities of that, but I do want to ask you: I shared this funny example of adequate versus sufficient, but what are some of the biggest issues you tend to find in your pre-testing?

Derek DeBellis: Yeah, first of all, a lot of times it takes people a lot more effort than we would hope, especially when it’s somebody who, we want them in the survey, but they’re not necessarily an expert on this particular technical capability or technical practice. I think the cognitive load that gets put on people to go through a really technical set of survey questions, even if they are experts in it, is tiring. Sometimes it feels like, and they say this, like I’m reading a legal document, because we want to be really specific. We want to make sure, when they answer it, that we have really set the boundaries on what this answer means. But a lot of times that means you have to write a long thing and explain and have definitions, and it’s really taxing.

And after a couple of minutes of going through taxing questions like that, the amount of effort that people can and are willing to put into the survey goes down, and therefore the answer quality goes down, and therefore your analysis means less and less and less. So yeah, simplifying the questions that are so incredibly technical, because we’re in a technical space, we always run into that problem. It’s a guarantee. I have a pre-test up right now, and I’m going to put my money on it that when I get some of the videos back of participants taking it, that’s what I’m going to see a lot of.

Abi Noda: Yeah, it reminds me, you’d mentioned to me that your team or Google helps Google customers sometimes who come to you all and say, Hey, can you help us design and run a survey? How does that process go?

Derek DeBellis: Right now it’s a super informal and very loose process, but a lot of times people will look at the DORA report and they’ll say, great, I like seeing these patterns. We’re going to try to apply a couple of these patterns in our organization, but it would be incredible if we had our own internal DORA. In a way, they have log data, they have how the company’s doing; think of all the extra stuff they could attach to survey data that’s similar to DORA’s. So we get a few people, a few companies, organizations, teams within organizations, that are interested in just applying this method within their own organization. And we try to help. I think a lot of people don’t know about all of the methodological rigor that Nicole set up earlier in all this. And then they come and they’re like, oh, so what do I do? Do I just do a survey?

And then the answer is, well, sort of. It’s in the data there, but you’ve got to massage it a little bit to be able to get some pretty confident results. So then a lot of times they bring in a data science team to help, or something along those lines. But we love just saying, well, here’s how you ask these questions. Here’s how we calculate the answer, the numeric value, to these questions. Here are some of the analytic techniques we use, along those lines. And I’m actually working with someone on the team right now to try to open source a lot of the DORA deep-cut stuff, the method, the questions, how to ask the questions, how to score the questions, just so teams might be able to get in a better position to do this type of work without being at all contingent on us. I’m sure they’re not very contingent on us, but at all contingent on us.

If they decide they want to ask these questions and run the analysis, they’ll have the code, they’ll have the questions, and they’ll be able to do it on their own if they want to, if for some reason the mood strikes them.

Abi Noda: You’ve joked before that a good portion of these customers come to you and say, can we just put these questions in a form and send it? And then once you teach them what’s involved, they’re like, God, never mind.

Derek DeBellis: Pretty much. I can almost guarantee that happens about 80% of the time, because with the report, we try really hard to make it super accessible, because we want everybody that wants to read it to be able to read it. But that often glosses over what we’re talking about right now, what’s behind it, what generates this perspective. And yeah, it’s not the most complicated thing you’ll ever do, but it’s not as easy as pressing play. I’ll say that much for sure.

Abi Noda: Yeah, for sure. So you develop this survey, you launch the survey, you go out and get a bunch of responses, then you go into the analysis and the cleaning. Talk about that a little bit.

Derek DeBellis: Yeah. Well, the two times I’ve had a chance to lead the analysis, it’s been a very short window of about three to four weeks between having the survey data and the launch date for the report. It’s a sprint. So there’s the cleaning process: trying to find, pretty much, people who may have not taken the survey in good faith, without screening out people who may have taken the survey in good faith, and not letting your hypothesis sneak in, like, well, this person probably didn’t try because these two answers couldn’t possibly have been selected. So it’s about the ratio of noise and signal, making sure there’s more signal than noise. Then we go into an exploratory factor analysis; that’s one of the first things we do after the data’s cleaned. And the reason is, with an exploratory factor analysis, we don’t input our idea of how these things should group together.

So when we put all these questions into an exploratory factor analysis, it shows us how they group together without us providing much of an opinion. Confirmatory factor analysis is when we say, test whether these four things group together, so our theory is driving the analysis. With the exploratory factor analysis, we can see if our theories naturally fall out of the data. We like to do that because we feel it’s a higher bar, because if you do a confirmatory analysis, you can get a good fit a lot of times, and that makes you feel like your theory is right, but in reality it’s because you didn’t test all the other possible combinations of how this data could have looked. And that’s why we go for the exploratory factor analysis.

From there, if our theory is looking good, and especially for the constructs that really are aligned with how we’re thinking about them, or make just a ton of intuitive sense even if we didn’t think about it earlier, we move that to a confirmatory factor analysis, where we can say, this is how we think these items group together. And then from there, that’s our measurement model. We can say, do we have a good sense of, as you were saying before, the mental model of the respondents and how they’re thinking about these various items? And then we can see how they relate. And that’s the structural equation model: understanding the relationships between all the various measurements that you have in there.
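As a rough sketch of that two-stage flow, here is what EFA, then CFA, then the structural model might look like in Python. The item names and data file are hypothetical, and the factor_analyzer and semopy packages are stand-ins for whatever tooling DORA actually uses:

```python
# pip install factor_analyzer semopy pandas
import pandas as pd
from factor_analyzer import FactorAnalyzer
import semopy

# Cleaned survey responses, one row per respondent; column names are made up.
df = pd.read_csv("survey_responses.csv")
items = ["burnout_1", "burnout_2", "burnout_3",
         "coupling_1", "coupling_2", "coupling_3"]

# 1) Exploratory factor analysis: let the items group themselves, without
#    telling the model which items belong together.
efa = FactorAnalyzer(n_factors=2, rotation="oblimin")
efa.fit(df[items])
print(pd.DataFrame(efa.loadings_, index=items))  # do items load where theory expects?

# 2) Confirmatory factor analysis: assert the theorized grouping (the
#    measurement model). The final "~" line adds a regression between the
#    latent variables, which turns the CFA into a structural equation model.
desc = """
burnout  =~ burnout_1 + burnout_2 + burnout_3
coupling =~ coupling_1 + coupling_2 + coupling_3
burnout ~ coupling
"""
sem = semopy.Model(desc)
sem.fit(df[items])
print(sem.inspect())           # loadings plus the structural path
print(semopy.calc_stats(sem))  # fit indices such as CFI and RMSEA
```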

Abi Noda: I loved your explanation of EFA and CFA and some of the potential dangers of CFA, confirmatory factor analysis. I want to ask you in a little more detail about how you find the relationships for the structural equation model. Folks listening are probably familiar with what a linear regression is, but there are a lot of other things out there. There’s PLS, which is a method we recently used in our latest study. In the DORA report there was some mention of leave-one-out cross-validation and Watanabe, I can’t even pronounce it. So give us an overview of what these different approaches are and what you all use for your research and why.

Derek DeBellis: Yeah, so the first year I analyzed the data, in 2022, we used PLS, partial least squares. It’s a type of structural equation modeling that, I think, really optimizes based on R squared, pretty much: how much of the variance in the outcomes you’re able to account for with your explanatory variables. This year, or 2023, the report, the analysis behind that was a bit more scoped. Instead of creating a giant structural equation model, we started doing very scoped, smaller, hypothesis-driven models. And the reason behind that was simply that if you create these huge models in PLS, you just feel as if you’re Zeus or something like that. You can just make all these cool connections, and you’re like, oh wow, look how complicated this model is. It’s beautiful, but everything you add into the model affects the relationship that something has to another thing. Dustin Smith, who was the researcher on DORA right before me, explained it as a solar system: there’s a bunch of planets, and if you add a planet in there, all the gravities change between them.

And so if you have this huge model, it’s not always obvious what’s making something change its relationship to another thing. For a mere mortal like myself, it’s really complicated. I added security into this model and all of a sudden this relationship between X and Y is completely different kind of thing. So we scoped our analysis down to something like 10 small models this year, so we could really understand what’s going on, and we could think about what’s likely to be a confound, and we could think about what the relationships are in these more nuanced pathways, and not get ourselves confused by spurious relationships that are caused by another quadrant of the model. And what we were mentioning in the report, in the methodology section, about leave-one-out cross-validation: there are a ton of scenarios where, say, we have a model, and we want to add another pathway into the model, or we want to take a pathway out, and we’re going to compare two models that are really similar, technically nested, with one nested in the other.

And we want to say, hey, when I add this pathway, do I get a lift in understanding? Do I have a better understanding of what’s going on when I add this one pathway? One way you could do it is R squared. Other ways are leave-one-out cross-validation, AIC, BIC. These are all approaches to just saying, I think of it as: the variables in your data are like waves, and they’re moving. How much of those waves can you account for? And it’s saying, if you add this parameter to your model, can you account for more fluctuation, more waves, in the data? Is it enough to justify it? Because the scientific endeavor is parsimony, Occam’s razor. You don’t want to add things that don’t need to be added. So that’s really the test we’re doing. We have a model. What happens if we make it a little more complicated? Oh, we didn’t really actually learn anything new. Let’s not make it more complicated than it needs to be. And that’s the process, over and over again.
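A minimal sketch of that nested-model comparison, with plain linear models standing in for the SEM pathways and hypothetical column names:

```python
# pip install statsmodels scikit-learn pandas
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Survey-derived scores, one row per respondent; column names are made up.
df = pd.read_csv("scores.csv")

# Two nested models: does the extra pathway buy real understanding?
base = smf.ols("delivery_perf ~ knowledge_sharing", data=df).fit()
bigger = smf.ols("delivery_perf ~ knowledge_sharing + loose_coupling", data=df).fit()

# Information criteria: lower is better, and both penalize extra parameters,
# which is the parsimony / Occam's razor idea described above.
print(f"AIC: {base.aic:.1f} -> {bigger.aic:.1f}")
print(f"BIC: {base.bic:.1f} -> {bigger.bic:.1f}")

# Leave-one-out cross-validation asks the same question predictively:
# does the added pathway help predict held-out respondents?
y = df["delivery_perf"]
loo = LeaveOneOut()
for cols in (["knowledge_sharing"], ["knowledge_sharing", "loose_coupling"]):
    mse = -cross_val_score(LinearRegression(), df[cols], y, cv=loo,
                           scoring="neg_mean_squared_error").mean()
    print(f"LOO MSE with {cols}: {mse:.3f}")
```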

Abi Noda: A question that is always on my mind, because it’s not something I’m personally clear on: how do you think about effect sizes? I know there are differing perspectives on that, even in behavioral research. I think there are the original Cohen benchmarks. What is your heuristic around effect sizes?

Derek DeBellis: It’s interesting, because I think in the Cohen paper, the one that has all these effect sizes and benchmarks, there’s a really clear point about consulting your literature for what is actually a practically useful effect size. Then Cohen gives you a bunch of different standards. But to an earlier point you raised, with the literature we’re working with, we can’t just go and say, oh, this looks like a pretty good practical effect size for something. It’s just not there. You’re left inventing the wheel a little bit in terms of what an effect size should be. This year, not to open another can of worms, we used a Bayesian framework, which I just like because it’s really flexible, among other things, and it gives you a posterior of what all the possible values could be. Let’s say we want to know the effect of loosely coupled architecture, pretty much how contingent your team is on other teams, the dependencies between teams, on burnout. Let’s say we want to understand that relationship.

We get a beta weight that tells us pretty much the strength of that relationship. There’s a bunch we could look at to understand how well loosely coupled architecture explains burnout. R squared is a good example of that. Mean squared error, root mean squared error, AIC, BIC maybe; no, that wouldn’t be that great. But another thing you could do, in the Bayesian framework, is look at the posterior of what the plausible values of that beta weight are, and ask yourself: how many of these estimates of what the value actually is are in an area that’s essentially equivalent to zero? The region of practical equivalence is what I think some people call it. So we just say, if the effect is between, in this example on a ten-point scale, negative 0.2 and 0.2, or negative 0.1 and 0.1, it’s really not doing anything.

Because we have 3,000 people in the survey, we’ll probably see statistical significance, but if so much of that is in that region, you wouldn’t tell a practitioner to waste their time trying to do this, because the gains they would receive for the amount of work they would have to do are probably small. So we try to look at the effects that are well outside what we’re considering the region of practical equivalence, which we come up with on our own, because we don’t have a lot of guiding lights. It’s just, I think most people would agree that you’re not getting much out of this if it only gives you this much value kind of thing.
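A minimal sketch of that region-of-practical-equivalence check, assuming you already have posterior draws for the beta weight (simulated here with numpy rather than taken from a real sampler like PyMC or Stan):

```python
import numpy as np

# Hypothetical posterior draws for a beta weight, e.g. the effect of loose
# coupling on burnout on a ten-point outcome scale. In practice these would
# come from an MCMC sampler; here they are simulated for illustration.
rng = np.random.default_rng(0)
beta_draws = rng.normal(loc=-0.35, scale=0.15, size=10_000)

# ROPE: effects in [-0.1, 0.1] are treated as essentially zero. The bounds
# are a judgment call, as noted above.
rope = (-0.1, 0.1)
in_rope = np.mean((beta_draws > rope[0]) & (beta_draws < rope[1]))

print(f"posterior mean beta = {beta_draws.mean():.2f}")
print(f"{in_rope:.1%} of the posterior falls inside the ROPE")
# If most of the posterior sits outside the ROPE, the effect is not just
# statistically detectable but arguably worth a practitioner's time.
```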

Abi Noda: Love that technical deep dive. The last question I have on the methodology has to do with producing the benchmarks. This is something I’ve talked about with Nicole before, but the data we’re working with is not really interval or ratio data. It’s not even really ordinal data when it comes to-

Derek DeBellis: Yeah.

Abi Noda: So, I know there’s an art and a science to it, but how do you come up with the benchmarks? And just so listeners have context, for something like change lead time, the survey options are ranges, between once per week and several times a week. So to then turn that into a hard number, how do you deal with that challenge?

Derek DeBellis: Yeah, thankfully we’re not the first ones to have to deal with it. So unlike maybe some things that are very developer-specific, there’s a good literature on how you would work with maybe-ordinal data; definitely not ratio, there’s no ratio here, that’s for sure. How do you work with this type of data in a context where you really want to pretend it’s a number? It’s just not, you know it’s not a number, but you’re trying your best to treat it like such. So I know within the structural equation modeling world, you can pick the estimators that you want to use, and these are pretty much how the data gets optimized. There are some that work way better with this type of data than others. And once you get it through that process, there’s some literature that suggests that the actual factor score, which is that latent score from the multiple indicators, now we’re really diving into the weeds, can be treated as a continuous number. But that still doesn’t get at your question, because change fail rate or something along those lines, deployment time, deployment frequency, those are just survey items.

And my approach has always been: how sensitive is our answer to our method? And by that I simply mean, if I try four different ways of clustering this data together, do I get four wildly different answers? If I do, then it’s probably too up to me to really talk about. You can talk about researcher degrees of freedom. If I feel like the answer depends on what I choose, it’s probably not a good answer. But if it’s not super sensitive to what I choose, so I’ve tried four methods, and for three out of four the answer looks almost identical and the fourth one’s a little aberrant, I feel a lot more comfortable than when all four of those methods look completely different. So that’s my approach.

It’s not perfect. There’s still an underlying problem of the data type and the analysis that you need to work through, but I think at the end of the day, you try your best to make your analysis fit the type of data you have, and where it’s questionable, you try three or four other approaches. And if it doesn’t look like it’s super sensitive to your mood that day, then it’s probably a pretty good answer.
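A minimal sketch of that sensitivity check: cluster the same scores several different ways and see whether the groupings agree. The data here is simulated, and the choice of three clusters is arbitrary:

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Simulated stand-in for per-respondent factor scores on the delivery measures.
rng = np.random.default_rng(1)
scores = np.vstack([rng.normal(m, 0.5, size=(1000, 4)) for m in (-1, 0, 1)])

# Cluster the same data with three different methods.
labels = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores),
    "gmm": GaussianMixture(n_components=3, random_state=0).fit_predict(scores),
    "agglo": AgglomerativeClustering(n_clusters=3).fit_predict(scores),
}

# Compare every pair of partitions. Adjusted Rand near 1.0 means the benchmark
# groups barely depend on the method; near 0 means the "answer" is mostly an
# artifact of the researcher's choice.
for a in labels:
    for b in labels:
        if a < b:
            print(f"{a} vs {b}: ARI = {adjusted_rand_score(labels[a], labels[b]):.2f}")
```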

Abi Noda: Really appreciate that technical explanation. Again, to wrap up, I know you all are preparing for the 2024 survey.

Derek DeBellis: Yeah.

Abi Noda: Anything you want to share with listeners about what to expect, how to participate, et cetera?

Derek DeBellis: Yeah, well, we are working on all of our questions right now, and I’m going out on baby-bonding leave soon, so it’s been quite the rush. It’s exciting. This year we’re focusing on three areas that we just keep hearing tons about. We’re not really changing our outcomes too much, but we’re more interested in spaces people seem to be really curious about. You might not be surprised by the first one: artificial intelligence, how developers are leveraging it, whether it’s having downstream effects, and what developer perceptions are. We’re interested in the workplace environment, so the antecedents to developer experience: how your team functions, the practices and philosophies underlying your team, what it’s like to be where you’re working. And then something we’re a little hesitant to get into, because there are so many diverging connotations about what it means: platform engineering. A lot of people are really interested in that, and it’s emerging.

You talk to one person and you talk to another, and they’ll have completely different definitions of what it is, what it means, and how it relates to them. So that’s why we’re a little hesitant about it: asking a question about something with that wide an array of connotations is just scary. It goes back to what makes a good research question. If your question means a thousand different things to people, it’s maybe not a great question. But we’re working on it anyway, so hopefully we get some good findings for that.

Abi Noda: I love it. I appreciate that about platform engineering, because as you were saying that, I was chuckling. I can’t even give you a definition of DevOps, and I certainly can’t give you a definition of platform engineering, and I know that’s a controversial topic; defining it is controversial in and of itself. Well, looking forward to seeing that survey out and seeing the analysis that you’ll be leading. Derek, this has been one of my favorite conversations, getting really nerdy and scientific about how surveys work and how you do the research at DORA. Thanks so much for your time today. Really appreciate the conversation.

Derek DeBellis: Oh, no, I was thrilled to be part of it. Thanks for having me.

Abi Noda: Thank you so much for listening to this week’s episode. As always, you can find detailed show notes and other content at our website, getDX.com. If you enjoyed this episode, please subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Please also consider rating our show since this helps more listeners discover our podcast. Thanks again and I’ll see you in the next episode.