Search “engineering metrics” on Google and you’ll find countless vendors offering metrics dashboards built on data from tools like GitHub, Jira, and other developer tools. While these solutions may seem compelling on the surface, a growing number of leaders are opening up about their actual experiences using these types of tools.
Laura Tacho has been an engineering director and VP at several high-growth companies including Codeship, CloudBees, Aula, and Nova Credit. While at CloudBees, she led the roll-out of a Git metrics tool in the hopes of diagnosing and improving bottlenecks across her organization. In this article, Laura shares her experience in detail: her motivations for rolling out the product, the challenges she ran into, and the lessons she learned from the process.
Now, let’s hear from Laura.
Editor’s note: Laura’s story was originally published in The Pragmatic Engineer. Read the full story as well as how Laura believes leaders should measure developer productivity today in the article Measuring software engineering productivity.
When I was a Senior Director of Engineering at CloudBees several years ago, I helped roll out GitPrime (now Pluralsight Flow) to 300+ developers. Like so many of you, I sought quantitative data to help me diagnose bottlenecks and free up the flow of work. On top of that, I had pressure from my senior leadership team to quantify performance and report on it. So, I worked with a few other senior leaders to pilot and eventually roll out GitPrime to our engineering organization.
A spoiler: it did not go as I’d hoped. I made three mistakes: I had a blind spot when it came to the importance of qualitative data, I thought activity data would easily showcase areas for improvement, and I let built-in definitions of productivity and performance overshadow the specific needs of my team.
Here’s what I’d hoped would happen: using quantitative data would finally allow us to objectively determine how we were performing. It would be easy to blow past certain topics in retrospectives as we’d have data to show where the real problems lay. I’d share tidy reports on performance, and I could use the data to defend my budget and strategic investments in tooling. I’d finally be able to see performance issues before they became bigger problems.
This wish list is what you’ll find on most marketing websites for developer productivity tools. They promise to identify bottlenecks and help your team before problems grow. Keep product delivery on track. Get valuable insights you simply can’t get any other way. And so on.
What really happened? My team felt like I didn’t trust them, and that this data was going to be weaponized against them if they had a slow week. Other managers were curious, but lost interest after a while, when it became clear the insights about bottlenecks simply weren’t obvious, even after numerous helpful sessions with our account partner. Our heavy involvement in open-source projects pushed the boundaries of the tool’s UI and made workflow insights almost unusable, since not everything was self-contained within our own company. No one patted me on the back and congratulated me for finding the key to productivity and performance.
During this time, the most useful insights into our development workflows – and specifically, what we could do to reduce friction and ship more code – came from 1-1s and retrospectives. My team knew where the pain was; they experienced it every day.
At a team onsite, one of the engineers said the single biggest thing holding us back was the time it took to review PRs. So, we made it a focus for the next sprint. Suddenly, friction evaporated.
He was right: with so many single-threaded projects, it was tough to get a code review because context switching is so expensive. A dashboard will give you stats about PR turnaround time. But I didn’t need a tool for this; my team already knew it. And this was mistake #1, overvaluing quantitative data. It was really the qualitative data that steered us toward making the most impactful changes.
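For what it’s worth, a stat like PR review turnaround doesn’t require a vendor dashboard at all. A minimal sketch, using hypothetical PR records with made-up field names standing in for what a GitHub API export might contain:

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records: when each PR was opened and when it
# received its first review (stand-ins for GitHub API data).
prs = [
    {"opened": datetime(2023, 5, 1, 9, 0), "first_review": datetime(2023, 5, 3, 15, 0)},
    {"opened": datetime(2023, 5, 2, 10, 0), "first_review": datetime(2023, 5, 2, 16, 30)},
    {"opened": datetime(2023, 5, 4, 11, 0), "first_review": datetime(2023, 5, 8, 9, 0)},
]

def review_turnaround_hours(prs):
    """Median hours from a PR being opened to its first review."""
    waits = [
        (pr["first_review"] - pr["opened"]).total_seconds() / 3600
        for pr in prs
    ]
    return median(waits)
```

The median (rather than the mean) keeps one badly stalled PR from dominating the number, which matters when the point is spotting typical friction rather than outliers.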
My teams were already practicing continuous delivery, and we were proud of the investments we’d made into our automated testing, build tooling, alerting, and monitoring. We were deploying on-demand, sometimes as often as 20 times a day, with a very high quality of service. My hope that the data from our build systems could easily pinpoint bottlenecks was completely misguided. We’d already addressed the biggest bottlenecks. Sure, there was still potential for optimization, but nothing obvious.
In the absence of obvious bottlenecks to fix, I fell into the trap of relying on other metrics in the tool: either things I hadn’t thought to measure yet, or proprietary calculated metrics like “impact,” to give me signals about my team’s performance. But in doing so, I let a group of people who knew nothing about my team’s goals make a judgment call about how these developers should work. I felt bad that our “weekly impact” stat wasn’t going up week after week.
Some of the out-of-the-box engineering metrics directly contradicted our culture. We spent a lot of time prototyping and used feature flags heavily. This meant our code churn was high, which negatively influenced some metrics like “impact.” And since we started staffing projects with 2 or more engineers whenever possible, it was true – and intentional – that the same groups of people were reviewing each other’s PRs. That often showed up as a red flag in these tools.
Knowing what your metrics incentivize is so important, and I learned the hard way not to let someone else make that decision for me. I was caught in a trap: DORA metrics had shown very clear trends and patterns in high-performing teams, so I assumed all other parts of productivity and performance could be similarly quantified and standardized.
“When a measure becomes a target, it ceases to be a good measure” is Goodhart’s Law, and I hadn’t fully wrapped my head around how this would manifest in my teams. Did I want to penalize a staff engineer for mentoring other engineers, because her commit count was lower on days when she’d spent a lot of time helping others? This wasn’t the type of team I wanted to lead, and honestly not the kind of team I’d expect any of my team members to want to be a part of, either.
Still, for so many of us, coming up with a dashboard of KPIs and developer productivity metrics is not optional. Whether you’re just starting to implement performance metrics or already reporting on them, here is what I’ve learned so you can avoid my missteps:
Don’t let tools make decisions about what’s important for your team. Some metrics in productivity tools are calculated in ways that are very difficult to defend. For example, Pluralsight Flow decides whether a PR was thoroughly reviewed based on the length of the review comments. Of course, you don’t have to look at this metric. But for teams that pair-program or invest a lot of time in group architectural planning, it’s quite a stretch to tell them their PRs aren’t being reviewed thoroughly just because there isn’t enough back and forth in the comments.
Use metrics to measure system efficiency, not individual performance. Metrics like DORA were never intended to measure individual performance. If your organization wields DORA metrics at the individual level – counting deploys per engineer, for example – the metrics are being misused. This likely erodes your team’s trust in leadership, and it incentivizes the wrong things, pushing teams to adapt how they work to chase those incentives. Rolling out a tool like Pluralsight Flow, Code Climate Velocity, or similar in order to get an objective view of individual performance is deeply misguided.
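One way to keep the system-level framing honest is to make sure the aggregation itself never attributes work to individuals. A minimal sketch, assuming a hypothetical team-wide deploy log that records only dates, with no per-engineer attribution to misuse:

```python
from collections import Counter
from datetime import date

# Hypothetical deploy log: one entry per production deploy,
# recorded for the team as a whole (no author field at all).
deploys = [
    date(2023, 5, 1), date(2023, 5, 1), date(2023, 5, 2),
    date(2023, 5, 8), date(2023, 5, 10), date(2023, 5, 10),
    date(2023, 5, 10), date(2023, 5, 15),
]

def deploys_per_week(deploys):
    """Team-wide deployment frequency, keyed by (ISO year, ISO week)."""
    return dict(Counter(d.isocalendar()[:2] for d in deploys))
```

Because the log never captures who triggered each deploy, the resulting deployment-frequency trend can only describe the system, which is the level DORA intended.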
If there is one thing to learn about performance metrics, it’s this: development teams want to be consulted about their productivity and performance, not just informed of it. They are highly motivated to improve the efficiency of the systems they work with, because they feel the pain of inefficiencies and friction every day. Include them in this conversation, both by aligning with them on definitions of productivity, and also by listening when they share perspectives on what should be improved.
Thanks to Laura for sharing her experiences. There are many solutions on the market today which provide Git metrics and advertise themselves as effective ways to measure and improve productivity. Laura’s story is helpful for leaders who are considering these tools for their own organizations. To hear more from Laura, follow her on Twitter and check out her Maven course, Measuring Development Team Performance, which guides leaders on how to adopt engineering metrics.