How to Objectively Measure Software Developer Productivity

Summary:

How productive is your software engineering team? This article considers a new method for measuring developer productivity.

Tens of thousands of words have been written on the impossibility of measuring developer productivity using quantitative means (i.e., using numbers rather than subjective judgment).

In this post, I want to walk through a thought experiment: if we had no choice but to measure developer productivity, what would be the best way to do it?

But before we dive into this challenge, let’s review the reasons why measuring developer output quantitatively is generally considered a fool’s errand.

Looking to hire the best remote talent? See how Arc can help you:

⚡️ Find the world’s top developers, designers, and marketers
⚡️ Hire 4x faster with fully vetted candidates

⚡️ Save up to 58% with global hires

Hire top talent with Arc risk-free →

Difficulties With Measuring Developer Output

Comparing lines of code written

One of the most commonly suggested quantitative metrics is ‘lines of code’. This metric treats lines of code like widgets on a production line: the more produced, the better. Developers who write more lines of code would be considered more productive and impactful than developers who write fewer.

However, there are some real problems with this. What if Developer A writes 2x more code than Developer B, but the code introduces bugs into the system? Or it’s not scalable, overlooks edge cases, or is difficult to read? Even if we assume both developers write code with equal skill, a more verbose solution isn’t always better. Sometimes shorter solutions are more elegant, with less repetition in the code.

Therefore, it should be clear that merely comparing lines of code written isn’t a good way to measure developer productivity or impact.

Read More: Go Beyond the Whiteboard & Coding Skills to Find Talented Developers

Comparing the number of commits

Similar to lines of code, some have suggested that we could measure developer productivity by comparing the number of commits they push to version control. However, all the same criticisms of the ‘lines of code’ method still apply.

What if the commits are bad? What if the code being merged contains bugs, or is poorly thought through? What if it’s brittle? A developer who writes more code is not necessarily writing better code.

Comparing the number of features shipped

The biggest challenge with this quantitative metric is that not all features are created equal. Features vary greatly in size and complexity.

Some features might appear straightforward to implement at first glance but are later revealed to be more complex than anyone thought. Therefore, it’s impossible to make an apples-to-apples comparison between them.

So where does this leave us?

It’s clear that there are major problems with the most commonly suggested ways to measure developer impact objectively. As a result, it’s become slightly taboo to talk about the topic. Martin Fowler has said that this is an area where we need to “admit our ignorance”. The battle is lost.

Or is it?

Read More: Freelance Web Developer vs Dev Agency: Which is Best for My Project?

What We Do Now Isn’t Perfect, Either

Yet despite this, developers compare productivity and impact all the time: subjectively. And managers do the same every time performance reviews roll around.

If you’ve ever worked on a team of developers, you most likely have a rough idea of the productivity and skill level of those around you. You have a sense of who’s most skilled at certain technologies, who is most familiar with which part of the codebase, and who you can call on when a core service goes down at 3 am.

We are making dozens, hundreds, and possibly thousands of subjective judgments about the productivity and impact of our colleagues all the time. But most of the time, we’re barely conscious of this process.

In addition to tenure, these subjective judgments are often the basis of promotion in dev teams. They’re the means by which people move from junior to mid-level to senior, and eventually become tech leads or managers.

You might instinctively have a feel for who’s the ‘MVP’ on your team, who are the capable role-players, and who are the weaker links. You are likely to be basing this ‘gut judgment’ on the following indicators:

  • Does their work need lots of feedback and revisions during code review?
  • Is their work frequently getting knocked back in QA?
  • Do they fly through work, or constantly get bogged down?
  • Do they tackle a mix of tough and easier work, or always end up cherry-picking easy cards?
  • Do they regularly deliver working, robust features into production?
  • Are they the person you call when nobody else can figure out a problem?
  • Do you like pair programming with them, or loathe it?
  • Do you look forward to (or dread) making modifications to their code?
  • Do they make everyone else on the team better?
  • (And dozens of other signals.)

But aren’t there problems with this vague and subjective system, too? For example, a developer who tends to ‘golf’ their code, using one-letter variable names and highly truncated, difficult-to-read expressions, might be admired by a tech lead who favors the same style, even if the rest of the team struggles to read and work with the end product. Or, a simpler example: managers often prize the developers they get along with, regardless of actual productivity.

I could provide many more examples, but it’s clear that our current subjective methods of evaluating developers have problems. Not to mention possible biases we might have when evaluating people who don’t fit the stereotype of a ‘good developer’.

Given that neither the objective nor subjective measurement approach is perfect, let’s return to the thought experiment I mentioned at the beginning of this article.

If we had to objectively measure developer productivity and impact, how would we do it?

Read More: 11 Data-Backed Reasons to Work With Remote Software Developers

You can also try Arc, your shortcut to the world’s best remote talent:

⚡️ Access 350,000 top developers, designers, and marketers
⚡️ Vetted and ready to interview
⚡️ Freelance or full-time

Try Arc and hire top talent now →

Start By Assuming That Your Developers, On Average, Write Good Code

So many of the criticisms of objectively measuring productivity are based on the risk that the code is bad. And yet, we have extensive hiring processes and perform code reviews to ensure that the code that we ship is, in general, actually pretty good. What if we allowed this to be a baseline assumption of our measurements? Not that code will always be good, but that it will, on average, be quite good.

But even with this assumption, we’re not out of the woods yet. We can’t simply measure lines of code and the number of commits because the volume of code does not necessarily represent more value delivered to users.

In that case, let’s also make another key assumption: the things your team is working on are, on average, important and valuable. And on average, the features you ship will deliver real value to customers.

If we assume the code is good, and that the team (on average) creates value for customers, then we need a way to objectively measure the amount of value delivered to customers. I think a promising option is measuring velocity.

Velocity

Velocity, an Agile project management term, is a metric for the amount of work completed in a Sprint, measured in the story points assigned during estimation. For example: if your team completes four cards during a Sprint, each worth five points, your team’s velocity for that Sprint is 20 points. If your team is able to maintain this level of productivity, you can expect to complete roughly 20 points worth of work per Sprint.
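As a minimal sketch of that arithmetic, here is the example Sprint above expressed in code (the card data is hypothetical):

```python
# Hypothetical Sprint: story-point estimates of the four completed cards.
completed_cards = [5, 5, 5, 5]

# Velocity is simply the sum of the points completed in the Sprint.
velocity = sum(completed_cards)
print(velocity)  # 20, matching the example above
```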

In other words, velocity is a numerical representation of the amount of work completed. Work items are scored based on the size, complexity, and number of unknowns involved in completing them.

If you’re writing good code and working on features that matter, per-developer velocity may be the closest thing we’ve got to an objective indicator of productivity and impact.

But velocity is unreliable.

If Developer A churns through 15 points worth of stories in an iteration and Developer B churns through 10 points, does that mean Developer A is more productive?

Not necessarily. All kinds of things can skew velocity in a single iteration. We’ll call them random factors: blockers, vacations, bureaucracy, poor estimation, putting out fires, server problems… the list goes on.

But each iteration’s velocity is just one data point, and a single data point isn’t much use on its own. Combine many data points, however, and these random factors begin to average out. Over a long enough time period, all team members are equally likely to be hit by these obstacles.

Over time, then, the effect of random factors is controlled for statistically. Note that this also applies to problems with estimation. Most teams estimate with a fairly large margin of error, but over time, those bad estimates should affect everyone on the team equally.

Let’s say that you measure velocity for two-week Sprints over six months. There are 13 data points (Sprints) in this time period. What if, at the end of these 13 Sprints, Developer A had an average velocity of 10.5 points, while Developer B had an average velocity of 15.7 points?

Given our previously stated assumptions (i.e. they both, on average, write good code, and generally work on things that add value), then perhaps we can tentatively say that, in this time period, Developer B was more productive and impactful than Developer A.
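The averaging step can be sketched like this; the per-Sprint figures are made up, chosen only so they land on the averages used in the example above:

```python
from statistics import mean

# Made-up data: points completed per two-week Sprint, over 13 Sprints.
velocities = {
    "Developer A": [10, 12, 8, 11, 9, 13, 10, 11, 12, 9, 10, 11, 10],
    "Developer B": [15, 16, 14, 17, 15, 16, 15, 18, 14, 16, 17, 15, 16],
}

# Average velocity per developer over the whole measurement period.
for dev, points in velocities.items():
    print(f"{dev}: average velocity {mean(points):.1f} points/Sprint")
# Developer A: average velocity 10.5 points/Sprint
# Developer B: average velocity 15.7 points/Sprint
```

Note that it is the long-run average, not any single Sprint's figure, that carries the signal.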

There are, of course, a few caveats worth mentioning. These are outlined below.

Read More: How to Get Developers Interested in Your Project and to Join Your Team

Systematically bad estimation

Different from bad estimation in general, systematically bad estimation happens when your team tends to estimate certain types of stories more poorly than others (e.g., complex stories, or stories with lots of unknowns). If this affects everyone equally, then it should average out.

However, this presents a problem when trying to measure productivity over time if certain people on your team tend to work on these poorly estimated stories more than others. If this is the case, you should work to correct this systemic issue before trying to measure developer productivity.

‘Grey work’

In almost every development team, there is work that is important, but usually not estimated and/or not measured in the team’s velocity. In general, this is work that helps the team: things like planning, prioritizing, reporting, information sharing, team meetings, 1-on-1s, and brown bag sessions. In many teams, some developers take on a larger share of this work than others. They’re likely to be disadvantaged when measuring velocity if this work isn’t being estimated.

The solution is conceptually simple, but not always easy to implement: track everything. For example, if you regularly put one developer on server maintenance tasks that aren’t tracked because it’s “just maintenance”, the developer is going to be penalized for it. Track and estimate everyone’s work in your team’s iterations. Don’t allow developers to do ‘grey work’ unnoticed.

Vacations and leave

A developer who takes two weeks of vacation during the measurement period will be at a significant disadvantage compared to a developer who didn’t take leave during this time. You can somewhat mitigate the impact of this by filling in the missing data: for example, calculate an average per-day velocity from the developer’s available data, and assign it to each day they were on leave.
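One way to implement that mitigation, as a sketch (the `impute_leave` helper and the daily figures are hypothetical):

```python
from statistics import mean

def impute_leave(daily_points, leave_days):
    """Adjust a period's points for days spent on leave.

    daily_points: points the developer completed on each day they were present.
    leave_days: number of working days they were on leave.
    """
    per_day = mean(daily_points)  # average per-day velocity from available data
    return sum(daily_points) + per_day * leave_days

# Eight working days of data, two days of leave:
adjusted = impute_leave([2, 3, 1, 2, 2, 3, 2, 1], 2)
print(adjusted)  # 20.0: 16 actual points plus 2 imputed days at 2.0 points/day
```

Whether imputation like this is fair depends on context; the alternative is simply to exclude leave days from the averaging window.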

Read More: Hiring Developers Online: Freelance vs Recruitment Agencies vs In-House

Wrapping Up The Thought Experiment

To summarize what we’ve covered so far, if your team can meet the following assumptions, you may be able to track and use quantitative indicators of productivity.

  • Code is, on average, good.
  • Features are, on average, valuable to users.
  • Estimation error is not systematic. Teams can be randomly bad at estimation, but not especially bad at one type of story that one or more developers disproportionately tend to work on.
  • All work is tracked and accounted for, including work that is important but doesn’t produce a clear deliverable.
  • You collect data over a long enough period of time (e.g., six months) to produce an average. Over this time, random mishaps, blockers, and priority changes should, roughly, affect everyone equally, controlling the impact.
  • You have some way of accounting for missing data (e.g., vacations, illness).

If you have all of these things in place, then you may be able to semi-objectively measure developer productivity and impact over time.

Whatever your findings, remember that they are likely to be highly context-dependent. A developer who is a top performer on one team may flounder when moved to a different one. So much of our work in teams depends on fit, on whether current goals match our skill set, and on team dynamics. Productivity and impact are context-dependent, not static features of the person.

This method isn’t perfect, but neither are subjective evaluations. The best system might be some combination of the two, where quantitative indicators are used to temper subjective evaluations, and subjective evaluations are compared against quantitative indicators.

If both are in alignment, then that’s a promising outcome. If they aren’t, you can fall back on subjective indicators; after all, that’s no worse than how we’ve been measuring developer productivity for the last few decades.

Read More: How to Conduct a Remote Technical Interview Successfully: Tools & Tips

Remaining Problems

This method is still flawed. For example, what if the way a developer tackles a small one-point story ends up saving the rest of the development team dozens or hundreds of hours of work? How do you quantify the impact of a developer who discovers a security breach that everyone else missed?

The outsized value generated simply won’t show up in quantitative measures. I (and you) can likely think of many other examples that would cast doubt on the usefulness of quantitative indicators. It’s unlikely that there is any perfect, objective way to measure developer productivity and impact.

And yet, I’d argue that our current subjective method has just as many potential counter-examples and pitfalls. Despite this, we are not nearly as critical of subjective measures as we are of quantitative measures.

The answer may be to use both subjective and quantitative measures together and make judgments based on both inputs. They are likely to be more accurate than judgments made based on one type of data.

Read More: Freelance vs In-House Developers: Pros & Cons of Hiring or Contracting

Over to You

Is your team currently measuring developer productivity, and if so, how? Could this more ‘objective’ method work in your team? If not, why not?

We’d love to hear from you in the comments, and thanks for reading!


Written by
Arc Team