KPIs for Early Stage CTOs

Metrics that you ought to be interested in as an early stage technology leader, for teams, individuals, and the organization as a whole.

There are a few metrics that you ought to be interested in as an early stage technology leader, whether you're in Engineering, Product, or the combination of the two: Delivery (a term borrowed from Jonathan Nolen at LaunchDarkly, courtesy of the Developing Leadership Podcast).

Individual Pulse Metrics

There are some metrics that I find useful for keeping a pulse on individual engineers. But they are only for starting conversations with them and their team, not a stick to beat them with.

These may or may not be available, or they may or may not be the purview of all individual contributors. Use what your organization makes available.

Technical specifications submitted
Technical specification reviews submitted
Commits pushed
Pull requests submitted
Code reviews submitted

None of these will authoritatively tell you whether or not someone is working. They may be working by pairing or mobbing, which may not show up here. Or they may be doing other work you need to make visible to the organization. Finally, they may be gaming these metrics to keep you off their back.

More importantly, none of these will tell you if individuals are working on the most valuable thing possible.

Again, these are all things you use to inform and kick-start visibility conversations with your team in the event that you have concerns about individual performance on a team.

These can be tracked using something like Code Climate's Velocity product, or using your own hand-rolled stat-gathering tool built on something like GitHub's CLI.
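As a rough illustration of the hand-rolled route, here is a minimal sketch that tallies pull requests per author. It assumes you have already saved the output of something like `gh pr list --state all --json author,createdAt`, and that each record carries an `author` object with a `login` field. The function name `prs_per_author` and the sample data are hypothetical.

```python
import json
from collections import Counter


def prs_per_author(gh_json: str) -> Counter:
    """Count pull requests per author login.

    Assumes `gh_json` is a JSON array in the shape produced by
    `gh pr list --json author,createdAt` (an assumption, verify against your CLI version).
    """
    pull_requests = json.loads(gh_json)
    return Counter(pr["author"]["login"] for pr in pull_requests)


# Hypothetical sample data standing in for real CLI output.
sample = json.dumps([
    {"author": {"login": "ada"}, "createdAt": "2024-03-04T10:00:00Z"},
    {"author": {"login": "grace"}, "createdAt": "2024-03-04T11:30:00Z"},
    {"author": {"login": "ada"}, "createdAt": "2024-03-05T09:15:00Z"},
])

for login, count in prs_per_author(sample).most_common():
    print(f"{login}: {count} pull requests")
```

Treat the result as a conversation starter only; a low count says nothing about pairing, mobbing, or other invisible work.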

Metrics are for teams (not individuals!)

There are some things that are interesting about the metrics of individuals, but just like in sports, whether physical or electronic, putting a team of all-stars together often results in worse performance than those all-stars on their normal respective teams.

And remember, high performing teams follow power laws. Team performance doesn’t follow a normal distribution. This means that high performing teams are likely to far outpace the average.

You're not looking for high performers individually, but for how to build them into cohesive teams that hit their stride, complement each other, and push outside the bounds of a normal distribution through continuous learning.

Engineering Metrics

These are metrics unique to an engineering organization, without the influence of the product organization.

DORA Metrics

The DORA Metrics are table stakes when considering only the engineering portion of the organization. This isn't a holistic picture, and I believe that Delivery, rolling up to a single individual responsible for both Product and Engineering, is the role that makes sense. But if you can only control engineering, you start with this.

These measures are more about the capability of the team, although I think it's also important to consider the actuals. What this means concretely I'll discuss in each section.

How each of these is tracked will depend on your organization's toolchain. Unfortunately, there isn't yet a one size fits all way to track this information.

Deployment Frequency

How often a team successfully releases to production.

This is especially useful in the context of how often your team is actually doing work on the code that will be deployed.

For example, if you've got your deployment pipeline set up so you can deploy in a repeatable, automated fashion in less than 10 minutes, but you only actually deploy that code once a month because that's the frequency with which it's necessary to deploy small updates, awesome. Your team is high performing.

But if you have processes and procedures in place ensuring you can only deploy code to production once per month, you have room for significant improvement.

This can be tracked by multiple services including most application monitoring services, your CI/CD service, or posting data during the deploy process to a centralized monitoring service.

Target range is "on-demand", multiple times per day.
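If your deploy process posts a timestamp somewhere central, the actual frequency is a small calculation. The sketch below assumes you can export a list of successful production deploy timestamps; the function name and sample data are hypothetical, and it measures actuals only, not the team's capability to deploy on demand.

```python
from collections import Counter
from datetime import datetime


def deploys_per_day(deploy_times):
    """Map each calendar day that had a deploy to its number of successful deploys."""
    counts = Counter(t.date() for t in deploy_times)
    return dict(sorted(counts.items()))


def average_daily_deploys(deploy_times):
    """Average deploys per day over the observed span, counting quiet days too."""
    if not deploy_times:
        return 0.0
    first = min(deploy_times).date()
    last = max(deploy_times).date()
    span_days = (last - first).days + 1
    return len(deploy_times) / span_days


# Hypothetical deploy log: three deploys on one day, none the next, one the day after.
deploys = [
    datetime(2024, 3, 4, 9, 30),
    datetime(2024, 3, 4, 13, 5),
    datetime(2024, 3, 4, 16, 45),
    datetime(2024, 3, 6, 11, 0),
]

print(deploys_per_day(deploys))
print(f"{average_daily_deploys(deploys):.2f} deploys per day on average")
```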

Lead Time for Changes

The amount of time it takes a commit to get into production.

This includes all steps in your process, human or automated, actively being worked, or idly sitting in a queue. In fact, the times where it is idly sitting in a queue are the most interesting from the perspective of this metric. Those flow blockers are often the most serious inhibitors of your engineering organization's ability to deliver value.

Whether it's queued up for manual validation, waiting for code review, or in a merge train waiting for human approval, all of these queued times are waste that should be considered for elimination.

I have really never encountered a good reason for any change to take more than one hour to get to production.

This can be tracked by tools like Code Climate's Velocity, or combining deploy metrics from CI/CD servers and Git metrics.

Target range is less than one hour.
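As a sketch of combining Git and deploy data, the snippet below assumes you can pair each commit's timestamp with the timestamp of the deploy that shipped it. The pairing itself depends on your toolchain; the names and sample data here are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

TARGET = timedelta(hours=1)  # the target range suggested above


def lead_times(changes):
    """changes: iterable of (committed_at, deployed_at) pairs."""
    return [deployed_at - committed_at for committed_at, deployed_at in changes]


def median_lead_time(changes):
    """Median commit-to-production time, including time spent idle in queues."""
    seconds = median(lt.total_seconds() for lt in lead_times(changes))
    return timedelta(seconds=seconds)


# Hypothetical sample: commit timestamp paired with its production deploy timestamp.
changes = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 9, 20)),
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 10, 45)),
    (datetime(2024, 3, 4, 11, 0), datetime(2024, 3, 4, 14, 0)),
]

over_target = [lt for lt in lead_times(changes) if lt > TARGET]
print("median lead time:", median_lead_time(changes))
print("changes over target:", len(over_target))
```

The changes that blow past the target are the ones worth tracing step by step, since that's where the queue time usually hides.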

Change Failure Rate

How many times a production deploy resulted in a negatively impacted user experience, divided by the number of times the team deployed to production.

You may notice this pushes you to "gamify" the metric, by going for tons of tiny, frequent deploys. This is a feature, not a bug.

If you're deploying constantly, you can more easily pinpoint when you introduce an issue. It forces your engineers to work in smaller batches, which feeds into one of the other metrics: small changes capping at around 500 lines of code. It forces you to have high-quality instrumentation in place, because otherwise, when you're pushing all those changes out, you won't know when you introduce a breaking change.

Finally, it forces you into the healthy habit of extensive use of automated testing because you cannot deploy 30 times a day if each time you have to run through a long manual test cycle in order to ensure the safety of your deploy.

This is one that has to be tracked at least partially manually. You can track much of it in an automated way, depending on whether your deploys automatically roll back when certain metric thresholds are crossed. But it likely also has to be tied back to your issue tracker. You ought to pull the deploy frequency in one of the ways mentioned above, but the count of failures will need to be summed from a combination of issue tracking and automated rollback metrics.

Target range is 0 to 15%.
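The calculation itself is a simple ratio, sketched below. The counts are hypothetical, and how you arrive at the failure count (issue tracker plus automated rollbacks) is the hard part that this does not solve.

```python
def change_failure_rate(failed_deploys: int, total_deploys: int) -> float:
    """Deploys that negatively impacted users, divided by all production deploys."""
    if total_deploys <= 0:
        raise ValueError("total_deploys must be positive")
    if not 0 <= failed_deploys <= total_deploys:
        raise ValueError("failed_deploys must be between 0 and total_deploys")
    return failed_deploys / total_deploys


# Hypothetical month: 60 production deploys, 3 of which hurt the user experience.
rate = change_failure_rate(failed_deploys=3, total_deploys=60)
print(f"change failure rate: {rate:.1%}")
print("within 0 to 15% target:", rate <= 0.15)
```

Note how the denominator rewards frequent deploys: the same 3 failures across only 10 deploys would be 30%.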

Time to Restore Service

How long it takes an organization to recover from a failure in production.

A failure is defined as anything that negatively impacts the user. Notice that the magnitude of the impact isn't relevant here.

Previously known as Mean Time to Recovery (MTTR). This is probably the DORA metric that generates the most controversy, but from my perspective it makes sense. It's less concerned with "Black Swan" style unexpected and unpredictable events; most interestingly, it tells you about your organization's maturity around mean time to discovery and remediation through rollback.

If you've got beef with MTTR, you're likely at either end of the distribution, which can be interesting for informing what you do with this metric.

If your org is mature, you may not need to track this one. If not, you need to get to the place where most of your incidents are discovered quickly, and recovered from quickly, ideally in an automated fashion. If you're not sure which bucket your organization falls into, assume it needs to track this.

Tracking this one is something that I've once again only seen done with a combination of automated and manual processes. Given that you should be able to tie a particular commit to a point in time deploy and/or feature flag toggle, and a resolution to a point in time deploy and/or feature flag toggle, you should be able to gather this data as well, tying it all back to an incident report and change failure.

Target range is less than one hour.
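Once each incident report carries a start and a restore timestamp, the numbers fall out easily. This sketch assumes exactly that pairing exists in your incident records; the function name and sample incidents are hypothetical.

```python
from datetime import datetime, timedelta

TARGET = timedelta(hours=1)  # the target range suggested above


def restore_stats(incidents):
    """incidents: list of (impact_started_at, service_restored_at) pairs.

    Returns the mean time to restore and the share of incidents restored within TARGET.
    """
    durations = [restored - started for started, restored in incidents]
    mean = sum(durations, timedelta()) / len(durations)
    within_target = sum(1 for d in durations if d <= TARGET) / len(durations)
    return mean, within_target


# Hypothetical incidents, each tied to the deploy or flag toggle that started and ended it.
incidents = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 9, 10)),
    (datetime(2024, 3, 11, 14, 0), datetime(2024, 3, 11, 14, 50)),
    (datetime(2024, 3, 20, 16, 0), datetime(2024, 3, 20, 18, 0)),
]

mean, within_target = restore_stats(incidents)
print("mean time to restore:", mean)
print(f"restored within one hour: {within_target:.0%}")
```

Reporting the share within target alongside the mean helps when one long outlier skews the average.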

Reliability

The fifth DORA metric is newer, first released in the 2021 State of DevOps Report. It measures operational performance, and is a measure of modern operational practices. Reliability is the degree to which a team can keep promises and assertions about the software they operate.

Unfortunately at the time of writing, the DORA team doesn't define any concrete benchmarks for reliability, but instead gives guidance about sociotechnical practices that lead to better organizational performance against reliability.

Teams ought to:

Define reliability in terms of user-facing behavior
Employ the Service Level Indicator (SLI)/Service Level Objective (SLO) metrics framework to prioritize work according to error budgets
Use automation to reduce manual work and disruptive alerts
Define protocols and preparedness drills for incident response
Incorporate reliability principles throughout the software delivery lifecycle (“shift left on reliability”)

Along with these 5 bullet points, in the 2021 report they explicitly highlight documentation as a key practice: teams that make documentation and playbook generation a key part of their culture are 2.4 times more likely to meet or exceed reliability targets.

The DORA folks don't list a target range for reliability metrics at this point, or even specific reliability metrics, but this is something that should be gathered automatically as part of telemetry data that you build into the process of developing software, and specific to your organization's needs. Medical device software will have more stringent reliability needs than an attention economy consumer app.
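To make the SLI/SLO and error budget idea concrete, here is a minimal sketch for a request-based availability SLO. The 99.9% target, the request counts, and the function name are all assumptions chosen for illustration, not DORA benchmarks.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based SLO.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "sli": 1 - failed_requests / total_requests if total_requests else 1.0,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }


# Hypothetical 30-day window: 1,000,000 requests, 400 of them failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI: {report['sli']:.4%}")
print(f"error budget consumed: {report['budget_consumed']:.0%}")
```

When the remaining budget approaches zero, that's the signal to prioritize reliability work over new features.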

Patch Sizes

Most patches/pull requests should be under 500 lines of code, based on research put out by SmartBear and Cisco, which mirrors my personal experience.

I like to be very opinionated and strive to keep most of the changes under 200 lines of code, as I've found that is the easiest format for most folks to review.

It's important to note here that it’s lines of code all-in in this case. That includes comments, tests, documentation etc. Keeping it constrained in this way helps to keep reviewers focused and attentive.

Larger PRs tend to be harder to review in the same way, either decreasing the quality of the review, or dramatically increasing the time it takes, and the number of review cycles, as a thorough review of a larger PR means a higher probability of finding mistakes that need correction.

All of this leads to frustration and results in a higher probability of the code not being reviewed and merged on the same day it was created, reducing your ability to practice continuous integration.
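A simple check can surface oversized pull requests before they reach a reviewer. This sketch assumes each pull request record exposes `additions` and `deletions` counts and that their sum is a fair stand-in for "lines of code all-in"; the field names, function name, and sample data are hypothetical.

```python
def oversized_pull_requests(pull_requests, limit=500):
    """Return (number, lines_changed) for each PR at or over `limit` changed lines."""
    flagged = []
    for pr in pull_requests:
        lines_changed = pr["additions"] + pr["deletions"]
        if lines_changed >= limit:
            flagged.append((pr["number"], lines_changed))
    return flagged


# Hypothetical pull requests; counts include tests, comments, and docs.
sample_prs = [
    {"number": 101, "additions": 120, "deletions": 30},
    {"number": 102, "additions": 480, "deletions": 95},
    {"number": 103, "additions": 210, "deletions": 15},
]

print("over 500 lines:", oversized_pull_requests(sample_prs))
print("over 200 lines:", oversized_pull_requests(sample_prs, limit=200))
```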

Defect Rate

This is one that I find useful for existing codebases, because you may deploy to production and then notice a defect that isn't related to the code that you just pushed.

I advocate not counting that as a change fail if it isn't introduced because of the newly deployed code.

If the newly deployed code uncovers an issue that was previously hidden, or changes the performance characteristics of the system in a negative way, that is a change fail.

Otherwise, it's a historical defect. Change fails are defects, but not all defects are change fails.

The defect rate is measured by the number of defects uncovered, whether in the form of change fails, or legacy defects, divided by the number of commits.

This is one that I have only been able to track via the issue tracker, as that's where these things surface and are recorded. I think they're still valuable to retro on, and do an incident report for, using your current software development process as a foil. See if it would be caught with the changes that have been made since its inception.

The target defect rate would be 0-15%, which ties in with the change fail rate, but could likely be padded slightly in the context of a legacy codebase.
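The distinction between change fails and historical defects can be kept visible in the calculation. The counts below are hypothetical and would come from your issue tracker.

```python
def defect_rate(change_fails: int, legacy_defects: int, commits: int) -> float:
    """All defects uncovered (change fails plus legacy defects) divided by commits."""
    if commits <= 0:
        raise ValueError("commits must be positive")
    return (change_fails + legacy_defects) / commits


# Hypothetical quarter: 4 change fails, 6 historical defects surfaced, 200 commits.
quarter_rate = defect_rate(change_fails=4, legacy_defects=6, commits=200)
print(f"defect rate: {quarter_rate:.1%}")
```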

Product Metrics

These are the metrics that are specific to your domain and line of business. These could be:

Conversion rate
Usage of specific functionality in your application
Customer acquisition cost (CAC)
Customer Satisfaction metrics - these are usually lagging; if we can turn them into leading indicators, that's always preferable.

These are actually the most important metrics that your teams and organizations are responsible for. Everything else is in support of this.

You want good engineering and delivery metrics in order to ensure the Product needs are being met. Without these dialed in, the greatest delivery team and process doesn't matter. But a solid process will enable you to do this better than your competition.

A process leading to faster iterations is critical. Even if you aren't always doing the right thing, because you're at bat more often you'll likely end up outmaneuvering and out performing your competition.

Delivery Metrics

Delivery metrics are those that are best suited to being measured in a way that crosscuts product and engineering as separate disciplines, if it's within your power to make that call.

Lead Time

Time from when an idea begins being worked to when it's in users' hands.

Being worked is defined as the point when Product takes the idea and begins discovery work, to validate it as being worth implementing or not.

A second lead time that may be valuable would be the delivery lead time, post discovery. After Product has validated that you do in fact want to build some piece of functionality, how long does it take to design that functionality and get it into production?

The best way to reduce your lead time is to avoid queues and handoffs, for example a backlog where you park ideas that Product has done design work for while you wait for a dev team to have free capacity. (See the following sections on Queue Times and WIP Limits.)

Queue Times

If in either the Product or Engineering side of delivery, pieces of work are idle in a queue, spend time analyzing that.

These are going to be the single biggest points of failure to deliver in your organization, and are going to be the things that build up a false sense of progress and organizational capacity.

Things that are "started" by Product, and then queued up for when engineering finally gets around to having bandwidth are one of the leading drivers of organizational mistrust, and a lack of sense of progress.

This will also expose where your process needs improvement.

If you have a heavy manual QA process you could have changes that are code complete and unable to ship because of queue time there. Either because they aren't at the front of the line, or because the entire queue is blocked by a change that introduced an issue.

The best way to reduce queue times is to introduce work in progress (WIP) limits, moving you closer to a single piece flow.
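If your tracker records when a work item enters each state, you can measure how much of its lead time was spent waiting. This is a minimal sketch under that assumption; the state names, the set of states treated as queues, and the sample item are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical states in which nobody is actively working the item.
QUEUE_STATES = {"ready for dev", "waiting for QA"}


def queue_time(transitions):
    """transitions: time-ordered list of (state, entered_at); the last entry is the final state.

    Returns the total time the item spent in QUEUE_STATES.
    """
    waiting = timedelta()
    for (state, entered_at), (_, left_at) in zip(transitions, transitions[1:]):
        if state in QUEUE_STATES:
            waiting += left_at - entered_at
    return waiting


# Hypothetical work item moving from discovery to production.
item = [
    ("in discovery", datetime(2024, 3, 4, 9, 0)),
    ("ready for dev", datetime(2024, 3, 4, 17, 0)),
    ("in development", datetime(2024, 3, 5, 17, 0)),
    ("waiting for QA", datetime(2024, 3, 6, 1, 0)),
    ("in production", datetime(2024, 3, 7, 1, 0)),
]

total = item[-1][1] - item[0][1]
waiting = queue_time(item)
print(f"lead time: {total}, of which queued: {waiting} ({waiting / total:.0%})")
```

An item that spends most of its lead time queued points at a process problem, not at the people doing the work.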

Work in Progress Limits

Work in progress (WIP) limits are critical to maintaining a flow in a team.

You will always have more work than your team can possibly handle. (And if not, you've likely got way too many folks on the team. See Charity Majors's article Every Achievement Has A Denominator.)

Given that, you need to prioritize.

We learn from queuing theory that the more work we try to do in parallel, as individuals or as teams, the longer it takes to get anything and everything done, and the more time that work spends waiting in queues, which means by definition it's not delivering value for the user.
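One well-known queuing theory result that illustrates this is Little's Law: in a stable system, average lead time equals average work in progress divided by average throughput. Little's Law is my choice of illustration here, and the throughput and WIP numbers below are hypothetical.

```python
def average_lead_time(work_in_progress: float, throughput_per_week: float) -> float:
    """Little's Law: average time in system (weeks) = average WIP / average throughput."""
    if throughput_per_week <= 0:
        raise ValueError("throughput_per_week must be positive")
    return work_in_progress / throughput_per_week


# Hypothetical team that finishes 5 items per week regardless of how many are started.
THROUGHPUT = 5
for wip in (5, 10, 30):
    weeks = average_lead_time(wip, THROUGHPUT)
    print(f"WIP {wip:>2}: average lead time {weeks:.0f} week(s)")
```

With throughput fixed, starting more work in parallel only stretches how long each item waits, which is the case for a WIP limit in one line of arithmetic.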

Kanban as a system isn't just about the silly board. The board is a useful tool for visualization, but the primary value you can derive from it in software engineering is the fact that it allows a demand based system.

The team only pulls new additional work as it has capacity to work on it. This gives you real, hard feedback on how much your team can be doing at any given moment. Either you've got sufficient capacity to meet the needs of the organization or you don't.

This however, can be incredibly difficult to stick to.

An undisciplined organization, or one where executives outside technology think they can get more done by just making folks work more, will immediately cave at the first sign of organizational pressure.

You'll begin a vicious cycle of taking on more work than you have capacity for and fail to show progress or deliver in a timely manner.

This breaks trust due to broken expectations and creates animosity between "the business that doesn't understand what it takes to make great technology" and "those engineers that are never accountable and can never make their deadlines".

Inventory Levels

Just because single piece flow is ideal doesn't mean that you should be running a zero inventory, just-in-time flow. You may have the team composition to support that, but you may not.

Your organization likely has different cycle times for each step. For example, discovery work will often take quite a bit longer than much of the engineering work to implement a simple feature. Or the new feature may require significant work, so you probably want some level of inventory built up. How much depends on your organization, but if you're like most organizations, you've likely been erring on the side of far too much rather than not enough.

If you don't retain some inventory, that can result in a bad time for folks downstream in engineering, who will be unable to pull work.

You can see an example of this in the way a number of auto manufacturers cargo-culted the Toyota just-in-time manufacturing flow, which led them to carry almost no inventory and caused them problems with the onset of COVID.

Conclusion

There's a lot to consider as an early stage CTO. This isn't going to be all of it, but hopefully this gets you started on the right foot.

And remember, high performing teams follow power laws. You're not looking for high performers individually, but how to build them into cohesive teams that hit their stride, and push well outside the bounds of a normal distribution.

If you feel like there are some key KPIs I'm missing here, reach out. I'd love to discuss. You can reach me at brittonbroderick@gmail.com.