So you've done it. No more fires.
You're the dog that finally caught the car it spent so long chasing, and now you don't know what to do with it.
You can start to feel like you're missing something important. Or that you're being complacent. Or maybe things are just off.
Good news/bad news. This is where you want to be operating all the time. This is also where the real work begins.
Safety 1 and Safety 2 Thinking
If you're not familiar, this is the work of Erik Hollnagel. At a very high level, Safety 1 is where you're making sure bad things don't happen. You've likely been doing quite a bit of work to get to this point, where you have no active fires. This is a good space to be operating in. To borrow an example from aviation: plane crashes are a bad thing.
Now you can take a step back and think about the system as a whole. What could cause your organization's plane to come crashing down out of the sky? If you're just getting to this point, there are likely a number of things you can do to harden your system, and a number of things that haven't caught fire purely by luck. Run a few pre-mortems to see where your system may fall apart. This is a great exercise to carry out with a team, because they're going to see many things you don't, including places where the entire system can fall apart.
Ask your team what they think needs to be done to prevent fires. Unless you're hands-on in the codebase with them every day, they likely know the answers to this question better than you do. And even if you're getting your hands dirty too, they're likely exposed to different parts of it. They may know about the skeletons in the closet you either never knew about or have forgotten.
On the technical side this could be as simple as more logging, or a better error monitoring system.
On the people side this could be having regular 1:1s to make sure folks are feeling good about their environment. Are they feeling burnt out? Do they have concerns about getting a mission-critical project out on time and on budget?
Start reframing what the team counts as painful, or as a fire that needs to be addressed. Lower the bar. You want to know about more of the trivial issues that could escalate. Maybe the CI pipeline takes a little bit longer than it should. Do you have some flaky tests? Make sure the team understands the needs of the customer. What can you do to reduce the time between product idea and deploy, to allow a more continuous flow of value and learning between customers and the organization?
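If you want a cheap way to surface one of those trivial issues, here is a minimal sketch for spotting flaky tests. It assumes you can export CI results as (test name, commit, passed) records; that record shape, the `find_flaky_tests` name, and the sample data are all made up for illustration and are not tied to any particular CI tool. The idea is simply that a test which both passed and failed on the same commit is a candidate for flakiness.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return sorted names of tests with mixed results on the same commit.

    `runs` is an iterable of (test_name, commit, passed) tuples.
    This record shape is an assumption, not a real CI export format.
    """
    outcomes = defaultdict(set)
    for test_name, commit, passed in runs:
        # Collect every distinct outcome seen for this test on this commit.
        outcomes[(test_name, commit)].add(passed)

    # Same code, both a pass and a fail: likely flaky.
    flaky = {
        name
        for (name, _commit), results in outcomes.items()
        if len(results) == 2
    }
    return sorted(flaky)


# Hypothetical history: test_checkout disagrees with itself on one commit.
history = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),
    ("test_login", "abc123", True),
    ("test_login", "def456", False),
    ("test_search", "def456", True),
]

print(find_flaky_tests(history))  # ['test_checkout']
```

Note that `test_login` is not flagged: it failed on a different commit than the one it passed on, so the code change itself could explain the difference.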
After a while you've minimized the ways the plane can crash. To keep reducing incidents you need to move to Safety 2, which means focusing on what needs to go right to ensure the plane doesn't crash. What are the little things that are going right when your team does their best work? How can you ensure that happens more often?
Maybe your Product Manager provides acceptance criteria that can be tested in a binary way, ensuring the needs of the customer are met. Doing their best work likely means your team has uninterrupted blocks of deep work time. It may also mean your team feels comfortable telling you when things aren't going according to plan, knowing they'll have the organization's support in getting things back on track.
There are many things that must go right to deliver successfully. Find ways to make them happen more often, so your team meets its goals more often too.
Take the time to think about what has to be true to make sure fires don't start again. Entropy is the natural state of affairs. Things will start to break down and fires will start if you don't ensure things go right.
Share what you've done when you've gotten things to a point where nothing is on fire. @brit_broderick on Twitter, or firstname.lastname@example.org