Chaos Engineering with the Vacation Simulator

Increase resilience through team-level chaos engineering using simulated vacations.

Recently I was inspired by Daniel Lebrero's (@DanLebrero) article "CTO day 7: Lucky Lotto, chaos engineering but for teams". After sharing it with the team we decided to implement it, but I found that even after sharing and discussing the article, a few questions still needed clarification. Answering them is the purpose of this post.

What is this?

Lucky Lotto was adopted as a way to increase the resilience of engineering teams and reduce bus-factor risk. It is a form of chaos engineering that forces the equivalent of paid time off, without making anyone suffer through the team unnecessarily interrupting their actual vacation time.

Within our team I've opted to rename the concept Vacation Simulator. This better communicates the intention, both to the person selected and to the rest of the team. As with the Lucky Lotto presented by Dan, our Vacation Simulator can select the same individual multiple times in a row.

Why do this?

So that you, or folks on your team, can actually take vacations and totally disconnect. Or, if that hasn't been problematic for you, consider it a systematic forcing function to remove single points of failure in the least painful way possible.

Instructions for the Team

Every Monday an individual will be selected to simulate being on vacation. You're to treat them as if they're on vacation. You should do everything in your power not to ping them unless it is an emergency. If you must ping them, write it down and share it during retro. It's also likely worth announcing to the team immediately that you had to bother them, so that the whole team can learn about the gap and address it in real time.

If you want to chat with the vacationer socially, such as about how their weekend was, or want to conduct regularly scheduled meetings outside the critical path for delivery, such as a 1:1, that is still recommended.

It's announced on Monday explicitly so that the individual selected and their team do not have time to prepare. Hopefully you will have the luxury of time to prepare in the event of an actual vacation, but you wouldn't necessarily have that luxury in the event of another mission-critical situation, such as sickness or family issues.
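The Monday draw itself needs very little tooling. Here is a minimal sketch in Python, assuming the team is just a list of names; the function name and the example names are hypothetical and not from Dan's article. The one property worth preserving is that the previous winner is not excluded, so the same person can be picked several weeks in a row.

```python
import random


def pick_vacationer(team, rng=None):
    """Pick this week's vacationer uniformly at random.

    Last week's winner is deliberately NOT excluded, so the same
    person can be selected multiple times in a row.
    """
    members = list(team)
    if not members:
        raise ValueError("team must not be empty")
    if rng is None:
        rng = random.Random()  # unseeded: a different draw each run
    return rng.choice(members)


if __name__ == "__main__":
    # Hypothetical team; run this on Monday morning and announce the result.
    team = ["Ana", "Ben", "Chi", "Dee"]
    print("This week's vacationer:", pick_vacationer(team))
```

Passing a seeded `random.Random` makes the draw reproducible, which is only useful for testing; for the real thing, leave it unseeded so nobody can predict the result and prepare in advance.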

Instructions for the Vacationer

Congratulations! You've won a simulated vacation, to ensure that your actual vacations go smoothly! You can attend social meetings, lunch and learn, or your weekly 1:1, but you cannot contribute your expertise to non-social meetings, such as a planning meeting. Unlike an actual vacation you are still expected to work.

So what should I be doing instead?

Form a plan, and alert your manager that you'll be working on things the Product Manager may not be prioritizing but that you or the rest of the team feel are important, such as:

  • Fix up a particularly problematic area of the codebase, with high levels of churn
  • Pair on an unfamiliar area of the codebase
  • Help others complete their projects, allowing them to level up as well
  • Work on a conference talk or blog post
  • Improve ergonomics or documentation, allowing for a better experience for others on the team
  • Research new technology that may address a need that is coming down the pipe for the organization
  • Take a course or work through a book
  • See if there is something outside your department that the company could benefit from and that you can complete within a week

If you don't have any ideas, it's recommended to ask the team. I've found most individuals have a list of things they'd love to get to one day, and would likely be thrilled to have you address them. You may even consolidate multiple individuals' private lists into a list that anyone on the team can grab from or use to recall when their time comes.

If you must do something critical, or are brought something critical from outside your team, note it to discuss during retro, and attempt to grab another member of the team to pair on the issue, addressing it while simultaneously transferring knowledge. If it was brought directly to you, note that as well: it is a good prompt to revisit your team's "APIs", the channels the rest of the organization uses to communicate with the team.

Takeaways

So far it's been a great opportunity for folks to take the time to introspect about their work, learn something new technically or about the domain, and for the team to discover painful single points of failure. This last one was particularly important, as it moved us to a realization: when an individual is currently a single point of failure on the team from a technical or domain perspective, one powerful tool is to use the following week's Vacation Simulator to have the vacationer pair with the previous vacationer in order to address the issue better and more quickly. It also gives you some excellent concrete issues to discuss and address during retro to continually improve as a team.

All in all, I'd recommend the Vacation Simulator as a tool in the belt of any line manager looking to increase the resiliency of their teams.

If you give it a shot, shoot me an email or a Tweet. I'd love to hear about your experience with it.