Everyone knows the rituals. From sprint planning to grooming, we try to make sure a sprint runs as effectively and efficiently as possible. This lets project stakeholders plan for feature releases and ensures that the engineering team is being utilized to its full potential. The sprint has started and everything is going to plan. And then a bug ticket comes in. And then another. An engineer is pulled away from a sprint task to help evaluate the problem, discovers that the bug is top priority, and switches focus to fixing it as quickly as possible. Suddenly, the sprint plan is invalid, and the PM is left with the often difficult task of figuring out which tasks will get pushed out of the sprint to accommodate these unplanned events. This may mean pushing dates and having difficult conversations with project stakeholders.

Before we jump into the solution, I want to define a term for everyone.

The hidden cost of context switching: any time someone diverts their attention from one task to another and then switches back, they pay a significant cost to change mindsets and regain focus each time they switch.

Even once the fire is out, the engineers who were pulled in still pay the hidden cost of context switching when they return to their sprint tasks.

How can we guard against derailing the sprint like this?

Enter: The firefighter

You might be thinking, “What does a firefighter have to do with software development?”

The concept of an engineering firefighter is simple. Imagine the following workflow for a volunteer firefighter:

  1. Someone calls 911 to report a fire

  2. The fire department is dispatched to the fire

  3. The volunteer firefighter leaves their current task to respond to the fire

  4. Once on the scene, the first unit will “size up” the scene and evaluate the need for additional resources

  5. All required resources work together to put the fire out

  6. Cleanup

  7. The firefighter returns to their normal life

This is a role that I’ve seen used in multiple workplaces. It can be adapted for many different situations, but the concepts are the same. An engineer is assigned to the role as part of a rotation. The needs of your team can dictate how long someone stays in the role, but I’ve generally seen it last either one week or the length of the sprint. For the most part, this engineer still works on sprint tasks, but their time is not counted during sprint planning. Of course, this may need to be adjusted depending on the size and makeup of your team.

But now the team is overdelivering!!!!

You might be right! But which of these is the easier conversation to have with the project stakeholders:

Great news! We are ahead of schedule and we can deliver early!

Or:

Sorry, but we are going to have to push this deadline again :(

Maybe you don’t block off all of their time; that depends on your engineers’ support workload. I’ve seen this role best utilized when it is the full-time responsibility of the engineer assigned to it. When the on-duty engineer is working on sprint tasks, they usually stick to smaller tasks that they can step away from without losing significant productivity to the hidden cost of context switching. Developing tooling to ease support tasks is also a good use of this engineer’s time, if your team makeup allows for it.
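To make the rotation itself concrete, here is a minimal sketch of a round-robin schedule. It assumes a fixed list of engineers, a known start date for the rotation, and a fixed rotation length (one week or one sprint, as mentioned above). The function name `on_duty` and its parameters are hypothetical, not taken from any particular tool.

```python
from datetime import date


def on_duty(engineers: list[str], rotation_start: date, today: date,
            rotation_days: int = 7) -> str:
    """Return who holds the (hypothetical) firefighter rotation on `today`.

    Simple round-robin: each engineer serves `rotation_days` consecutive
    days, then the role passes to the next person in the list.
    """
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    if rotation_days <= 0:
        raise ValueError("rotation_days must be positive")
    # Number of complete rotation periods since the rotation began.
    shift = (today - rotation_start).days // rotation_days
    # Wrap around the list so the rotation repeats indefinitely.
    return engineers[shift % len(engineers)]


team = ["alice", "bob", "carol"]
start = date(2024, 1, 1)
print(on_duty(team, start, date(2024, 1, 8)))                     # second week
print(on_duty(team, start, date(2024, 1, 15), rotation_days=14))  # second two-week sprint
```

A round-robin keeps the load evenly spread and makes it easy for PMs and other stakeholders to know who is on duty; a real team would likely also need a way to handle swaps, vacations, and new hires.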

So what does a day in the life of an engineering firefighter look like?

Adapting the steps above to software, we would end up with something like this:

  1. A bug ticket comes in

  2. The on-duty firefighter is notified of the new ticket

  3. The firefighter stops what they are doing to respond to the fire

  4. The bug is triaged to see if it is something that will require immediate attention, more engineers, other teams, etc.

  5. All required resources work together to put the fire out.

  6. Post-mortems, incident retros, etc.

    1. These should be done while the incident is still fresh in the firefighter’s mind

  7. The firefighter resumes sprint tasks.

How does this help?

To put it simply, having an engineer dedicated to this role changes the state of mind for everyone involved.

  • The engineer who is on duty will have the mindset of being ready to respond.

  • The remainder of the engineering team knows that they are not on duty and that they will be able to stay heads-down on their sprint tasks without worrying that a support issue will derail them.

  • The team as a whole pays far less of the hidden cost of context switching.

  • PMs and other interested parties have a specific person they can contact about support issues.

What qualifications should the firefighter have?

This role works best when the person responding to incidents has intimate knowledge of the product. They don’t need to be an expert in every aspect of the product, but they should know how the pieces fit together and have basic knowledge of all parts of the stack. The most important qualification they can have is good communication skills. They need to be able to effectively communicate the impact of a bug, a timeline to fix it, and last but certainly not least, be able to say “I don’t know” and ask for additional help.

One of the hardest things for engineering teams to plan for is the work you can’t plan for. Ideally, we would never have bugs and there would be no need for this role. We all know that will never be true, but we can plan for it and mitigate its effect on the team through good planning and thoughtful use of people. You are effectively giving up one person for planning purposes, but on a team that spends a significant part of its workload on support issues, the hidden cost of context switching is almost certainly higher than the cost of dedicating a single person to those issues.