Software-as-a-service organizations live and die on service reliability and uptime. Having an “on call” team (formal or otherwise) that can quickly react to incidents and outages is mission critical to the business. Hootsuite is no different – our customers around the globe rely on Hootsuite.com being available 24/7 – but even software engineers need sleep. So, who holds down the fort if the unthinkable happens outside business hours?
This year we overhauled our On Call practice to version 3.0. This post walks through the evolution of our On Call model: the stark reality of 1.0, the noble intentions of 2.0, and the practicality of 3.0.
On Call 1.0
Our first On Call system was informal and happened reactively, out of necessity. A handful of original software engineers and operations engineers who built Hootsuite.com were unofficially on call all. the. time. No sleep allowed, no knowledge transfer, no metrics, and no visibility into incidents or issues or remedies.
This persisted for months and months, until the demands of a rapidly growing customer base and product complexity had those engineers close to the breaking point. Burning out our engineers around an inequitable and haphazard routine was totally unreasonable, not to mention an unproductive and inefficient way to respond to incidents and outages.
On Call 2.0
Our Engineering and Operations teams came up with a lightweight model that emphasized transparency within our organization, distribution of responsibility beyond a small group, discipline, and education via the experience of real-world troubleshooting:
- A four-person team of two software engineers, one ops, and one coordinator
- Voluntary sign up
- Weekly rotations (Wednesday to Wednesday + formal hand-over)
- Post mortems to find the most likely root causes for any incidents
- Publishing outages, post-mortems, and planned remedies internally
- Following up on planned remedies as a way to bullet-proof ourselves against future failures
This was an effective improvement for about six months. Two issues eventually surfaced. First, because teams were handling both On Call and regular project work, it was difficult to make meaningful improvements to stabilize our systems (beyond the absolutely necessary remedies outlined in post mortems). Second, we often learned about system failures from our customers rather than from our own systems.
On Call 2.1
A conversation with our friends at Yammer inspired our next iteration. At that time, Yammer engineers on call had a single focus: be on call. That made sense: ownership of a problem and time dedicated to solving it should lead to better outcomes. So, we changed our team’s focus – their project work was now secondary to on call. Their primary objective and responsibility was to pay it forward and “make it easier for the next rotation to see, find, and fix issues”.
This change resulted in some incredibly useful system improvements, such as centralized status boards, a move to a new monitoring tool, graphing improvements, and automated call-tree generation using Twilio.
One of the main issues with our process was that it relied on volunteers – a good idea in theory, but not as reliable in real life. In the eight months after On Call 2.0, everyone had access to a shared Google spreadsheet and was encouraged to sign up for future shifts. Initially, interest and sign ups were high, but enthusiasm waned over time. This led to a number of challenges, including:
- A lack of signups meant uncertainty of On Call teams beyond the next two weeks
- A severe lack of interest in working holidays or contentious dates
- Inevitable reliance on a small, consistent group to cover most of the work
- A lot of management time spent running around encouraging volunteers
During a retrospective on our model, most team members told us they often felt helpless during an outage. Additionally, we were asking our On Call team to support a small number of very sketchy systems that included a handful of SPOFs (single points of failure), special snowflakes (manually configured production machines that could not be easily recreated in the event of catastrophe), and systems with no documentation or monitoring. When an outage occurred on one of these systems, the negative impact on the morale of our teams was huge. Something had to change.
On Call 3.0
We set these objectives to guide our next iteration:
- Set the Operations and On Call teams up for success
- Establish clear roles and responsibilities
- Establish a fair and predictable schedule
- Provide education and training
- Record metrics
What has worked for us (so far)
Set the Operations and On Call Teams up for Success
We established a contract between Operations and our software engineers: allow Operations to define what a supportable system is, empower them to decide what can be supported, and push back to our software engineers if necessary. Here are some examples of their requirements for a supportable system:
- Redundancy (no single points of failure)
- AMIs/Ansible scripts (no manually configured production instances)
- Monitoring (no system state mysteries)
- Documentation (playbooks, system overviews, etc)
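One way to make a contract like this concrete is to encode the requirements as a checklist that can be run before a system is handed over. The following is a minimal sketch under that assumption, not a description of Hootsuite's actual tooling; `SystemProfile`, its fields, `unmet_requirements`, and the example system name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SystemProfile:
    """Hypothetical description of a system proposed for On Call support."""
    name: str
    redundant: bool               # no single points of failure
    built_from_automation: bool   # AMIs/Ansible, no hand-configured instances
    monitored: bool               # no system state mysteries
    documented: bool              # playbooks, system overviews, etc.


def unmet_requirements(system: SystemProfile) -> List[str]:
    """Return the supportability requirements the system does not yet meet."""
    checks = {
        "redundancy": system.redundant,
        "automated provisioning": system.built_from_automation,
        "monitoring": system.monitored,
        "documentation": system.documented,
    }
    return [requirement for requirement, met in checks.items() if not met]


# Illustrative system only; not a real Hootsuite service.
legacy = SystemProfile(
    name="legacy-scheduler",
    redundant=False,
    built_from_automation=True,
    monitored=True,
    documented=False,
)
gaps = unmet_requirements(legacy)
print(f"{legacy.name} is missing: {', '.join(gaps) or 'nothing'}")
```

An empty list would mean the system meets every requirement in the contract; anything else gives Operations a specific list to push back with.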
Establish Clear Roles and Responsibilities
We restructured our On Call team to be leaner by scaling down from four people to three: an Operations Engineer, a Software Engineer, and an Incident Coordinator. The call-tree also works in that order. This team is our first line of defense, and if they are unable to respond to an outage, they can call upon any number of specialists (experts in our various functional areas) to help. The specialists can be called on to either guide the software engineer or take on problems directly. The Incident Coordinator is there to marshal the response, communicate statuses to the rest of the company, and conduct post-mortems.
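The call-tree order described above can be sketched in a few lines. This is an illustration only: the role strings, `build_call_tree`, `escalate`, and the `acknowledges` callback are hypothetical, and the sketch assumes specialists are simply appended after the three first-line roles, whereas in practice the team decides when to bring them in.

```python
from typing import Callable, List, Optional

# First line of defense, in the order the call-tree is described in the post.
FIRST_LINE = ["operations_engineer", "software_engineer", "incident_coordinator"]


def build_call_tree(specialists: List[str]) -> List[str]:
    """Return first-line roles in order, followed by any specialists."""
    return FIRST_LINE + list(specialists)


def escalate(call_tree: List[str],
             acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Try each contact in order and return the first one who acknowledges."""
    for contact in call_tree:
        if acknowledges(contact):
            return contact
    return None  # nobody acknowledged; a real system would retry or alert wider


# Hypothetical scenario: the operations engineer misses the page.
tree = build_call_tree(["publishing_specialist"])
responder = escalate(tree, lambda contact: contact == "software_engineer")
print(responder)
```

Keeping the order in a single list makes the escalation path easy to audit and to publish alongside the schedule.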
Establish a Fair and Predictable Schedule
We moved our on call scheduling from a Google spreadsheet to PagerDuty (PD). We were already using PD for our Operations Engineers, so adding the Software Engineers was simple. In PagerDuty, we have our engineering contact information, schedules, and our escalation policies in a central location with access for everyone involved. As an added bonus, PagerDuty has a pretty nice API that we’ve started using to automate our support tools. We also expect that all engineers participate in On Call.
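To show what "fair and predictable" can mean in practice, here is a minimal round-robin sketch. It does not call the PagerDuty API. It assumes one-week shifts with a Wednesday hand-over, as in the 2.0 model, which may not match the current schedule. The engineer names and the `next_wednesday` and `build_rotation` helpers are hypothetical.

```python
from datetime import date, timedelta
from itertools import cycle
from typing import List, Tuple

WEDNESDAY = 2  # date.weekday() numbers Monday as 0


def next_wednesday(start: date) -> date:
    """Return `start` if it is a Wednesday, otherwise the following Wednesday."""
    return start + timedelta(days=(WEDNESDAY - start.weekday()) % 7)


def build_rotation(engineers: List[str], start: date,
                   weeks: int) -> List[Tuple[date, date, str]]:
    """Assign engineers round-robin to consecutive one-week shifts."""
    shift_start = next_wednesday(start)
    schedule = []
    for _, engineer in zip(range(weeks), cycle(engineers)):
        shift_end = shift_start + timedelta(weeks=1)
        schedule.append((shift_start, shift_end, engineer))
        shift_start = shift_end  # hand-over happens on the next Wednesday
    return schedule


# Placeholder names; with three engineers, each is on call every third week.
schedule = build_rotation(["ana", "ben", "chen"], date(2024, 1, 1), weeks=4)
for begins, ends, engineer in schedule:
    print(f"{begins} to {ends}: {engineer}")
```

Because the order is fixed, everyone can see months ahead when their shift falls, and nobody is on call more often than anyone else.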
Provide Education and Training
We implemented training that includes a Failure Friday program at Hootsuite, where our Operations team walks our Software Engineers through outages to build their confidence around our systems and our outage response processes. The purpose is to share Operations expertise and knowledge, so everyone on call feels empowered during their rotation and in their everyday work.
Record Metrics
This is still a challenge and an opportunity. At first, we simply tracked “did we learn about this failure before our customer?”; then we added time-to-respond and time-to-resolve. These are not tracked over time because each incident is usually quite different. Etsy has been a pioneer in this area and a source of inspiration and tooling.
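The three measurements mentioned above can be derived from a handful of timestamps per incident. The sketch below is illustrative only: the `Incident` record and its field names are hypothetical, and it assumes both durations are measured from the moment monitoring detected the failure, which is one of several reasonable definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; all field names are illustrative."""
    detected_at: datetime        # when our own monitoring raised the alert
    acknowledged_at: datetime    # when someone on call responded
    resolved_at: datetime        # when service was restored
    customer_reported_at: Optional[datetime] = None  # first customer report

    def detected_before_customer(self) -> bool:
        """Did we learn about this failure before our customer?"""
        return (self.customer_reported_at is None
                or self.detected_at < self.customer_reported_at)

    def time_to_respond(self) -> timedelta:
        return self.acknowledged_at - self.detected_at

    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.detected_at


# Made-up timestamps for illustration only.
incident = Incident(
    detected_at=datetime(2024, 1, 3, 2, 0),
    acknowledged_at=datetime(2024, 1, 3, 2, 5),
    resolved_at=datetime(2024, 1, 3, 2, 45),
    customer_reported_at=datetime(2024, 1, 3, 2, 10),
)
print(incident.detected_before_customer(),
      incident.time_to_respond(),
      incident.time_to_resolve())
```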
Functional Area Rotations
One idea we considered was to establish a number of rotations around our functional areas and have the On Call team composed of engineers who worked on these products. Though the match between engineer and area would be perfect, we have many functional areas that vary greatly in size. With this approach, engineers in the smaller areas would be on call every other week, and that just isn’t fair.
Use a Dedicated On Call Team
A dedicated team is potentially a great solution, but it’s not a timely fix. We may consider transferring the responsibilities of the On Call team to a dedicated group of experts in the future, but we needed an immediate solution, which left no time for recruitment and training.
On Call 4.0?
At Hootsuite, we embrace iteration. This version of On Call is the most recent, but it’s certainly not the last. Our engineers constantly improve our process through retrospectives and by having dedicated time to build solutions that make On Call better.
Our solution evolved through experimentation and feedback from the participants and from stakeholders such as Customer Support. Every company has its own On Call solution – what’s yours? Leave a comment.
To summarize, these are the items we found helpful:
- Establish an Operations Contract that includes criteria for hand off from engineering teams
- Establish a rotation for all your software engineers, making the rotation as visible as possible. Consider using a third-party tool like PagerDuty.
- Give your engineers the opportunity to make On Call better. Consider making it their primary responsibility.
- Turn On Call from a burden into an opportunity to learn from failure
- Measure your impact so you can judge your progress and celebrate your accomplishments
About the author: Ken is a man of leisure, gentleman programmer, father, and Director of Engineering – Publishing @ Hootsuite. Follow him on Twitter: @k_britton