Your application runs 24/7, but does your team?
What happens when a major user experience (UX) flaw is discovered that keeps highly engaged users from converting? Or, worse yet, when a hacker breaches your system and there’s no IT person around to see it?
It may be less natural to picture your engineering team on call than your doctors, but on-call engineering is becoming increasingly common. As applications become more complex, your user base grows, and the business demands little to no application downtime, how can you ensure 24/7 support?
- What is on-call engineering?
- On-call engineering pros and cons
- 10 on-call engineering best practices
What is on-call engineering?
On-call engineering is the practice of keeping several engineers on a rotating “on-call” schedule so that someone is always available, much as doctors, lawyers, and firefighters stay ready to respond to critical needs.
During regular working hours, the engineers on rotation keep an eye on critical application elements to ensure they’re working as planned, watching for both bugs and security vulnerabilities. After hours, on-call engineers check in on the application regularly and are the “first responders” if something goes down. Having them available at a moment’s notice means your application can recover faster, with minimal risk or impact to your customers (and revenue!).
On-call engineering pros and cons
Pros
Your application is supported in case of any crashes. If something goes sideways, you have a dedicated team that can respond immediately. Better yet, you can combine engineers from different disciplines (for example, DevOps, site reliability, and security) to cover your application from every angle. This enables your on-call team to be self-sufficient in handling any problems.
You build confidence in your team to take initiative. When issues arise, your team needs to be able to react fast. Team members need to be able to trust their instincts, build a solution in the moment, and work as a team to get the solution implemented ASAP. Giving them the opportunity to face these challenges on their own (ideally with a set of resources, a handbook, or your contact information also available) helps them get comfortable trusting their own decision-making skills.
Cons
Employees can feel like they’re working overtime. In some cases, this might be true. If a major element of the application crashes after regular working hours and the on-call engineers need to spend a considerable amount of time patching the issue, employees can feel overworked. You can mitigate this by managing working hours and compensation, but some employees will still prefer to work a fixed set of hours each day.
You’re adding the administrative burden of scheduling rotations. Depending on the size of your team, scheduling your on-call engineers can be just another item on an already long to-do list. Automation tools can make this process easier, but they may take away an element of control, so be prepared to give up either some of your time or some of your control.
10 on-call engineering best practices
- Set a clear rotation
- Provide sufficient training
- Look for signs of burnout
- Define clear responsibilities
- Create a psychologically safe on-call culture
- Foster trust between engineers
- Be transparent
- Separate on-call duties from other engineering responsibilities
- Monitor on-call reports and provide feedback
- Assign primary and secondary responders
1. Set a clear rotation
One way to lower the administrative burden of adopting on-call engineering practices is to make the schedule repeatable. A set rotation helps engineers plan their daily schedules further in advance and settle into a routine (a plus, since many people work best with routines!). It also gives them an opportunity to build stronger relationships with the other engineers on their rotation, which will come in handy when they have to work out a solution together under time pressure.
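To make the idea of a repeatable schedule concrete, here is a minimal Python sketch of a round-robin rotation. Everything in it is an illustrative assumption rather than something prescribed by this article: the names `build_rotation` and `on_call_for`, the one-week shift length, and the sample team. Each entry in the team list could just as easily be a pair or small group of engineers who share a shift.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, weeks, shift_days=7):
    """Return a list of (shift_start, shift_end, engineer) tuples.

    Engineers are assigned in a fixed, repeating order so that everyone
    can predict their shifts well in advance.
    """
    if not engineers:
        raise ValueError("need at least one engineer to build a rotation")
    schedule = []
    # cycle() repeats the team order forever; islice() takes one entry per shift.
    for i, engineer in enumerate(islice(cycle(engineers), weeks)):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, engineer))
    return schedule


def on_call_for(schedule, day):
    """Return whoever is on call on the given day, or None if uncovered."""
    for shift_start, shift_end, engineer in schedule:
        if shift_start <= day <= shift_end:
            return engineer
    return None


if __name__ == "__main__":
    team = ["Avery", "Blake", "Casey"]  # hypothetical team members
    for shift_start, shift_end, who in build_rotation(team, date(2024, 1, 1), weeks=6):
        print(f"{shift_start} to {shift_end}: {who}")
```

Because the order never changes, the same few lines can regenerate the schedule for any future period, which is what keeps the administrative cost low.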
2. Provide sufficient training
Training your team is a great way to build confidence. Whether you choose to bring in an external consultant to guide you on the process for the first time or you choose to incorporate internal training exercises, offering some resources for your team to learn from will give them foundational knowledge to kick off their first few on-call shifts.
If you have someone on the team who has worked in an on-call environment before, ask them if they would be open to sharing a few tips and tricks. If your team is operating on a low budget or cannot free an entire day for a consultant to come in, consider doing online research on best practices (like those in this article!) or reach out to your network for advice on getting started. Sharing all the insights you have that could help your team ace on-call engineering builds trust and alignment.
3. Look for signs of burnout
Developer burnout is common. Signs include a lack of engagement in projects, constant fatigue, and a decrease in productivity or quality of work. As a manager, you can take steps to identify and mitigate burnout by having regular check-ins with each member of your team. People can be good at hiding burnout in group settings, so regular one-on-ones can help you see the signs sooner.
4. Define clear responsibilities
It’s easier to take on a new role if it’s crystal clear what the new role entails. Outline things like the hours of on-call shifts, the responsibilities that each on-call member has during that time, how the on-call role works in conjunction with the rest of each individual’s role, and what to do in case of emergencies. Preparing your team with as much information as possible can enable them to create the best solutions when they’re on call.
Note that defining clear responsibilities might also mean having one person with special access to other systems or controls during the on-call period. Everyone who is on call should know who this authorized person is, and how to contact them if needed.
5. Create a psychologically safe on-call culture
A psychologically safe culture invites employees to take risks and safely experiment within their projects without fear of punishment or negative consequences if their proposed solution fails. For on-call engineers, this means being able to implement a workable solution with little to no time, and possibly few resources. As a manager, you can create this environment by encouraging your developers to take smaller risks in their everyday work and by integrating constructive feedback cycles. This helps employees identify areas for improvement in a positive setting while also preparing them to face risky decisions in the future.
6. Foster trust between engineers
Not only should you build trust between yourself and your engineers as you build a psychologically safe environment, but it’s also critical that this trust is built between your team members. Having trust amongst the team is essential, as it will allow team members to work together effectively during times of crisis. You can start fostering trust through hosting brainstorming sessions, encouraging teamwork, and placing engineers on the same rotations with each other.
7. Be transparent
As you adopt on-call engineering practices, you might face challenges. Being transparent with your team as you navigate these challenges lets team members see the full picture of how the transition into the new processes works. Additionally, transparency is a great trust builder! Individuals who feel included and up to date on the challenges faced by the team can leverage teamwork to solve problems with others.
8. Separate on-call duties from other engineering responsibilities
On-call engineering means that the developers on rotation need to prioritize the critical issues at hand over their regular responsibilities. It’s up to you as the manager to decide whether on-call work is substantial enough to temporarily set aside other engineering responsibilities entirely. Alternatively, you might have on-call engineers continue regular work until an issue arises that needs their immediate attention. Establishing this distinction early as you adopt on-call practices can help your team build this balance into their schedule more efficiently.
9. Monitor on-call reports and provide feedback
Giving constructive feedback to your team is one of the best ways you can help them learn. This is especially true when you’re in the early stages of adopting this new practice.
With a tool like Fellow, you can easily add notes into meetings and attach reports to your meeting agendas to create talking points quickly. From there, you can also add action items so both you and your engineers can go back to see what was discussed during your meeting and what needs to be completed afterward. This helps ensure that feedback is retained and practiced consistently.
10. Assign primary and secondary responders
Assigning primary and secondary responders is an important part of clearly outlining the responsibilities of on-call team members. The primary responder is the first person who will take on an issue when it arises, and the secondary responder will support the primary responder if additional help is required. Depending on the size and complexity of your application, you might consider having multiple responders who are each responsible for a specific part of the application. Just ensure that each primary has a secondary available to assist them.
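As a rough illustration of the primary and secondary arrangement described above, the Python sketch below maps each part of an application to a primary and a secondary responder and checks that no primary is left without backup. The area names, people, and the `Coverage`, `validate`, and `responders_to_page` names are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    primary: str    # first person to take on an issue
    secondary: str  # supports the primary if more help is required


# Hypothetical application areas and responders, for illustration only.
COVERAGE = {
    "payments": Coverage(primary="Avery", secondary="Blake"),
    "login": Coverage(primary="Casey", secondary="Devon"),
}


def validate(coverage):
    """Raise ValueError unless every area has a distinct primary and secondary."""
    for area, pair in coverage.items():
        if not pair.primary or not pair.secondary:
            raise ValueError(f"{area}: both a primary and a secondary are required")
        if pair.primary == pair.secondary:
            raise ValueError(f"{area}: the secondary must be a different person")


def responders_to_page(coverage, area, needs_help=False):
    """Return who to contact for an area: the primary, plus the secondary on request."""
    pair = coverage[area]
    return [pair.primary, pair.secondary] if needs_help else [pair.primary]


if __name__ == "__main__":
    validate(COVERAGE)
    print(responders_to_page(COVERAGE, "payments"))
    print(responders_to_page(COVERAGE, "payments", needs_help=True))
```

Running a check like `validate` whenever the assignments change is one simple way to catch an area whose primary has no one available to assist.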
It’s exciting to think of your engineering team working together more collaboratively toward a more stable, secure application. As you transition to on-call engineering, keep communication channels with your team open to sort out any kinks in the process as early as possible. If you build a safe environment where your team members feel comfortable calling out obstacles in the on-call program, you’ll be able to move towards a more efficient program sooner!