The hidden cost of toil
02 Mar 2025
Catamaran Cruise, Mykonos, August 2024. @babafemio
As a technology leader, I’ve observed how unnecessary toil can subtly impact even the best workflows and the performance of the most hardworking teams. So, what exactly is toil, and why must Site Reliability Engineers (SREs) address it? In this article, we’ll explore the essence of toil and its effects on teams. Most importantly, we will discover effective strategies to minimize it through key SRE principles.
What is Toil?
Toil is not merely about hard work or tedious tasks; it refers to the effort required to restore your service to acceptable levels after a failure or performance degradation. For instance, when your service is operating at 99.9% uptime and experiences a sudden increase in customer load that drops the uptime to 99.5%, or if a bug is introduced after deploying a new feature, all the work you do to bring the service back up to that 99.9% level is considered toil. Activities like patching servers, applying fixes, and troubleshooting issues are all examples of toil.
The SRE perspective
So, why should Site Reliability Engineers care about minimizing toil? The answer lies in SRE’s six core principles, prioritising availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. By focusing on these areas, SREs can reduce the time spent on toil and increase their overall productivity. The six core principles that underlie the SRE approach are as follows:
- Operations as a Software Problem: Treat operations challenges as software engineering problems
- Manage by Service Level Objectives (SLO): Define and track clear service level objectives
- Work to Minimize Toil: Actively reduce manual, repetitive work through automation
- Move Fast by Reducing the Cost of Failure: Design systems that can safely fail and recover quickly
- Shared Ownership with Developers: Foster collaboration between SRE and development teams
- Use the Same Tooling Regardless of Function or Title: Maintain consistency in tools across teams
Reducing toil
So, how can you minimize toil in your SRE workflow? It starts with automation. By automating repetitive tasks and monitoring your system for potential issues, you can reduce the time spent on manual troubleshooting and break-fix work. For example, in the case where a new bug reduces the uptime of a service, the SRE should consider implementing automated rollbacks if SLOs are being impacted after deployment. This is assuming the team has implemented a continuous integration/continuous deployment (CI/CD) pipeline that has automated the deployment. Manual deployment is also toil, by the way.
Another way to reduce or eliminate toil is to address the root causes that create it in the first place. For activities such as applying patches, the Infrastructure or SRE teams could consider implementing Infrastructure-as-Code, which may make patching unnecessary. For instance, a team might establish a policy to deploy brand-new, up-to-date servers into production on a monthly basis using tools like Ansible or Terraform. This approach has the added benefit of ensuring that the servers or virtual machines are clean, free from configuration drift, and, in some cases, devoid of malware.
The benefits of minimizing toil
Reducing toil offers significant benefits for Site Reliability Engineers (SREs) and organizations as a whole. By minimizing manual work and maximizing automation, you can:
-
Increase efficiency and productivity: With less time spent on repetitive tasks, your team can concentrate on more strategic projects. This allows you to allocate resources to tackle more complex issues.
-
Enhance customer satisfaction: By maintaining consistent service availability and optimal performance, you can improve the overall user experience.
-
Improve work-life balance: For many SREs, toil can be overwhelming, consuming much of their free time. However, when automation handles these routine tasks, you have the opportunity to focus on personal projects, spend quality time with loved ones, or simply relax and recharge.
Putting it into practice
So, how do you get started with minimizing toil in your SRE workflow? Here are a few concrete steps: Automate repetitive tasks: Use tools like Ansible or Chef to automate configuration management and deployment. Implement monitoring and alerting: Set up monitoring tools like Grafana or AppDynamics to detect potential issues early, and implement alerting systems to notify your team quickly. Focus on high-level tasks: With automation now handling lower-level tasks, focus on higher-level work like product architecture and deployment strategies.
Conclusion
Minimizing toil is a crucial part of the SRE approach. By following these principles and best practices, you can reduce the time spent on manual work and increase your overall productivity. By automating repetitive tasks, implementing monitoring and alerting systems, and focusing on high-level tasks, you can create a more efficient and effective workflow that benefits both your team and your customers.