
Runbooks vs Playbooks: explaining the difference

The terms runbooks and playbooks are often used interchangeably by SREs. They are similar, but this post explains the differences so you can pair the two together as part of your operational excellence.
Austin Gunter

Introduction

The terms runbooks and playbooks are often used interchangeably. They are similar: both offer a way to document the tactical and strategic execution of your organization's goals and processes. But there are important differences between the two, and each has its place. Once you understand those differences, you will be able to use each effectively on its own and, more importantly, pair the two to strengthen your operational excellence.

So let's look at the differences between a runbook and a playbook, what each is used for, and how they can be used together.

What Is a Runbook

Runbooks are best defined as a tactical method of completing a task: the series of steps needed to complete some process with a known end goal. Examples range from “Restarting the web services on frontend servers” to “Deploying the newest build of the staging application”.

Runbooks are particularly useful when defining a specific action for an identified problem. They define the exact steps that make the action repeatable and usable as a programmatic approach to problem solving. A well-written runbook not only lowers the difficulty of execution and ensures repeatability; its end goal is to automate the action, at which point the runbook itself is no longer necessary.
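As a rough illustration, a runbook can be modeled as a named, ordered list of steps. The sketch below is only one possible shape: the `Runbook` class and the three step functions are hypothetical stand-ins, and real steps would call your own tooling instead of returning strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Runbook:
    """A named, ordered list of steps toward one known end goal."""
    name: str
    steps: List[Callable[[], str]] = field(default_factory=list)

    def run(self) -> List[str]:
        # Execute every step in order and collect what each one reported.
        return [step() for step in self.steps]


# Hypothetical steps: placeholders for real operational commands.
def drain_connections() -> str:
    return "connections drained"


def restart_web_service() -> str:
    return "web service restarted"


def check_health() -> str:
    return "health check passed"


restart_frontend = Runbook(
    name="Restarting the web services on frontend servers",
    steps=[drain_connections, restart_web_service, check_health],
)

if __name__ == "__main__":
    for report in restart_frontend.run():
        print(report)
```

Because the steps are written down in a fixed order, every run follows the same path, which is the property that makes a runbook a candidate for later automation.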

What Is a Playbook

A playbook, on the other hand, is a little broader. It is the culmination of those tactical processes, combined into a larger plan focused on strategic action. A playbook is a checklist of formal steps and actions, covering anything from “Upgrading fleet-wide OS images” to “Managing a production incident”. Playbooks contain actions that can be automated, but also decisions that need to be made by a human.

This playbook methodology of thinking about a holistic process helps identify where runbook-type processes are used and where they can be replaced by simpler tools or automation. Developers call this approach of reducing copy-and-paste actions “DRY”, or Don’t Repeat Yourself. Ops teams can adopt DRY as well by defining the goal of a process in such a way that it can be summarized by a consistent set of runbooks.

One way to think about the difference between runbooks and playbooks: a playbook is like a book with chapters, and some of those chapters are runbooks.

A Newly Discovered Process

Let’s take a deeper look at how a process can be broken down into these concepts, and some benefits that can come out of this exercise.

We get a report from our customer service teams that our web page is intermittently no longer responding and that the users are complaining.

To start, we must identify a way to reproduce the problem so we know where to begin the investigation. Then, we need to determine the cause of the unavailability. Assuming there is a single server with an issue, we must decide how to mitigate the impact and which actions will keep leadership, partners, and customers informed.

Along with deciding on mitigation, we need to determine if we have enough capacity to handle the load across the remaining servers; we may even need to wait for our subject matter experts to come up to speed on the problem and propose potential fixes for the issue.

We may decide to mitigate the issue by updating our load balancers to remove references to the server that is not serving healthy responses.

If the right infrastructure owners are available, they can take action to remove the server.

We may decide to collect logs to debug the issue further after the issue is resolved.

The infrastructure owners are likely not as well versed in the applications and servers as the developer or operations teams, so there may be a wait while the right teams are engaged.

Then the servers should be restarted and observed to ensure they are serving pages successfully.

Only then can we return our server to our load balancer.

Some follow-up must take place to debug the gathered logs and ensure our customers are updated with the details they need about our outage.

After this, we can take the time to understand what went wrong and what to do moving forward. This manual process may need to be followed for some time, until the root cause is resolved and processes are implemented to speed future investigations.

This series of events can easily be converted into one runbook per task and an overall playbook for managing customer-impacting incidents. Furthermore, addressing the root cause of this particular problem doesn’t invalidate the playbook or runbooks, because they can be recycled for future problems and processes.

Applying Our Books To Our Process

Now let’s look at that same experience with a defined playbook and its corresponding runbooks!

Customer service teams report that our web page is intermittently no longer responding and users are complaining again.

Our first responders can refer to our newly minted playbook:

<code-embed>**Playbook:** Managing Website Outage
    **Playbook:** Mitigate Frontend Application Impact
        Execute **runbook:** Inspect Load Balancer Logs
        If Load Balancer logs report a single server:
            Execute **runbook:** Remove server from Load Balancer configuration
        Execute **runbook:** Collect Server Logs
        Execute **runbook:** Restart Application Servers
        Execute **runbook:** Test Application Server In Isolation
        If Application Server healthy:
            Execute **runbook:** Return Server to Load Balancer Configuration
    **Playbook:** Post-Mortem
        Execute **runbook:** Create Ticket with Server Logs
        Execute **runbook:** Create Chat with Infrastructure & Application Teams
        Execute **runbook:** Communicate with Customers<code-embed>
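The playbook outline can also be sketched in code. The version below is a minimal sketch, not a real automation: `execute` is a hypothetical stand-in that only records a runbook's name, and the two inputs (`unhealthy_servers` and `healthy_after_restart`) are assumed to come from the log inspection and the isolated test, decisions that may still be made by a human.

```python
from typing import List


def execute(runbook: str, executed: List[str]) -> None:
    """Stand-in for running a runbook: it only records the runbook's name."""
    executed.append(runbook)


def mitigate_frontend_application_impact(
    unhealthy_servers: List[str], healthy_after_restart: bool
) -> List[str]:
    executed: List[str] = []
    execute("Inspect Load Balancer Logs", executed)
    # Only pull a server out of rotation when the logs point at exactly one.
    if len(unhealthy_servers) == 1:
        execute("Remove server from Load Balancer configuration", executed)
    execute("Collect Server Logs", executed)
    execute("Restart Application Servers", executed)
    execute("Test Application Server In Isolation", executed)
    # Only return the server once it has proven healthy in isolation.
    if healthy_after_restart:
        execute("Return Server to Load Balancer Configuration", executed)
    return executed


def post_mortem() -> List[str]:
    executed: List[str] = []
    execute("Create Ticket with Server Logs", executed)
    execute("Create Chat with Infrastructure & Application Teams", executed)
    execute("Communicate with Customers", executed)
    return executed


def manage_website_outage(
    unhealthy_servers: List[str], healthy_after_restart: bool
) -> List[str]:
    # The top-level playbook is simply the two nested playbooks in sequence.
    return (
        mitigate_frontend_application_impact(unhealthy_servers, healthy_after_restart)
        + post_mortem()
    )


if __name__ == "__main__":
    # "web-03" is a made-up server name used purely for illustration.
    for step in manage_website_outage(["web-03"], healthy_after_restart=True):
        print(step)
```

Keeping the two conditions as explicit inputs leaves room for a person to make the call while the surrounding runbooks are automated one at a time.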

This curated set of instructions can now be applied to future outages of this nature. This will greatly reduce:

  • The need to wait for experts to be engaged before mitigating
  • The time-to-mitigation
  • The overall stress and human mistakes
  • The overall toil of re-learning the mitigations

Should there be new methodologies, new impact or actions, or even updated processes for handling incidents post-mortem, this playbook can be modified accordingly to ensure there is always a plan, even for the most complex situations.

As the incident lifecycle continues and the root cause is identified and resolved, these same steps apply to any future issue with impact of this nature, and the same runbooks can be linked from other playbooks to reduce the number of times “Inspect Load Balancer Logs” needs to be redefined.
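One way to picture this reuse is to define each runbook once and have playbooks refer to it by name. In the sketch below, the individual runbook steps and the second playbook, “Investigating Slow Responses”, are hypothetical and exist only to show a shared runbook.

```python
from typing import Dict, List

# Each runbook is defined exactly once, keyed by name.
# The step text is illustrative, not taken from a real runbook.
RUNBOOKS: Dict[str, List[str]] = {
    "Inspect Load Balancer Logs": [
        "open the load balancer logs",
        "list servers returning unhealthy responses",
    ],
    "Collect Server Logs": [
        "copy application logs off the affected server",
    ],
}

# Playbooks link to runbooks by name instead of redefining their steps.
PLAYBOOKS: Dict[str, List[str]] = {
    "Managing Website Outage": [
        "Inspect Load Balancer Logs",
        "Collect Server Logs",
    ],
    "Investigating Slow Responses": [
        "Inspect Load Balancer Logs",
    ],
}


def expand(playbook: str) -> List[str]:
    """Flatten a playbook into the concrete steps of its linked runbooks."""
    steps: List[str] = []
    for runbook in PLAYBOOKS[playbook]:
        steps.extend(RUNBOOKS[runbook])
    return steps


def playbooks_using(runbook: str) -> List[str]:
    """List every playbook that links to the given runbook."""
    return [name for name, linked in PLAYBOOKS.items() if runbook in linked]


if __name__ == "__main__":
    print(expand("Managing Website Outage"))
    print(playbooks_using("Inspect Load Balancer Logs"))
```

With this layout, improving “Inspect Load Balancer Logs” in one place updates every playbook that links to it.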

The best part of embracing this methodology is that these same tools can be used as a springboard for maturing your operations processes. You can use these runbooks to build automations that execute these defined processes with the ultimate goal of automating entire playbooks.

Conclusion

Runbooks and playbooks are tools best used in tandem. Used together, they also move your organization towards the greater goal of removing human toil altogether by embracing practices that have been built into development workflows for years.
