Automate Based on Frequency not Recency

Beware of recency bias when automating incidents!
2 min
Summary

The most recent incident isn't a great predictor of the next thing that might happen. So when deciding what to automate, resist the urge to automate the most recent incident and focus on the most frequent ones instead.

That's because frequent incidents are the ones that:

  • remove the most toil once automated, giving you more bang for the buck.
  • you understand best, because you fix them again and again.
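
To make the difference between recency and frequency concrete, here is a minimal sketch in Python. The incident log, its incident types, and its dates are all invented for illustration; a real log would come from your ticketing or paging system.

```python
from collections import Counter
from datetime import date

# Hypothetical incident log: (date resolved, incident type).
# Every entry here is made up for the example.
incidents = [
    (date(2023, 1, 3), "disk_full"),
    (date(2023, 1, 9), "cert_expired"),
    (date(2023, 1, 12), "disk_full"),
    (date(2023, 1, 20), "jvm_oom"),
    (date(2023, 1, 25), "disk_full"),
    (date(2023, 2, 1), "jvm_oom"),
    (date(2023, 2, 6), "dns_misconfig"),
]


def most_recent(log):
    """Return the incident type of the latest entry (the recency-bias pick)."""
    return max(log, key=lambda entry: entry[0])[1]


def rank_by_frequency(log):
    """Return (incident type, count) pairs, most frequent first."""
    counts = Counter(kind for _, kind in log)
    return counts.most_common()


print("Most recent:  ", most_recent(incidents))
print("Most frequent:", rank_by_frequency(incidents)[0])
```

In this made-up log the latest incident is a one-off, while the type at the top of the frequency ranking has already recurred several times, so that is the one to automate first.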

Of course, it's a bit more complicated than what I just described. Sometimes you want to automate something because:

  • you don't want operators to access customer data on their machines.
  • you want to reduce the chance of someone making a consequential mistake with a manual change.

But mostly, it's bang for the buck. When I was at AWS, I'd make my service teams automate one incident per week, just one. But if you do that consistently, your incidents will be down about 60% over the year, maybe more. That matters a great deal!

Part of what we did with Shoreline is reduce the cost side of the cost-benefit equation so that you don't have to think that hard about what to automate next. SREs are busy, so automation won't happen unless it's easy to do. That's why most automations in Shoreline can be built in a few hours. Fixing something forever takes only about twice as long as fixing it once. I really want it to take the same time or less.
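
The 2x figure implies a simple break-even point, sketched below in Python. This is an illustration under stated assumptions, not a Shoreline calculation: it assumes the automation is built while handling the first occurrence, costs a fixed multiple of one manual fix, and needs no operator time afterwards. The function names and the 1.5-hour fix time are hypothetical.

```python
def manual_cost(fix_hours, occurrences):
    """Operator hours spent fixing the same incident by hand every time."""
    return fix_hours * occurrences


def automated_cost(fix_hours, multiplier=2.0):
    """One-time cost of automating the fix, as a multiple of one manual fix.

    Assumption: once built, the automation needs no further operator time.
    """
    return fix_hours * multiplier


def hours_saved(fix_hours, occurrences, multiplier=2.0):
    """Net operator hours saved by automating instead of fixing by hand."""
    return manual_cost(fix_hours, occurrences) - automated_cost(fix_hours, multiplier)


# A hypothetical incident that takes 1.5 hours to fix by hand.
for n in (1, 2, 10):
    print(f"{n} occurrence(s): {hours_saved(1.5, n):+.1f} hours saved")
```

Under these assumptions the automation breaks even on the second occurrence and is pure savings from the third onward, which is why frequent incidents are the natural candidates.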

But it's already an order of magnitude better than the other systems I know. You just need to be disciplined about continuing this process of automation: One incident every on-call rotation. That's how you improve availability and reduce toil for yourself, your colleagues, and your customers.
