
How to Do Continuous Improvement in Operations

Summary

“How do I do continuous improvement in operations?” You do it by creating a culture around it. Let’s understand this with an analogy to Agile and quality improvement.

Much of Agile is about continuous improvement and automation. To make that work, you need to figure out three things:

- an output metric

- an input metric that drives the output metric

- the work item that drives the input metric

In quality:

The output metric = The number of defects that escaped your QA and testing process and made it into the wild

The input metric (my preference) = The percentage of automated testing or your code coverage

The work item = Building test cases

Similarly, in operations:

The output metric = The number of tickets

OR

The output metric = The number of tickets × the duration of the event × the number of people impacted.

I prefer the latter because something that affects a lot of people is more important than something that affects just one.
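To make the difference between the two output metrics concrete, here is a minimal sketch. It assumes each ticket records a duration and a count of people impacted, and it reads the weighted metric as the sum of duration × people impacted over all tickets. The `Ticket` class, its field names, and the sample values are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical fields; a real ticketing system will name these differently.
    category: str
    duration_minutes: float
    people_impacted: int


def ticket_count(tickets):
    """Simple output metric: the number of tickets."""
    return len(tickets)


def weighted_impact(tickets):
    """Weighted output metric: sum of duration x people impacted per ticket."""
    return sum(t.duration_minutes * t.people_impacted for t in tickets)


# Made-up sample data for illustration only.
tickets = [
    Ticket("disk_full", 60, 1),
    Ticket("disk_full", 30, 200),
    Ticket("cert_expired", 10, 5),
]

print(ticket_count(tickets))     # 3
print(weighted_impact(tickets))  # 6110
```

In this sample, the second ticket accounts for almost all of the weighted impact even though it is only one of three tickets, which is the reason to prefer the weighted metric.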

The input metric = The number of automations you've built. It's hard to go back and fix all your code, so you must remediate the issues it causes instead. You need to employ the machine to fix issues in a few seconds rather than having a human do it in an hour or more, especially when many people are impacted.

The work item = Building the automations. How do you do that? The good news is that you get ~100 new tickets every week. Just automate one per week. If you run that loop every week, things will get better and better over time.
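The weekly loop can be sketched as follows. The selection rule here, picking the not-yet-automated ticket category with the highest total impact, is an assumption added for illustration; the summary only says to automate one ticket per week. The function name, tuple layout, and sample data are hypothetical.

```python
from collections import defaultdict


def pick_next_automation(tickets, already_automated):
    """Return the un-automated category with the highest total impact, or None.

    Each ticket is a (category, duration_minutes, people_impacted) tuple.
    """
    impact = defaultdict(float)
    for category, duration_minutes, people_impacted in tickets:
        if category not in already_automated:
            impact[category] += duration_minutes * people_impacted
    if not impact:
        return None
    return max(impact, key=impact.get)


# Made-up tickets from one week, for illustration only.
week_tickets = [
    ("disk_full", 60, 1),
    ("disk_full", 30, 200),
    ("cert_expired", 10, 5),
    ("pod_crashloop", 45, 20),
]

automated = set()

# Week 1: automate the highest-impact category.
choice = pick_next_automation(week_tickets, automated)
automated.add(choice)

# Week 2: the same rule now skips what is already automated.
next_choice = pick_next_automation(week_tickets, automated)

print(choice)       # disk_full
print(next_choice)  # pod_crashloop
```

Each pass adds one automation, so the input metric grows by one per week while the highest-impact ticket sources are removed first.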

That's how you do continuous improvement in operations.
