How to Fix an Incident Before It Happens

It requires predictive maintenance, including monitoring brownout and performing control actions
Summary

The best way to avoid downtime from an incident is to fix it before it happens.

That requires predictive maintenance. Here are 2 approaches that work for me:

1. Monitoring brownout

Things usually brown out before they black out, so that's when you want to see what's going on.

At AWS, my engineers used to count device errors on all our devices and then:
- correlate the number in a given period with a subsequent device failure, or
- see when device latency started to exceed some normal range.

Then we’d proactively shift load away from those bad resources during a maintenance window, rather than having them fail under load.

We now use this process at Shoreline.
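As a rough illustration of the two signals described above, here is a minimal Python sketch. It is not the AWS or Shoreline implementation: the device names, the `max_errors` threshold, and the "baseline mean plus three standard deviations" definition of a normal latency range are all assumptions made for the example.

```python
from statistics import mean, stdev


def latency_out_of_range(baseline_ms, recent_ms, sigmas=3.0):
    """Return True if recent mean latency exceeds the assumed normal range.

    The normal range is assumed to be baseline mean + sigmas * stdev.
    """
    upper = mean(baseline_ms) + sigmas * stdev(baseline_ms)
    return mean(recent_ms) > upper


def find_brownouts(devices, max_errors=5, sigmas=3.0):
    """Return sorted ids of devices that look degraded but have not failed yet."""
    flagged = []
    for device_id, stats in devices.items():
        too_many_errors = stats["errors"] > max_errors
        slow = latency_out_of_range(stats["baseline_ms"], stats["recent_ms"], sigmas)
        if too_many_errors or slow:
            flagged.append(device_id)
    return sorted(flagged)


# Hypothetical sample data: error count in the period plus latency samples.
devices = {
    "disk-a": {"errors": 0, "baseline_ms": [10, 11, 9, 10, 10], "recent_ms": [10, 11, 10]},
    "disk-b": {"errors": 9, "baseline_ms": [10, 11, 9, 10, 10], "recent_ms": [10, 10, 11]},
    "disk-c": {"errors": 1, "baseline_ms": [10, 11, 9, 10, 10], "recent_ms": [25, 30, 28]},
}

print(find_brownouts(devices))  # ['disk-b', 'disk-c']
```

Devices flagged this way are candidates to drain during the next maintenance window. In practice the thresholds would be tuned by correlating past error counts with actual failures, as described above.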

2. Performing control actions

Here, we adapt feedback control theory from industrial systems.

The basic concept is that you have the desired state, the observed state, and the error between the two.

Your goal is to create a control loop whose actions reduce the gap between the observed state and the desired state.

The more frequently you sample, the smaller the control action needs to be.

Let’s understand this with an example: when I started driving, I’d swerve, making big corrections with the steering wheel.

Now I don’t need to, because I keep making little adjustments without even paying much attention.
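The same idea can be sketched in a few lines of Python. This assumes a simple proportional controller, which the summary does not specify. The function name, the gain, and the numbers are hypothetical and chosen only to show how the size of each action shrinks as the sampling rate goes up.

```python
def run_control_loop(desired, observed, gain_per_second, samples_per_second, duration_s):
    """Simulate a proportional control loop.

    Each sample computes the error (desired - observed) and applies an
    action proportional to that error and to the time since the last sample.
    Returns (final observed state, largest single action taken).
    """
    interval_s = 1.0 / samples_per_second
    largest_action = 0.0
    for _ in range(duration_s * samples_per_second):
        error = desired - observed
        action = gain_per_second * interval_s * error
        observed += action
        largest_action = max(largest_action, abs(action))
    return observed, largest_action


# Same target, same gain, same duration: only the sampling rate differs.
for rate in (1, 10):
    final, biggest = run_control_loop(100.0, 0.0, 0.5, rate, 10)
    print(f"{rate:>2} samples/s: final={final:.2f}, largest action={biggest:.2f}")
```

Sampling once per second, the first correction is 50 units. Sampling ten times per second, no single correction exceeds about 5 units, yet both loops end close to the target. That is the steering-wheel effect: many small adjustments instead of a few large swerves.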

At Shoreline, we keep running these control loops.

Every second, we scrape thousands of metrics, compare them against thousands of alarm conditions, and take small control actions that each make things a little bit better.

This helps keep the systems our customers manage using Shoreline running smoothly.
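To make the scrape, compare, and act cycle concrete, here is a minimal Python sketch of a single pass. It is an illustration only, not Shoreline's product or API: the `Alarm` class, the metric names, the thresholds, and the actions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alarm:
    name: str
    metric: str                          # which metric this alarm watches
    condition: Callable[[float], bool]   # True when the value needs attention
    action: Callable[[], str]            # small corrective step to take


def evaluate(metrics: Dict[str, float], alarms: List[Alarm]) -> List[str]:
    """Compare one scrape of metrics against every alarm; run matching actions."""
    taken = []
    for alarm in alarms:
        value = metrics.get(alarm.metric)
        if value is not None and alarm.condition(value):
            taken.append(alarm.action())
    return taken


# Hypothetical alarms; real actions would change the system, not return a string.
alarms = [
    Alarm("disk-filling", "disk_used_pct", lambda v: v > 85.0, lambda: "rotate logs"),
    Alarm("queue-backlog", "queue_depth", lambda v: v > 100.0, lambda: "add worker"),
]

# One hypothetical scrape; a real loop would repeat this every second.
metrics = {"disk_used_pct": 91.0, "queue_depth": 3.0}
print(evaluate(metrics, alarms))  # ['rotate logs']
```

Because the pass runs frequently, each action can stay small and low risk, which matches the point about sampling frequency above.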

That’s how we reduce downtime: by fixing incidents before they happen.
