
How to Fix an Incident Before It Happens

It requires predictive maintenance, including monitoring brownout and performing control actions
Summary

The best way to avoid downtime from an incident is to fix it before it happens.

That requires predictive maintenance. Here are 2 approaches that work for me:

1. Monitoring brownout

Things usually brown out before they black out, so that's when you want to see what's going on.

At AWS, my engineers used to count device errors on all our devices and then:
- correlate the number in a given period with a subsequent device failure, or
- see when device latency started to exceed some normal range.

Then we’d proactively shift load away from those bad resources during a maintenance window, rather than have one fail under load.
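The two brownout signals above can be sketched in a few lines. This is only an illustration, not the tooling we used at AWS or use at Shoreline: the function names, the one-hour window, the five-error limit, and the three-sigma latency threshold are all assumptions picked for the example.

```python
from statistics import mean, stdev

def error_count_exceeded(error_timestamps, now, window_s=3600, max_errors=5):
    """True if more than `max_errors` errors fall within the last `window_s` seconds."""
    # window_s and max_errors are hypothetical values; tune them against real failure data.
    recent = [t for t in error_timestamps if now - window_s <= t <= now]
    return len(recent) > max_errors

def latency_out_of_range(baseline_ms, recent_ms, sigmas=3.0):
    """True if mean recent latency exceeds the baseline mean plus `sigmas` standard deviations."""
    # "Normal range" is assumed here to be mean + 3 sigma of a healthy baseline.
    threshold = mean(baseline_ms) + sigmas * stdev(baseline_ms)
    return mean(recent_ms) > threshold

def devices_to_drain(devices, now):
    """Return the names of devices showing either brownout signal."""
    return [
        name
        for name, d in devices.items()
        if error_count_exceeded(d["errors"], now)
        or latency_out_of_range(d["baseline_ms"], d["recent_ms"])
    ]

# Made-up sample data: disk-b is slow, disk-c is throwing errors.
healthy = [10, 11, 9, 10, 10]
devices = {
    "disk-a": {"errors": [], "baseline_ms": healthy, "recent_ms": [10, 11]},
    "disk-b": {"errors": [], "baseline_ms": healthy, "recent_ms": [30, 35]},
    "disk-c": {"errors": [100, 200, 300, 400, 500, 600],
               "baseline_ms": healthy, "recent_ms": [10]},
}
flagged = devices_to_drain(devices, now=1000)
print(flagged)
```

The flagged devices are the candidates to move load away from in the next maintenance window.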

We now use this process at Shoreline.

2. Performing control actions

Here, we adapt the feedback control theory from industrial systems.

The basic concept is that you have the desired state, the observed state, and the error between the two.

Your goal is to create a control loop whose actions reduce the gap between the observed state and the desired state.

The more frequently you sample, the smaller the control action needs to be.
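Here is a minimal sketch of that idea as a simulation. It assumes a simple proportional controller and a system that drifts away from the desired state at a constant rate; the function name, the drift rate, and the gain are hypothetical choices for illustration.

```python
def run_control_loop(desired, observed, drift_per_s, sample_interval_s,
                     duration_s, gain=1.0):
    """Simulate a proportional control loop.

    Returns (final_observed, largest_action), where largest_action is the
    biggest single correction the loop had to make.
    """
    largest_action = 0.0
    steps = int(duration_s / sample_interval_s)
    for _ in range(steps):
        # The system drifts while we are not looking.
        observed += drift_per_s * sample_interval_s
        # Error is the gap between desired and observed state.
        error = desired - observed
        # The control action is proportional to the error.
        action = gain * error
        observed += action
        largest_action = max(largest_action, abs(action))
    return observed, largest_action

# Same system, same drift, sampled every 10 seconds versus every second.
_, coarse_action = run_control_loop(100.0, 100.0, drift_per_s=2.0,
                                    sample_interval_s=10, duration_s=60)
_, fine_action = run_control_loop(100.0, 100.0, drift_per_s=2.0,
                                  sample_interval_s=1, duration_s=60)
print(coarse_action, fine_action)
```

Under these assumptions, sampling ten times as often makes each correction ten times smaller, because less drift accumulates between samples.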

Here’s an example: when I started driving, I’d swerve, making big corrections with the steering wheel.

Now I don’t need to do that, because I keep making little adjustments without even paying much attention.

At Shoreline, we keep running these control loops.

Every second, we scrape thousands of metrics, compare them against thousands of alarm conditions, and take little control actions to make things a little bit better.
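One pass of such a loop might look like the sketch below. It is not Shoreline's implementation or API: the `Alarm` class, the metric names, the thresholds, and the action names are all hypothetical, and a real system would run the selected actions rather than just list them.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Alarm:
    metric: str                     # name of the metric this alarm watches
    fires: Callable[[float], bool]  # condition evaluated against the latest value
    action: str                     # name of the small control action to take

def evaluate(metrics, alarms):
    """Return the actions for every alarm whose condition holds on this scrape."""
    return [
        a.action
        for a in alarms
        if a.metric in metrics and a.fires(metrics[a.metric])
    ]

# Hypothetical alarms and a single made-up scrape.
alarms = [
    Alarm("disk_used_pct", lambda v: v > 90, "rotate_logs"),
    Alarm("cpu_pct", lambda v: v > 85, "restart_worker"),
]
scrape = {"disk_used_pct": 92.0, "cpu_pct": 40.0}
actions = evaluate(scrape, alarms)
print(actions)
```

Running this evaluation on every scrape keeps each action small, which matches the point about sampling frequency above.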

This helps keep the systems our customers manage using Shoreline running smoothly.

That’s how we reduce downtime: by fixing incidents before they happen.
