
How to Fix an Incident Before It Happens

It requires predictive maintenance, including monitoring brownout and performing control actions
2 min
Summary

The best way to avoid downtime from an incident is to fix it before it happens.

That requires predictive maintenance. Here are 2 approaches that work for me:

1. Monitoring brownout

Things usually brown out before they black out, so that's when you want to see what's going on.

At AWS, my engineers used to count device errors on all our devices and then:
- correlate the number in a given period with a subsequent device failure, or
- see when device latency started to exceed some normal range.

Then we’d proactively shift away from those bad resources during a maintenance window rather than having them fail under load.
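A minimal sketch of that kind of brownout check, in Python. This is an illustration only, not the tooling used at AWS or Shoreline: the function name, the error limit, and the "mean plus three standard deviations" definition of the normal latency range are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_browning_out(baseline_latencies_ms, recent_latencies_ms,
                    recent_error_count, max_errors=5, sigmas=3.0):
    """Flag a device that looks degraded but has not failed yet.

    Hypothetical helper: the thresholds are illustrative assumptions.
    baseline_latencies_ms needs at least two samples for stdev().
    """
    # Assumed "normal range": baseline mean plus a few standard deviations.
    upper_bound = mean(baseline_latencies_ms) + sigmas * stdev(baseline_latencies_ms)

    latency_too_high = mean(recent_latencies_ms) > upper_bound
    too_many_errors = recent_error_count > max_errors

    # Either signal is enough to schedule the device for maintenance.
    return latency_too_high or too_many_errors


baseline = [10, 11, 9, 10, 12, 10, 11, 9]
healthy = is_browning_out(baseline, [10, 11, 10], recent_error_count=0)
degraded = is_browning_out(baseline, [18, 20, 19], recent_error_count=0)
```

A device flagged this way would be a candidate to drain during the next maintenance window. In practice the limits would come from correlating past error counts with actual failures, as described above.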

We now use this process at Shoreline.

2. Performing control actions

Here, we adapt feedback control theory from industrial systems.

The basic concept is that you have the desired state, the observed state, and the error between the two.

Your goal is to create a control loop whose actions reduce the gap between the observed state and the desired state.

The more frequently you sample, the smaller the control action needs to be.

Let’s understand this with an example: when I started driving, I’d swerve the steering wheel.

Now I don’t need to, because I keep making little adjustments without even paying much attention.
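The desired state, observed state, and error can be sketched as a simple proportional control loop. This is a generic textbook-style illustration, not Shoreline's implementation; the function names and the gain value are assumptions.

```python
def control_step(desired, observed, gain=0.5):
    """Return a correction proportional to the current error."""
    error = desired - observed  # gap between desired and observed state
    return gain * error         # small nudge toward the desired state


def run_loop(desired, observed, steps, gain=0.5):
    """Apply the correction repeatedly and record the observed state."""
    history = []
    for _ in range(steps):
        observed += control_step(desired, observed, gain)
        history.append(observed)
    return history


# Start at 60 with a target of 100; each step closes half the remaining gap.
history = run_loop(desired=100.0, observed=60.0, steps=3)
```

Each pass makes a modest correction instead of one large swerve, and the remaining error shrinks on every iteration.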

At Shoreline, we keep running these control loops.

Every second, we scrape thousands of metrics, compare them against thousands of alarm conditions, and take little control actions to make things a little bit better.
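One pass of such a loop might look like the sketch below. It is not Shoreline's API: the `Alarm` class, the metric names, the thresholds, and the actions are all hypothetical, chosen only to show the compare-then-act pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alarm:
    """Hypothetical alarm: fires when a metric exceeds its threshold."""
    metric: str
    threshold: float
    action: Callable[[str, float], str]


def evaluate(metrics: Dict[str, float], alarms: List[Alarm]) -> List[str]:
    """Compare scraped metrics against alarm conditions; run actions that fire."""
    actions_taken = []
    for alarm in alarms:
        value = metrics.get(alarm.metric)
        # Skip metrics that were not scraped in this pass.
        if value is not None and value > alarm.threshold:
            actions_taken.append(alarm.action(alarm.metric, value))
    return actions_taken


# Illustrative alarms and a single scrape; a real loop would repeat this on a timer.
alarms = [
    Alarm("disk_used_pct", 90.0, lambda m, v: f"rotate logs ({m}={v})"),
    Alarm("gc_pause_ms", 500.0, lambda m, v: f"restart process ({m}={v})"),
]
scraped = {"disk_used_pct": 93.5, "gc_pause_ms": 120.0}
taken = evaluate(scraped, alarms)
```

Only the disk alarm fires for this scrape, so only its small corrective action runs.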

This helps keep the systems our customers manage using Shoreline running smoothly.

That’s how we reduce downtime: by fixing incidents before they happen.

