
How to Reduce Alarm Noise

Summary

We saw 2,100 incidents for a beta customer with our new, free tool, Incident Insights.

Interestingly, he had expected this number to be 100.

The gap came from the noisy alarms his team was dealing with.

In fact, in any company, 50-80% of the alarms are noisy.

That’s why your employees have alert fatigue and get trained to snooze these alarms – which isn’t always the right thing to do.

Wouldn't it be better if you could easily see which are your top issues each week, and which alarms might be set incorrectly?

If an alarm's mean time to resolution is really short, it usually means someone just clicked OK to close the alert.

You probably need to get rid of that alarm or change its threshold.
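
As a rough illustration of that check, here is a minimal Python sketch that groups incidents by alarm and flags the ones whose mean time to resolution falls under a cutoff. The incident records, the alarm names, the `flag_quick_closes` function, and the 60-second cutoff are all assumptions made up for this example; this is not how Incident Insights is implemented.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (alarm name, opened at, resolved at).
INCIDENTS = [
    ("disk_usage_high", datetime(2024, 1, 1, 9, 0, 0), datetime(2024, 1, 1, 9, 0, 20)),
    ("disk_usage_high", datetime(2024, 1, 2, 9, 0, 0), datetime(2024, 1, 2, 9, 0, 40)),
    ("cpu_high", datetime(2024, 1, 1, 10, 0, 0), datetime(2024, 1, 1, 10, 30, 0)),
    ("cpu_high", datetime(2024, 1, 3, 10, 0, 0), datetime(2024, 1, 3, 10, 50, 0)),
]


def flag_quick_closes(incidents, threshold_seconds=60.0):
    """Return {alarm: mean seconds to resolution} for alarms under the cutoff.

    The 60-second default is an arbitrary assumption; tune it to your team.
    """
    durations = defaultdict(list)
    for alarm, opened, resolved in incidents:
        durations[alarm].append((resolved - opened).total_seconds())

    flagged = {}
    for alarm, seconds in durations.items():
        mttr = mean(seconds)
        if mttr < threshold_seconds:
            # Closed almost immediately: likely acknowledged, not investigated.
            flagged[alarm] = mttr
    return flagged


if __name__ == "__main__":
    for alarm, mttr in flag_quick_closes(INCIDENTS).items():
        print(f"{alarm}: mean time to resolution {mttr:.0f}s, review this alarm")
```

An alarm that shows up in this list is a candidate for removal or a new threshold, not automatically a bad alarm.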

Or you could use Shoreline as the recipient of your high-level alarms and convert them to precise ones.

For example, let’s say the CPU is high and…
- the input request rate is also high while the latency is not changing → This means the system is operating normally. We don’t need to involve a human.
- the JVM process is garbage collecting → Here, we should collect a heap dump and bounce the pod. We still do not need to wake anybody up.
- you have no idea why it’s happening → Now you need to wake someone up and give them a notebook that tells them all the diagnostics you've run in advance.
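
The three cases above can be sketched as a small decision function. This is plain Python written for illustration, not Shoreline's actual API or language; the `Signals` fields, the `Action` names, and `triage_high_cpu` are hypothetical, and in practice each signal would come from your own metrics and process state.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NO_ACTION = "system is operating normally, no human needed"
    HEAP_DUMP_AND_RESTART = "collect a heap dump and bounce the pod"
    PAGE_HUMAN = "wake someone up and hand over the diagnostics"


@dataclass
class Signals:
    """Hypothetical readings gathered when the high-CPU alarm fires."""
    request_rate_high: bool
    latency_changed: bool
    jvm_garbage_collecting: bool


def triage_high_cpu(signals: Signals) -> Action:
    """Turn a coarse 'CPU is high' alarm into one of three precise outcomes."""
    if signals.request_rate_high and not signals.latency_changed:
        # More load with steady latency: the system is just busy.
        return Action.NO_ACTION
    if signals.jvm_garbage_collecting:
        # Known cause with a known fix that can be automated.
        return Action.HEAP_DUMP_AND_RESTART
    # Unknown cause: this is the only case that should page a person.
    return Action.PAGE_HUMAN


if __name__ == "__main__":
    print(triage_high_cpu(Signals(True, False, False)).value)
    print(triage_high_cpu(Signals(False, False, True)).value)
    print(triage_high_cpu(Signals(False, True, False)).value)
```

The order of the checks is a design choice in this sketch: the normal-operation case is tested first, so a busy but healthy system never triggers a restart.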

In all cases, you should start with noise reduction by:
- changing alarm thresholds (can't be too low, can't be too high)
- getting rid of alarms that don't do anything for you
- making your alarms precise by combining metrics, logs, and system states to figure out what's going on
- automating away repetitive tasks

Back at AWS, I used to get a report every week from my team on all the incidents from the prior week to do exactly this work, but it took them a long time to generate.

So I'm really happy to introduce this new tool that gets you the same value for free and without any effort.
