How to Manage Your Operational Data Efficiently

"How long should we keep operational data?"
Summary

Customers often ask us this question, primarily because operational data is expensive to store. Let's look at a couple of common cases and see how to manage that data efficiently.

Case 1: An ongoing event

If an event is going on right now, you want real-time data, maybe at per-second granularity, so you can debug the live event without querying each box separately. Here's how most companies mishandle it:

Even though production ops is, at its core, a distributed system, they handle events by pulling all the data into one central system, which:

  • creates lag and inconsistency across their data silos,
  • prevents them from knowing what's going on right now, and
  • costs them a lot of money, because they end up storing a lot of unneeded metrics.

At Shoreline, we believe the ground truth lives in the boxes you manage. We treat the distributed system like a distributed system: we push the questions you ask out to the nodes and pull the answers back, giving you a per-second, real-time view of metrics, resources, and the output of Linux commands.
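The push-the-question pattern can be sketched as follows. This is an illustrative toy, not Shoreline's actual API: `query_node`, the node list, and the metric values are all hypothetical stand-ins. The point is that each node answers locally and only the answers travel over the wire.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node state: in a real system each node would run a
# command or read a local metric; here we simulate with static data.
NODE_METRICS = {
    "node-1": {"cpu": 0.42},
    "node-2": {"cpu": 0.91},
    "node-3": {"cpu": 0.37},
}

def query_node(node, metric):
    # The question is pushed to the node; only the answer comes back.
    return node, NODE_METRICS[node][metric]

def fan_out(metric):
    # Ask every node in parallel and collect the per-node answers.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda n: query_node(n, metric), NODE_METRICS))

print(fan_out("cpu"))
```

Because the raw data never leaves the nodes, there's no central silo to drift out of date, and the answer reflects what's happening on each box right now.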

Case 2: Operational reporting

If you want to do operational reporting over, say, the last month, the data doesn't need to be as fine-grained. You need accurate, high-fidelity information about the issues that occurred so you can track trends and anomalies, but you don't care about the rest of the data. Here's how we deal with it at Shoreline. (This is going to get a bit technical…so buckle up!)

We transform the raw data into the time-frequency domain using wavelets, the same family of transform-coding techniques behind formats like JPEG 2000. It gives us great compression, about 40x, if you can believe it, which enables:

  • high-resolution, per-second data, and
  • trend analysis over time, because you can match the curve's shape against history to see whether a similar event has occurred before.

All that geeking out aside, the basic point is that you need:

  • live high-resolution data, and
  • a cost-effective way to retain it for a long time.

We don't believe people should store operational data for a long time, because we don't think they'll look at it, but we make it efficient for those who do. One hundred metrics sampled once per second cost us about $0.25/host/year. That's so inexpensive we don't even bother charging for it right now.
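A back-of-the-envelope check shows why the figure is plausible. The inputs below are our own assumptions for illustration (8 bytes per raw sample, the 40x compression mentioned above, and roughly $0.023/GB-month object-storage pricing), not Shoreline's published cost model.

```python
# Rough yearly storage cost for 100 metrics sampled once per second.
metrics = 100
samples_per_year = metrics * 60 * 60 * 24 * 365   # one sample/sec each
raw_gb = samples_per_year * 8 / 1e9               # ~25 GB raw per host
compressed_gb = raw_gb / 40                       # ~0.63 GB after wavelets
yearly_cost = compressed_gb * 0.023 * 12          # $/host/year
print(f"raw: {raw_gb:.1f} GB, compressed: {compressed_gb:.2f} GB, "
      f"cost: ${yearly_cost:.2f}/host/year")
```

Under these assumptions the storage bill lands under $0.25/host/year, the same order of magnitude as the figure above; the compression is what makes long retention cheap.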
