
How to Manage Your Operational Data Efficiently

"How long should we keep operational data?"
Summary

Customers often ask us this question, primarily because operational data is expensive to store. Let's look at a couple of common cases and see how to manage it efficiently.

Case 1: An ongoing event

If an event is going on right now, you want real-time data, perhaps down to per-second granularity, so you can debug it live without having to query each box separately. Here's how most companies mishandle it:

Even though production ops at its core is a distributed system, they handle the events by pulling all the data into one system, which:

  • creates lag and inconsistency across your data silos,
  • prevents them from knowing what's going on right now, and
  • costs them a lot of money because they end up storing a lot of unneeded metrics.

At Shoreline, we believe that the ground truth is in the boxes you manage. We treat the distributed system like a distributed system: we push the questions you ask out to the nodes and pull the answers back, which gives you a real-time, per-second view of metrics, resources, and the output of Linux commands.
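To make the "push the question out, pull the answer back" idea concrete, here is a minimal sketch in Python. It is not Shoreline's implementation: the `HOSTS` table stands in for data that would live on each box, and `ask_host` and `fan_out` are hypothetical names invented for this illustration. The point is only that each host evaluates the question locally and returns a small answer instead of shipping all of its raw metrics to one central store.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-host state. In a real fleet this data
# would live on each box and never be copied to a central system.
HOSTS = {
    "host-a": {"cpu_pct": 91.0, "disk_pct": 40.0},
    "host-b": {"cpu_pct": 12.5, "disk_pct": 88.0},
    "host-c": {"cpu_pct": 95.5, "disk_pct": 35.0},
}


def ask_host(host, metric, threshold):
    """Simulates work done on the host: evaluate the question against
    local data and return only the answer (or None if it doesn't match)."""
    value = HOSTS[host][metric]
    return host, (value if value > threshold else None)


def fan_out(metric, threshold):
    """Push the question to every host in parallel, then gather the
    (small) answers back into one view."""
    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        results = pool.map(lambda h: ask_host(h, metric, threshold), HOSTS)
        return {host: value for host, value in results if value is not None}


hot = fan_out("cpu_pct", 90.0)
print(hot)
```

Only the hosts that match the question send anything back, so the amount of data moved scales with the size of the answer rather than with the number of metrics collected.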

Case 2: Operational reporting

If you want to do operational reporting over, say, the last month, the data doesn't need to be as fine-grained. You need accurate, high-fidelity information about the issues that occurred so you can track trends and anomalies, but you don't care about the rest of the data. This is how we deal with it at Shoreline. (This is going to get a bit technical, so buckle up!)

We transform the raw data into the time-frequency domain using wavelets, a transform also used in compression standards such as JPEG 2000. It gives us great compression, about 40x if you can believe it, which enables:

  • high-resolution, per-second data, and
  • trend analysis over time, because you can compare the shape of a curve against history to see whether a similar event occurred in the past.
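For readers who want to see why a wavelet transform compresses metrics well, here is a minimal sketch using the Haar wavelet, the simplest member of the family. This is an illustration under stated assumptions, not Shoreline's actual pipeline: the choice of Haar, the "keep the largest coefficients" rule, and the function names are all made up for this example, and it makes no attempt to reproduce the 40x figure.

```python
import math

SQRT2 = math.sqrt(2)


def haar_forward(signal):
    """Full Haar decomposition. len(signal) must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:n] = avg + diff  # averages first, then details
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [(a + d) / SQRT2, (a - d) / SQRT2]
        out[:2 * n] = merged
        n *= 2
    return out


def compress(signal, keep):
    """Keep only the `keep` largest-magnitude coefficients (illustrative rule)."""
    coeffs = haar_forward(signal)
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return len(coeffs), {i: coeffs[i] for i in ranked[:keep]}


def decompress(length, sparse):
    """Rebuild the signal, treating every dropped coefficient as zero."""
    coeffs = [0.0] * length
    for i, value in sparse.items():
        coeffs[i] = value
    return haar_inverse(coeffs)


# 64 one-second samples: a flat metric with an 8-second spike.
signal = [20.0] * 32 + [90.0] * 8 + [20.0] * 24
length, sparse = compress(signal, keep=4)
rebuilt = decompress(length, sparse)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
```

In this toy signal, four stored coefficients are enough to rebuild all 64 per-second samples, spike included, because a flat metric contributes almost nothing in the wavelet domain. Real metrics are noisier, so the trade-off between retained coefficients and reconstruction error would need to be tuned.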

All that geeking out aside, the basic point is that you need:

  • live high-resolution data, and
  • a cost-effective way to retain it for a long time.

We don't think most people need to store operational data for a long time, because we doubt they will ever look at it. But we make it efficient for them to do so. Storing 100 metrics sampled every second costs us about $0.25 per host per year. That's so inexpensive we don't even bother charging for it right now.
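To get a feel for the volumes behind that number, here is a back-of-the-envelope sketch. The 100 metrics per second and the 40x compression ratio come from the text above. The 8 bytes per raw sample is an assumption made only for this illustration, and `storage_estimate` is a hypothetical helper, not a description of how Shoreline stores or prices data.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def storage_estimate(metrics_per_host=100, bytes_per_sample=8, compression_ratio=40):
    """Rough per-host, per-year volume for metrics sampled once per second.

    bytes_per_sample=8 is an assumption for illustration only.
    """
    samples = metrics_per_host * SECONDS_PER_YEAR
    raw_gb = samples * bytes_per_sample / 1e9  # decimal gigabytes
    return samples, raw_gb, raw_gb / compression_ratio


samples, raw_gb, compressed_gb = storage_estimate()
print(f"{samples:,} samples/host/year")
print(f"{raw_gb:.1f} GB raw, {compressed_gb:.2f} GB after compression")
```

Under these assumptions, a host produces about 3.15 billion samples a year, roughly 25.2 GB raw, which shrinks to about 0.63 GB at 40x.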
