
How to Solve the Challenges of MELT Data at Scale

The bigger the data set, the slower it is to analyze. For MELT, you need to be able to execute a query at scale across your fleet and see what's going on in the live environment. That’s why, at Shoreline, we favor modeling the distributed system as a distributed system.
4 min
Summary

Recently, I read a paper by Slack on managing MELT challenges at scale.

(MELT stands for four data types: metrics, events, logs, and traces. The paper is definitely worth the read.)

I feel like their approach – which combines Prometheus, Kafka, Secor, S3, Spark, Elasticsearch, and Presto – is too complicated.

Because:
- it's super expensive
- there are a lot of parts that can break down
- you have to keep a lot of things in mind just to run a query

The core issue is that they're converting a distributed systems problem into a centralized problem by pushing data to a central location.

This approach is fundamentally broken because it requires storing terabytes of data and pushing lots of traffic over a network when you don’t even need it 99.99% of the time.

And the bigger the data set, the slower it is to analyze.

So when you need it, it's slow.

Further, it's fragile because the time you need to observe your system is exactly the time when something's gone wrong.

In a network event, for example, this is often when you’ve stopped getting telemetry.

Finally, you have to be a wizard to predict what you'll care about in the future, because when a new kind of event occurs, you won't have a dashboard or log handy for it.

For example, at AWS, one of the large-scale events I ran into was due to a BIOS upgrade by EC2.

But there was no way I'd have been logging or emitting metrics for the BIOS version on each host.

For such things, you need to be able to execute a query at scale across your fleet and see what's going on in the live environment.

That’s why, at Shoreline, we favor modeling the distributed system as a distributed system.

We keep the data locally at the edge and process it locally.

We invest in sophisticated data query processing to execute commands in a parallel distributed manner across data, in real-time, with fault tolerance.
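A fleet-wide live query like the BIOS example above can be sketched as a parallel scatter-gather with per-node fault tolerance: fan the query out to every node, gather whatever arrives before the deadline, and report the stragglers instead of failing the whole query. This is a minimal illustration in Python, not Shoreline's implementation; `query_fn`, the node names, and the fake BIOS versions are hypothetical stand-ins for an agent RPC.

```python
import concurrent.futures

def query_fleet(nodes, query_fn, timeout_s=2.0):
    """Run query_fn against every node in parallel.

    Fault tolerant: slow or unreachable nodes don't sink the query;
    we return the partial results we got, plus the nodes that failed.
    """
    results, failed = {}, []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max(len(nodes), 1)) as pool:
        futures = {pool.submit(query_fn, n): n for n in nodes}
        done, pending = concurrent.futures.wait(futures, timeout=timeout_s)
        for fut in done:
            node = futures[fut]
            try:
                results[node] = fut.result()
            except Exception:
                failed.append(node)          # node answered with an error
        for fut in pending:
            failed.append(futures[fut])      # node never answered in time
    return results, failed

# Hypothetical stand-in for asking one node's agent a live question,
# e.g. "which BIOS version are you running?"
def bios_version(node):
    if node == "node-2":
        raise RuntimeError("agent unreachable")  # simulate a dead node
    return "1.4.0"

results, failed = query_fleet(["node-0", "node-1", "node-2"], bios_version)
```

Because the query runs where the data lives, each node does a tiny amount of work and the results stream back in parallel, rather than being pre-shipped to a central store on the off chance someone asks.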

So we have an agent that collects data, analyzes it, and takes action when necessary.

Here are its advantages:
- It scales with your fleet by using a tiny bit of resources on each node; as you add nodes, capacity grows automatically.
- There's no network latency because you don't have to push data to some central location.
- Since it's running at the edge, the mean time to diagnose and repair for automated actions can be reduced to seconds.
- It's intrinsically fault-tolerant because the edge node can take its local actions autonomously.
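The collect-analyze-act loop described above can be sketched as follows. The metric source, the threshold, and the repair action are hypothetical placeholders, not Shoreline's API; the point is that detection and remediation both happen on the node itself, with no round trip to a central system.

```python
class EdgeAgent:
    """Minimal sketch of a local collect/analyze/act loop."""

    def __init__(self, collect, threshold, repair):
        self.collect = collect      # callable returning the current metric value
        self.threshold = threshold  # value above which we remediate
        self.repair = repair        # local remediation action
        self.actions_taken = 0

    def tick(self):
        value = self.collect()       # collect locally, at the edge
        if value > self.threshold:   # analyze locally
            self.repair(value)       # act locally and autonomously
            self.actions_taken += 1
        return value

# Hypothetical usage: pretend disk usage spikes past 90% on the second reading.
readings = iter([42.0, 95.5, 60.0])
agent = EdgeAgent(collect=lambda: next(readings),
                  threshold=90.0,
                  repair=lambda v: None)  # placeholder for e.g. rotating logs
for _ in range(3):
    agent.tick()
```

Since the loop never leaves the node, it keeps working during exactly the kinds of network events that cut off centralized telemetry.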

So, whether you use Shoreline or not, you should consider building these systems in a distributed, fault-tolerant manner.
