
How to Solve the Challenges of MELT Data at Scale

The bigger the data set, the slower it is to analyze. For MELT, you need to be able to execute a query at scale across your fleet and see what's going on in the live environment. That’s why, at Shoreline, we favor modeling the distributed system as a distributed system.
4 min
Summary

Recently, I read a paper by Slack on managing MELT challenges at scale.

(MELT stands for four data types: metrics, events, logs, and traces. The paper is definitely worth a read.)

I feel like their approach – which combines Prometheus, Kafka, Secor, S3, Spark, Elasticsearch, and Presto – is too complicated.

Because:
- it's super expensive
- there are a lot of parts that can break down
- you have to hold a lot of context in your head just to run a query

The core issue is that they're converting a distributed systems problem into a centralized problem by pushing data to a central location.

This approach is fundamentally broken because it requires storing terabytes of data and pushing lots of traffic over a network when you don’t even need it 99.99% of the time.

And the bigger the data set, the slower it is to analyze.

So when you need it, it's slow.

Further, it's fragile because the time you need to observe your system is exactly the time when something's gone wrong.

In a network event, for example, this is often when you’ve stopped getting telemetry.

Finally, you have to be a wizard to predict what you'll care about in the future, because when a new kind of event occurs, you won't have a dashboard or log handy.

For example, at AWS, one of the large-scale events I ran into was caused by a BIOS upgrade rolled out by EC2.

But there was no way I'd have been logging or emitting a metric for which BIOS version each host was running.

For such things, you need to be able to execute a query at scale across your fleet and see what's going on in the live environment.
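To make that concrete, here is a minimal Python sketch of what an ad-hoc fleet-wide query might look like. The hostnames and the `get_bios_version` helper are hypothetical stand-ins; in a real fleet the helper would run a command such as `dmidecode -s bios-version` on each host over SSH or an agent, not return canned values.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fleet; in practice these would come from your inventory.
FLEET = ["node-1", "node-2", "node-3"]

def get_bios_version(host: str) -> str:
    # Stand-in for running a command on the host (e.g. over SSH);
    # canned values keep the sketch self-contained and runnable.
    return {"node-1": "1.2", "node-2": "1.2", "node-3": "1.3"}[host]

def fleet_query(fn, hosts):
    # Fan the query out to all hosts in parallel and gather the answers.
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(zip(hosts, pool.map(fn, hosts)))

print(fleet_query(get_bios_version, FLEET))
# → {'node-1': '1.2', 'node-2': '1.2', 'node-3': '1.3'}
```

The point is that the question is asked live, across the fleet, at the moment you need the answer – no pre-built dashboard or pre-collected metric required.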

That’s why, at Shoreline, we favor modeling the distributed system as a distributed system.

We keep the data locally at the edge and process it locally.

We invest in sophisticated query processing to execute commands across the fleet in a parallel, distributed manner, in real time, with fault tolerance.
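The fault-tolerance part can be sketched in a few lines of Python. This is an illustrative toy, not Shoreline's implementation: `fleet_run` and `probe` are made-up names, and a real system would dispatch commands over an agent protocol rather than local threads (and would cancel stragglers rather than let them linger).

```python
from concurrent.futures import ThreadPoolExecutor, wait

def fleet_run(fn, hosts, timeout=2.0):
    # Fan fn out to every host in parallel. Nodes that raise or miss
    # the deadline are reported separately, so one bad node doesn't
    # fail the whole query -- you still get partial results.
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = {pool.submit(fn, h): h for h in hosts}
        done, not_done = wait(futures, timeout=timeout)
        for fut in done:
            try:
                results[futures[fut]] = fut.result()
            except Exception:
                failed.append(futures[fut])
        failed.extend(futures[f] for f in not_done)
    return results, failed

def probe(host):
    # Stand-in for a per-node command; simulate one unreachable node.
    if host == "node-2":
        raise ConnectionError("unreachable")
    return "ok"

results, failed = fleet_run(probe, ["node-1", "node-2", "node-3"])
print(results, failed)
```

Returning partial results plus a list of failed nodes matters precisely because, as noted above, the moment you query is often the moment some nodes have stopped responding.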

So we have an agent that collects data, analyzes it, and takes action when necessary.

Here are its advantages:
- It scales with your fleet: it uses a tiny bit of resources on each node, so capacity grows automatically as you add nodes.
- There's minimal network latency because you don't have to push data to a central location.
- Since it runs at the edge, the mean time to diagnose and repair for automated actions can be reduced to seconds.
- It's intrinsically fault-tolerant because each edge node can take local actions autonomously.
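The collect/analyze/act loop can be sketched as a toy Python class. The `EdgeAgent` name, the five-sample window, and the threshold are illustrative assumptions, not Shoreline's actual agent.

```python
class EdgeAgent:
    """Toy edge agent: collect a metric locally, analyze it locally,
    and act locally when a threshold is crossed -- no central store."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.samples: list[float] = []
        self.actions: list[str] = []

    def collect(self, value: float) -> None:
        # In practice: read a local metric (CPU, disk, error rate, ...).
        self.samples.append(value)

    def analyze_and_act(self) -> float:
        # Average the last five samples; remediate if it's too high.
        window = self.samples[-5:]
        avg = sum(window) / len(window)
        if avg > self.threshold:
            self.actions.append(f"remediate: avg={avg:.1f}")
        return avg

agent = EdgeAgent(threshold=90.0)
for cpu in [85, 92, 97, 95, 96]:
    agent.collect(cpu)
agent.analyze_and_act()
print(agent.actions)  # → ['remediate: avg=93.0']
```

Because the decision is made on the node itself, the remediation fires even if the network path to a central observability stack is down.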

So, whether you use Shoreline or not, consider building these systems in a distributed, fault-tolerant manner.

