A Guide to Building Reliable Systems

When designing reliable systems, you need to look at correlated events and their downstream impacts, the time it takes to repair them, and the breadth of the system they apply to.
Summary

Let’s talk about Amazon S3’s 11 nines claim.

Amazon is certainly one of the best, most impressive organizations that I know of when it comes to this stuff.

But 11 nines implies that if you save a photo, it will stay there for ~100 billion years.

Yet the sun is going to envelop the Earth in about 5-7 billion years, and I don’t think US-East-2 will survive that event!
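
Here is a rough sketch of that arithmetic (my own back-of-envelope, not Amazon’s math): treat 11 nines as an annual durability of 99.999999999%, so the chance of losing a given object is about 10⁻¹¹ per year, and the expected time to loss is on the order of 10¹¹ years.

```python
# Back-of-envelope: what 11 nines of annual durability implies for a single object.
annual_durability = 0.99999999999                # 11 nines
annual_loss_probability = 1 - annual_durability  # ~1e-11 per object per year

# Expected years until that one object is lost.
expected_years_to_loss = 1 / annual_loss_probability
print(f"~{expected_years_to_loss:.2e} years")    # ~1e11, i.e. ~100 billion years
```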

S3 claims to achieve 11 nines by making 33 copies of the data, storing 11 copies in each of three data centers.

This allows it to tolerate 11 individual disk drive failures.

However, this approach doesn't account for correlated failures, such as a data center going down due to a fire or natural disaster, which would result in the loss of multiple copies of data.
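
A toy comparison makes the point; the failure rates below are illustrative assumptions, not S3’s real numbers. Independent disk failures multiply down to an astronomically small probability, while a single site-level event wipes out a whole block of copies at once.

```python
# Toy comparison of independent vs. correlated loss (all rates are made-up placeholders).
p_disk_fails_in_year = 0.02   # assumed annual failure probability of one disk
copies_per_site = 11
sites = 3

# Losing every copy through independent disk failures alone:
p_all_independent = p_disk_fails_in_year ** (copies_per_site * sites)
print(f"all 33 disks fail independently: {p_all_independent:.1e}")   # vanishingly small

# A single correlated event (fire, flood) takes out one site's 11 copies at once,
# leaving the data on the remaining sites for however long repair takes.
p_site_event_in_year = 1e-3   # assumed annual probability of losing a data center
print(f"one site-level event:            {p_site_event_in_year:.1e}")
```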

When designing Aurora, we stored two copies of the data in each of three data centers, six in total (even though most other systems kept one copy in each center). That’s because I was looking at my largest likely correlated failure, which would be a data center going down.

So when that happens, every single one of my databases takes two failures for the duration of that outage.

Now, some subset of them is going to suffer another failure somewhere else in the time it takes me to repair those first two.

So with one copy per data center, if two out of three go down, I’m left with only one copy. And I can’t trust whether that copy is up to date or not, which means my database is corrupt.

But if I'm doing four out of six and I get down to three out of six, I can still read it and do the repair.
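
As a minimal sketch of that quorum math, assuming six copies (two in each of three data centers), a 4-of-6 write quorum, and a 3-of-6 read quorum as described above:

```python
# Six copies of each segment: two in each of three data centers.
copies_per_center = {"dc1": 2, "dc2": 2, "dc3": 2}
WRITE_QUORUM = 4   # 4 of 6 copies needed to accept a write
READ_QUORUM = 3    # 3 of 6 copies needed to read and rebuild a lost copy

def copies_alive(lost_centers=(), extra_losses=0):
    """Copies still reachable after losing whole data centers plus individual copies."""
    alive = sum(n for dc, n in copies_per_center.items() if dc not in lost_centers)
    return alive - extra_losses

# One data center goes down: 4 copies remain, so writes and reads both continue.
print(copies_alive(lost_centers=("dc1",)) >= WRITE_QUORUM)                  # True

# The center is still down and one more copy fails before repair finishes:
# 3 copies remain -- writes stall, but the read quorum holds, so repair is possible.
print(copies_alive(lost_centers=("dc1",), extra_losses=1) >= WRITE_QUORUM)  # False
print(copies_alive(lost_centers=("dc1",), extra_losses=1) >= READ_QUORUM)   # True
```

With one copy per data center, the same two-failure scenario leaves a single copy of unknown freshness; with two per center, losing a whole center still leaves a full write quorum.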

So when designing your systems, you need to:
- identify the largest probable correlated event,
- combine it with the independent events that could already be going on,
- multiply that by the number of such things going on in your environment, and then
- divide it by the duration over which that's going to happen.

For example, if it takes me 10 seconds to repair a segment in Aurora, I’m basically looking at a 10-second window in which an independent failure can stack on top of the correlated one.
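
One way to put rough numbers on that; the fleet size and failure rate below are assumptions for illustration, and only the 10-second repair time comes from the example above.

```python
# Rough exposure estimate: how often an independent failure lands inside the
# repair window opened by a correlated failure (rates are illustrative assumptions).
segments = 100_000            # assumed number of segments affected by the correlated event
annual_failure_rate = 0.02    # assumed annual failure rate of any one surviving copy
repair_seconds = 10           # repair time per segment, from the example above

seconds_per_year = 365 * 24 * 3600
# Chance that a given surviving copy fails during one 10-second repair window.
p_second_failure_in_window = annual_failure_rate * repair_seconds / seconds_per_year

# Expected number of segments that pick up an additional, independent failure
# while the correlated failure is still being repaired.
expected_double_failures = segments * p_second_failure_in_window
print(f"{expected_double_failures:.4f} segments hit twice per repair window")
# Shrinking repair_seconds (or the per-copy failure rate) drives this number down.
```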

You want to bring that number down as far as you can in an economically reasonable way.

For us, that ended up being four out of six.

For you, it might be a different number.

To find that, the factors you need to look at are:
- your correlated events
- their downstream impacts
- the time it takes to repair them
- the breadth of the system they apply to
