A Guide to Building Reliable Systems

When designing reliable systems, you need to look at correlated events and their downstream impacts, the time it takes to repair them, and the breadth of the system they apply to.
Summary

Let’s talk about Amazon S3’s 11 nines claim.

Amazon is certainly one of the best, most impressive organizations that I know of when it comes to this stuff.

But 11 nines implies that if you save a photo, it will stay there for ~100 billion years.

The sun is going to envelop the Earth in about 5-7 billion years, and I don’t think US-East-2 will survive that event!
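The ~100 billion year figure can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes "11 nines" means an annual per-object durability of 99.999999999% and that losses in different years are independent, so the expected time to loss is 1 divided by the annual loss probability. The function name is made up for illustration.

```python
def expected_years_to_loss(nines: int) -> float:
    """Expected years until an object is lost.

    Assumption: durability of `nines` nines is an annual, per-object figure,
    so the annual loss probability is 10**-nines and the expected wait
    (mean of a geometric distribution) is its reciprocal, 10**nines.
    """
    return 10.0 ** nines


years = expected_years_to_loss(11)
sun_engulfs_earth_years = 5e9  # lower end of the 5-7 billion year range above

print(f"Expected years to loss: {years:.1e}")
print(f"Multiples of the Earth's remaining lifetime: {years / sun_engulfs_earth_years:.0f}")
```

Under these assumptions the headline number outlives the planet by a wide margin, which is the point: the figure only describes independent failures.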

S3 claims to achieve 11 nines by making 33 copies of data and storing 11 copies each across three data centers.

This allows for the tolerance of 11 individual disk drive failures.

However, this approach doesn't account for correlated failures, such as a data center going down due to a fire or natural disaster, which would result in the loss of multiple copies of data.

When designing Aurora, we stored two copies of data in each of the three data centers (even though most other systems kept one copy in each center). That's because I was looking at my largest likely correlated failure, which would be a data center going down.

So when that happens, every single one of my databases is going to get two failures for that duration of time.

Now, some subset of them is going to have another failure somewhere else while I'm repairing the first two failures.

So with one copy per data center, if two out of three go down, I’m left with only one copy. And I can't trust whether that copy is up to date or not, which means that my database is corrupt.

But if I'm doing four out of six and I get down to three out of six, I can still read it and do the repair.
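The comparison above can be sketched as a small check of how many copies remain after a data center outage plus one unrelated failure. This is an illustrative model, not Aurora's implementation: the `Layout` class and `survives` function are hypothetical, and the read thresholds (two of three, three of six) are taken from the description above of when the data can still be read and repaired.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    zones: int            # number of data centers
    copies_per_zone: int  # copies stored in each data center
    read_quorum: int      # copies needed to read the data and start a repair

    @property
    def total_copies(self) -> int:
        return self.zones * self.copies_per_zone


def survives(layout: Layout, zones_down: int, extra_failures: int) -> bool:
    """True if enough copies remain to read the data.

    Assumption: the extra (independent) failures hit copies in data centers
    that are still up, so they are not already counted in the outage.
    """
    lost = zones_down * layout.copies_per_zone + extra_failures
    return layout.total_copies - lost >= layout.read_quorum


two_of_three = Layout(zones=3, copies_per_zone=1, read_quorum=2)
four_of_six = Layout(zones=3, copies_per_zone=2, read_quorum=3)

# Largest likely correlated failure (one data center) plus one independent failure.
for name, layout in [("one copy per center", two_of_three),
                     ("two copies per center", four_of_six)]:
    ok = survives(layout, zones_down=1, extra_failures=1)
    print(f"{name}: {'still readable' if ok else 'not readable'}")
```

Both layouts tolerate the data center outage on its own; only the six-copy layout also absorbs the extra failure that arrives while the repair is in progress.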

So when designing your systems, you need to:
- identify the largest probable correlated event,
- associate it with the independent events that could already be going on,
- multiply that by the number of such things going on in your environment, and then
- divide it by the duration over which that's going to happen.

For example, if it takes me 10 seconds to repair a segment in Aurora, I'm basically looking for a 10-second period for the independent failures against the correlated failure.
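To see why the repair time matters so much, here is a rough sketch of the chance that an independent failure lands inside the repair window. It assumes failures of each copy follow an exponential distribution, and every number in it (the one-year mean time to failure, the one-hour comparison window, the fleet size) is made up for illustration rather than taken from Aurora.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def p_extra_failure(surviving_copies: int, repair_seconds: float,
                    mttf_seconds: float) -> float:
    """Probability that at least one surviving copy fails during the repair window.

    Assumption: each copy fails independently at a constant rate of
    1 / mttf_seconds, so the combined rate is surviving_copies / mttf_seconds.
    """
    combined_rate = surviving_copies / mttf_seconds
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-combined_rate * repair_seconds)


# Hypothetical inputs: 4 copies left after a data center outage,
# and each copy fails about once a year on average.
p_fast = p_extra_failure(4, repair_seconds=10, mttf_seconds=SECONDS_PER_YEAR)
p_slow = p_extra_failure(4, repair_seconds=3600, mttf_seconds=SECONDS_PER_YEAR)

segments = 1_000_000  # hypothetical number of segments affected by the outage
print(f"10 s repair:  p = {p_fast:.2e}, expected affected segments = {p_fast * segments:.1f}")
print(f"1 h repair:   p = {p_slow:.2e}, expected affected segments = {p_slow * segments:.1f}")
```

With these made-up inputs, shrinking the window from an hour to 10 seconds cuts the exposure by roughly the same factor as the window itself. Note that with four out of six, a single extra failure is still survivable, so this number overstates the risk of actual data loss.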

You want to bring that number down as far as you can in an economically reasonable way.

For us, that ended up being four out of six.

For you, it might be a different number.

To find that, the factors you need to look at are:
- your correlated events
- their downstream impacts
- the time it takes to repair them
- the breadth of the system they apply to
