
A Guide to Building Reliable Systems

When designing reliable systems, you need to look at correlated events and their downstream impacts, the time it takes to repair them, and how broadly they apply across your system.
Summary

Let’s talk about Amazon S3’s 11 nines claim.

Amazon is certainly one of the best, most impressive organizations that I know of when it comes to this stuff.

But 11 nines of durability implies that if you save a photo, you can expect it to stay there for ~100 billion years.

The sun is going to envelop the earth in about 5-7 billion years, though, and I don’t think US-East-2 will survive that event!
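
The ~100 billion year figure can be reproduced with a quick calculation. This is a minimal sketch, assuming that "11 nines" means an annual per-object durability of 99.999999999% and that loss in each year is independent, so the mean time to loss is one over the annual loss probability. The function name is hypothetical.

```python
def expected_years_to_loss(nines: int) -> float:
    """Mean years until a single object is lost.

    Assumes durability of `nines` nines per year and independent years,
    so the expected lifetime is 1 / (annual loss probability).
    """
    annual_loss_probability = 10.0 ** -nines
    return 1.0 / annual_loss_probability


years = expected_years_to_loss(11)
# Roughly 1e11 years, i.e. about 100 billion years.
print(f"{years:.0e} years")
```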

S3 claims to achieve 11 nines by making 33 copies of data and storing 11 copies in each of three data centers.

This tolerates 11 individual disk drive failures.

However, this approach doesn't account for correlated failures, such as a data center going down due to a fire or natural disaster, which would result in the loss of multiple copies of data.

When designing Aurora, we stored two copies of data in each of the three data centers (even though most other systems kept one copy in each center). That’s because I was looking at my largest likely correlated failure, which would be a data center going down.

So when that happens, every single one of my databases is going to have two failures for that duration of time.

Now, some subset of them is going to have another failure somewhere else while it takes me time to repair the first two failures.

So with one copy per data center, if two out of three go down, I’m left with only one copy. And I can't trust whether that copy is up to date or not, which means that my database is corrupt.

But if I'm doing four out of six and I get down to three out of six, I can still read it and do the repair.
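
The difference between the two layouts can be checked with a small model. This is a sketch under stated assumptions, not Aurora's implementation: the class and method names are hypothetical, the three-copy layout is assumed to need two copies for both reads and writes, and the six-copy layout is assumed to need four copies to write and three to read, as described above. Extra failures are assumed to hit copies in the surviving data centers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumLayout:
    """Hypothetical model of a replicated volume spread across data centers."""
    data_centers: int
    copies_per_center: int
    read_quorum: int
    write_quorum: int

    @property
    def total_copies(self) -> int:
        return self.data_centers * self.copies_per_center

    def surviving_copies(self, centers_down: int, extra_failures: int) -> int:
        # A data center outage takes out every copy it holds (correlated),
        # and extra_failures are independent losses elsewhere.
        lost = centers_down * self.copies_per_center + extra_failures
        return max(self.total_copies - lost, 0)

    def can_read(self, centers_down: int, extra_failures: int) -> bool:
        return self.surviving_copies(centers_down, extra_failures) >= self.read_quorum

    def can_write(self, centers_down: int, extra_failures: int) -> bool:
        return self.surviving_copies(centers_down, extra_failures) >= self.write_quorum


# Assumed quorum sizes: 2 of 3 for the one-copy-per-center layout,
# 4 of 6 to write and 3 of 6 to read for the two-copies-per-center layout.
one_per_center = QuorumLayout(data_centers=3, copies_per_center=1,
                              read_quorum=2, write_quorum=2)
two_per_center = QuorumLayout(data_centers=3, copies_per_center=2,
                              read_quorum=3, write_quorum=4)

# Scenario: one data center is down and one more copy fails during repair.
for name, layout in [("2 of 3", one_per_center), ("4 of 6", two_per_center)]:
    left = layout.surviving_copies(centers_down=1, extra_failures=1)
    print(name, "copies left:", left,
          "readable:", layout.can_read(1, 1),
          "writable:", layout.can_write(1, 1))
```

In this scenario the three-copy layout is left with one copy and cannot form a read quorum, while the six-copy layout is left with three copies: enough to read and repair, though not enough to accept writes until a copy is restored.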

So when designing your systems, you need to:
- identify the largest probable correlated event,
- combine it with the independent events that could already be going on,
- multiply that by the number of such things going on in your environment, and then
- evaluate it over the duration for which that exposure lasts.

For example, if it takes me 10 seconds to repair a segment in Aurora, I'm basically looking for a 10-second period for the independent failures against the correlated failure.
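
One way to put rough numbers on this is sketched below. It is illustrative only: the function names, the 2% annual failure rate, and the one million segment groups are hypothetical assumptions, not Aurora figures. Each copy is assumed to fail independently at a constant rate, and the sketch estimates how many groups would lose read quorum (two more failures among the four surviving copies) during the repair window that follows a data center outage.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def p_copy_fails(annual_failure_rate: float, window_seconds: float) -> float:
    """Probability that one copy fails within the repair window.

    Assumes a constant failure rate (exponential lifetime model).
    """
    rate_per_second = annual_failure_rate / SECONDS_PER_YEAR
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-rate_per_second * window_seconds)


def p_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent copies fail."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def expected_groups_at_risk(groups: int, remaining_copies: int,
                            failures_to_break: int,
                            annual_failure_rate: float,
                            window_seconds: float) -> float:
    """Expected number of groups losing quorum during the repair window."""
    p = p_copy_fails(annual_failure_rate, window_seconds)
    return groups * p_at_least(failures_to_break, remaining_copies, p)


# Hypothetical inputs: 1,000,000 segment groups, 4 copies left after the
# outage, 2 more failures needed to drop below a read quorum of 3.
for window in (10, 3600):
    at_risk = expected_groups_at_risk(groups=1_000_000, remaining_copies=4,
                                      failures_to_break=2,
                                      annual_failure_rate=0.02,
                                      window_seconds=window)
    print(f"repair window {window:>5} s -> expected groups at risk: {at_risk:.3g}")
```

Under these assumptions, shrinking the repair window from an hour to 10 seconds reduces the expected exposure by several orders of magnitude, which is why repair time appears alongside the correlated event in the list above.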

You want to bring that number down as far as you can in an economically reasonable way.

For us, that ended up being four out of six.

For you, it might be a different number.

To find that, the factors you need to look at are:
- your correlated events
- their downstream impacts
- the time it takes to repair them
- how broadly they apply across your system
