
How to Manage Failure without Wasting Resources

How can you better utilize the resources you keep aside for failover purposes? Here's how we put standby resources to work on tasks that can be paused when a failure happens, so they do useful background activity that can be deferred when things hit the fan.
Summary

How can you better utilize the resources you keep aside for failover purposes?

Here’s how I've approached this in the past:

When I was designing Amazon Aurora, I made the storage regional.

So we had 2 copies of data in each of the 3 availability zones.

That meant that as long as I could get another database instance up, I didn't have to constantly replicate data because it was happening behind the scenes in the storage layer.

But when a failure happens, I might not be able to get a second database instance, because everyone else is asking for one at the same time.

So we made another instance available ahead of time, but it acted as a read replica: we could divert read traffic to it, while read-write traffic stayed on the original instance.

This way, it wasn't just sitting idle but getting used for live customer requests and maybe letting you tick down the size of your instance.

That’s how we utilized resources that were kept just for failover purposes to do things that could be stopped for some time when a failure happens.
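
The read-replica idea above can be sketched in a few lines. This is a minimal illustration, not Aurora's actual implementation: the `Instance` and `Cluster` classes and the `route` and `promote_standby` methods are hypothetical names, and the sketch assumes storage is shared so that promotion needs no data copy.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Hypothetical model: one writer plus a standby that serves reads."""
    writer: Instance
    standby: Instance

    def promote_standby(self) -> None:
        # Storage is assumed to be shared, so promotion copies no data;
        # the standby simply starts taking read-write traffic.
        if not self.standby.healthy:
            raise RuntimeError("no healthy instance available")
        self.writer = self.standby

    def route(self, is_write: bool) -> str:
        # Fail over first if the current writer is down.
        if not self.writer.healthy:
            self.promote_standby()
        if is_write:
            return self.writer.name
        # In normal operation, reads go to the standby so it is not idle.
        target = self.standby if self.standby.healthy else self.writer
        return target.name


cluster = Cluster(Instance("primary"), Instance("standby"))
print(cluster.route(is_write=False))  # read served by the standby
cluster.writer.healthy = False        # simulate a failure of the primary
print(cluster.route(is_write=True))   # standby now takes the writes
```

The point of the design is that the standby earns its keep on read traffic during normal operation, and the only thing it gives up at failover is that extra read capacity.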

Another example of this is from a friend who used to run an entire second data center just in case the first one failed.

That’s super expensive, but now they do something brilliant with it.

They use those resources to run AI model training jobs on the systems at the second data center.

If a region goes down, they can stop running those training jobs for a period and run user traffic on those systems instead.

That's another way you can have resources doing useful background activity that can be deferred when things hit the fan.
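
Here is one way the deferrable-work pattern could look in code. It is a sketch under stated assumptions, not the setup described in the video: `StandbyPool` and its methods are hypothetical, jobs are plain strings, and "capacity" is just a count of job slots.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyPool:
    """Hypothetical standby capacity that runs deferrable work until failover."""
    capacity: int
    batch_jobs: list = field(default_factory=list)  # running background jobs
    deferred: list = field(default_factory=list)    # paused, to resume later
    serving_users: bool = False

    def submit_batch(self, job: str) -> bool:
        # Accept background work only while a slot is free and the pool
        # is not busy serving failover traffic.
        if self.serving_users or len(self.batch_jobs) >= self.capacity:
            return False
        self.batch_jobs.append(job)
        return True

    def fail_over(self) -> None:
        # Pause all background work and hand the capacity to user traffic.
        self.deferred.extend(self.batch_jobs)
        self.batch_jobs.clear()
        self.serving_users = True

    def fail_back(self) -> None:
        # The primary site has recovered: resume the deferred jobs.
        self.serving_users = False
        self.batch_jobs, self.deferred = self.deferred, []


pool = StandbyPool(capacity=2)
pool.submit_batch("train-a")
pool.submit_batch("train-b")
pool.fail_over()   # training pauses, user traffic takes over
pool.fail_back()   # training resumes where the queue left off
```

The work placed on standby capacity has to tolerate being paused for an unknown period, which is why training jobs are a good fit and latency-sensitive work is not.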
