
How to Manage Failure without Wasting Resources

How can you better utilize the resources you keep aside for failover? Here's how we put standby capacity to work on tasks that can be paused when a failure happens, so those resources do useful background work that can be deferred when things hit the fan.
Summary

How can you better utilize the resources you keep aside for failover purposes?

Here’s how I've approached this in the past:

When I was designing Amazon Aurora, I made the storage regional.

So we had 2 copies of data in each of the 3 availability zones.

That meant that as long as I could get another database instance up, I didn't have to constantly replicate data because it was happening behind the scenes in the storage layer.

But I might not be able to get a second database instance because everyone else is asking for one.

So we kept another instance available, and it acted as a read replica: we could divert read traffic to it, though not write traffic.

This way, it wasn't just sitting idle. It was serving live customer requests, and that might even let you scale down the size of your instance.

That’s how we utilized resources that were kept just for failover purposes to do things that could be stopped for some time when a failure happens.
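
The read-replica idea above can be illustrated with a minimal sketch. It is not Aurora's actual implementation; the `Instance` and `Cluster` classes and their methods are hypothetical names made up for this example, and shared storage is assumed rather than modeled. Reads go to the standby while everything is healthy, and on failure the standby is promoted to take the writes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    """Hypothetical database instance that records the queries it handled."""
    name: str
    handled: list = field(default_factory=list)

    def execute(self, query: str) -> str:
        self.handled.append(query)
        return f"{self.name} ran: {query}"


class Cluster:
    """Hypothetical router: writes go to the primary, reads to the standby."""

    def __init__(self, primary: Instance, standby: Optional[Instance]):
        self.primary = primary
        self.standby = standby  # becomes None once it has been promoted

    def route(self, query: str, is_write: bool) -> str:
        # Reads use the standby while one exists; everything else uses the primary.
        if not is_write and self.standby is not None:
            return self.standby.execute(query)
        return self.primary.execute(query)

    def fail_over(self) -> None:
        # Assumption: both instances see the same storage, so promotion is
        # only a role change here and no data copy is modeled.
        if self.standby is None:
            raise RuntimeError("no standby available to promote")
        self.primary, self.standby = self.standby, None


cluster = Cluster(Instance("writer"), Instance("replica"))
print(cluster.route("SELECT 1", is_write=False))   # served by the replica
print(cluster.route("INSERT ...", is_write=True))  # served by the writer
cluster.fail_over()
print(cluster.route("INSERT ...", is_write=True))  # served by the promoted replica
```

After promotion, the former replica handles both reads and writes until a new standby is added, which is the period during which the extra read capacity is given up.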

Another example comes from a friend who used to run an entire second data center just in case the first one failed.

That’s super expensive, but now they do something brilliant with it.

They use those resources to run AI model training jobs on the systems at the second data center.

If a region goes down, they can pause those training jobs for a period and run user traffic on that capacity instead.

That's another way you can have resources doing useful background activity that can be deferred when things hit the fan.
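
As a rough sketch of that pattern, the standby capacity can be modeled as a pool of slots that runs deferrable jobs until a failover preempts them. The `StandbyCapacity` class, its method names, and the job names are all hypothetical, and real schedulers would also need checkpointing so that paused training jobs can resume.

```python
from collections import deque


class StandbyCapacity:
    """Hypothetical pool of failover slots that runs deferrable jobs when idle."""

    def __init__(self, slots: int):
        self.slots = slots
        self.running = []         # background jobs currently using slots
        self.deferred = deque()   # jobs paused or waiting for a free slot
        self.serving_users = False

    def submit(self, job: str) -> None:
        # Run the job only if we are not in failover and a slot is free.
        if not self.serving_users and len(self.running) < self.slots:
            self.running.append(job)
        else:
            self.deferred.append(job)

    def fail_over(self) -> None:
        # Preempt every background job so all slots can take user traffic.
        # Preempted jobs go to the front of the queue, in their original order.
        self.serving_users = True
        self.deferred.extendleft(reversed(self.running))
        self.running.clear()

    def recover(self) -> None:
        # The primary is healthy again: resume deferred work, oldest first.
        self.serving_users = False
        while self.deferred and len(self.running) < self.slots:
            self.running.append(self.deferred.popleft())


pool = StandbyCapacity(slots=2)
for job in ["train-a", "train-b", "train-c"]:
    pool.submit(job)
pool.fail_over()   # all training paused, slots free for user traffic
pool.recover()     # training resumes where capacity allows
```

The design choice that matters is that only work that tolerates being paused is placed on the standby slots, so a failover never has to wait for it.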
