
How to Boost Reliability Without Hiring More SREs

How can companies increase reliability without hiring an army of engineers?
Summary

How can companies increase reliability without hiring an army of engineers? You'll never be able to hire people at the same rate that your fleets grow. Here's why:

  • It's hard to hire SREs right now. There are about as many job openings on LinkedIn for SREs as for developers, even though there are 40 times more developers.
  • SREs have high churn. They leave quickly, within 18 months on average, so retaining and replacing them is a big challenge.

The only real solution is to bend the curve: to manage large-scale infrastructure with fewer people. I started Shoreline to solve this exact problem. Here are 3 things we do:

1. We make it easy to automate issues away. By doing so, we reduce the toil caused by mundane, commonplace issues. We believe that people shouldn't wake up in the middle of the night to do things a machine can do.

2. We make it possible to safely expand the group of people who can fix things without escalation, so you can bring in your support and dev teams to take care of many things previously handled only by SRE experts. You'll still need experts, but not as many, because they won't be pulled into every single issue. We do that by delivering Jupyter-like notebooks that populate with diagnostic information as soon as an incident occurs and provide the recipe to fix things. Unlike static wikis that become stale, these notebooks are executable, so people are motivated to keep them up to date.

3. We make debugging across the fleet take about as long as debugging an individual node. We do this by enabling parallel distributed execution. So even if there are 100 or 1,000 nodes, you can ask questions like:

  • Are any unexpected processes running on my nodes?
  • Are the configurations what I expect, or have some of them drifted away?
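
To make the idea of parallel fleet-wide checks concrete, here is a minimal Python sketch of the configuration-drift question above. It is not Shoreline's API or implementation. The names `EXPECTED_CONFIG`, `fetch_config`, `find_drift`, and `drifted_nodes` are hypothetical, and `fetch_config` simulates a remote call instead of contacting real nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical baseline; stands in for whatever "expected" configuration a team maintains.
EXPECTED_CONFIG = {"max_connections": 100, "log_level": "info"}


def fetch_config(node: str) -> dict:
    """Placeholder for a remote call (SSH, agent RPC, ...). Simulated here."""
    config = dict(EXPECTED_CONFIG)
    if node.endswith("7"):  # assumption for the demo: nodes ending in 7 have drifted
        config["log_level"] = "debug"
    return config


def find_drift(node: str) -> tuple[str, dict]:
    """Return the node name and any settings that differ from the baseline."""
    actual = fetch_config(node)
    drift = {k: v for k, v in actual.items() if EXPECTED_CONFIG.get(k) != v}
    return node, drift


def drifted_nodes(nodes: list[str], max_workers: int = 32) -> dict[str, dict]:
    """Check all nodes concurrently and keep only the ones that drifted."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return {node: drift for node, drift in pool.map(find_drift, nodes) if drift}


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(100)]
    for node, drift in drifted_nodes(fleet).items():
        print(node, drift)
```

With a bounded thread pool, the wall-clock time still grows once the fleet is larger than the number of workers. Getting close to constant time generally means each node evaluates the check locally and only the results are collected centrally.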

At the core, this is all about:

  • making people more productive by automating a big part of the work,
  • spreading the load across more people, and
  • debugging in constant time, not in time proportional to your fleet size.