
How to Boost Reliability Without Hiring More SREs

Summary

How can companies increase reliability without hiring an army of engineers? You'll never be able to hire people at the same rate that your fleets grow. Here's why:

  • It's hard to hire SREs right now. There are about as many job openings on LinkedIn for SREs as for developers, even though there are 40 times more developers.
  • SREs have high churn. They leave quickly, within 18 months on average, so keeping and replacing them is a big challenge.

The only real solution to this is to be able to bend the curve: to manage large-scale infrastructure with fewer people. I started Shoreline to solve this exact problem. Here are 3 things we do:

1. We make it easy to automate issues away. By doing so, we reduce the toil due to mundane commonplace issues. We believe that people shouldn't wake up in the middle of the night to do things the machine can do.

2. We make it possible to safely expand the group of people who can fix things without escalation. So you can bring in your support and dev teams to take care of many things previously only handled by SRE experts. You'll still need experts, but not as many because they won't be on every single issue. We do that by delivering Jupyter-like notebooks that populate with diagnostic information as soon as an incident occurs and provide the recipe to fix things. Unlike static wikis that become stale, these notebooks are executable, so people are motivated to keep them up to date.

3. We make debugging across the fleet take about as long as debugging an individual node. We do this by enabling parallel distributed execution. So even if there are 100 or 1,000 nodes, you can ask questions like:

  • Are any unexpected processes running on my nodes?
  • Are the configurations what I expect, or have some of them drifted away?
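
To make the idea of asking one question of every node at once more concrete, here is a minimal Python sketch of a fan-out configuration drift check. It is not Shoreline's API or implementation: the `FLEET` dictionary, the node names, and the `config_fingerprint` and `find_drifted_nodes` helpers are all hypothetical, and the "remote call" is simulated with an in-memory lookup. It assumes that the configuration shared by most nodes is the expected one.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import hashlib

# Hypothetical stand-in for a fleet: node name -> contents of its config file.
# In a real fleet each lookup would be a remote call (agent, SSH, API).
FLEET = {
    "node-1": "max_conns=100\n",
    "node-2": "max_conns=100\n",
    "node-3": "max_conns=250\n",  # this node has drifted
    "node-4": "max_conns=100\n",
}


def config_fingerprint(node: str) -> tuple[str, str]:
    """Return (node, SHA-256 of its config). Simulates a per-node remote query."""
    digest = hashlib.sha256(FLEET[node].encode()).hexdigest()
    return node, digest


def find_drifted_nodes(nodes: list[str], max_workers: int = 32) -> list[str]:
    """Query all nodes concurrently and return those that differ from the majority."""
    if not nodes:
        return []
    # Fan out: each node is fingerprinted in its own worker, so wall-clock time
    # is bounded by the slowest node (given enough workers), not the node count.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fingerprints = dict(pool.map(config_fingerprint, nodes))
    # Assumption: the most common fingerprint is the expected configuration.
    majority, _ = Counter(fingerprints.values()).most_common(1)[0]
    return sorted(n for n, d in fingerprints.items() if d != majority)


if __name__ == "__main__":
    print(find_drifted_nodes(list(FLEET)))
```

The same fan-out shape would apply to the process question: swap the fingerprint for a per-node process listing and compare each result against an allow-list.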

At the core, this is all about:

  • making people more productive by automating a big part of the work,
  • spreading the load across more people, and
  • debugging in constant time, not in time proportional to your fleet size.