
Building a Culture Around Reliability

It's not some other team's job to keep your service up. Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities. We all have to own it. That is what a culture of reliability requires.
Summary

Let's talk about building a culture around reliability.

I'm famous in my teams for telling them, “The currency of management is attention” – what you pay attention to is what your reports are going to pay attention to.

I firmly believe that improving operational excellence starts with your culture.

At AWS, we had weekly operations meetings a couple of hours long, led by heads of various services and experts who knew reliability best.

We'd review the prior week's outages, ongoing campaigns, etc.

Those meetings showed that reliability mattered to the company, because people were spending time on it.

But many companies try to solve operational challenges by assigning them to a specific team or a reliability tzar.

That removes ownership and accountability from everybody else.

One of the few advantages of being an old guy is that I've seen this story play out before.

I remember failed quality tzar and security tzar initiatives.

They failed because they were fundamentally saying that this is not important enough to be part of everybody's job.

We got those things to succeed by making them part of everyone’s responsibility.

For example, by having everyone put in a unit test as part of the code review process.

We need to do the same thing for reliability.

No one wants to do on-call.

So you can:
- toss the problem to some “second-tier team” (even if you don't call them that)
- or make it part of everybody's job.

We know which one is going to improve reliability.

That doesn't mean you don't have specialists on the team.

But it isn't some other team's job to keep your service up.

Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities.

We all have to own it. That is what a culture of reliability requires.

