
Building a Culture Around Reliability

It's not some other team's job to keep your service up. Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities. We all have to own it. That is what a culture of reliability requires.
Summary

Let's talk about building a culture around reliability.

I'm famous on my teams for telling them, “The currency of management is attention” – what you pay attention to is what your reports are going to pay attention to.

I firmly believe that improving operational excellence starts with your culture.

At AWS, we had weekly operations meetings a couple of hours long, led by heads of various services and experts who knew reliability best.

We'd review the prior week's outages, ongoing campaigns, etc.

That showed reliability was important to the company, because people were spending time on it.

But many companies try to solve operational challenges by assigning them to a specific team or a reliability tzar.

That removes ownership and accountability from everybody else.

One of the few advantages of being an old guy is that I've seen this story play out before.

I remember failed quality tzar and security tzar initiatives.

They failed because they were fundamentally saying that this is not important enough to be part of everybody's job.

We got those things to succeed by making them part of everyone’s responsibility, for example by having everyone include a unit test as part of the code review process.

We need to do the same thing for reliability.

No one wants to do on-call.

So you can either:
- toss the problem to some “second-tier team” (even if you don't call them that), or
- make it part of everybody's job.

We know which one is going to improve reliability.

That doesn't mean you don't have specialists on the team.

But it isn't some other team's job to keep your service up.

Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities.

We all have to own it. That is what a culture of reliability requires.

