
Building a Culture Around Reliability

It's not some other team's job to keep your service up. Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities. We all have to own it. That is what a culture of reliability requires.
3 min
Summary

Let's talk about building a culture around reliability.

I'm known on my teams for saying, “The currency of management is attention”: what you pay attention to is what your reports are going to pay attention to.

I firmly believe that improving operational excellence starts with your culture.

At AWS, we had weekly operations meetings a couple of hours long, led by heads of various services and experts who knew reliability best.

We'd review the prior week's outages, ongoing campaigns, etc.

Those meetings showed that reliability was important to the company, because people were spending time on it.

But many companies try to solve operational challenges by assigning them to a specific team or a reliability tzar.

That removes ownership and accountability from everybody else.

One of the few advantages of being an old guy is that I've seen this story play out before.

I remember failed quality tzar and security tzar initiatives.

They failed because they were fundamentally saying that this is not important enough to be part of everybody's job.

We got quality and security to succeed by making them part of everyone’s responsibility, for example by having everyone include a unit test as part of the code review process.

We need to do the same thing for reliability.

No one wants to do on-call.

So you can:
- toss the problem to some “second-tier team” (even if you don't call them that), or
- make it part of everybody's job.

We know which one is going to improve reliability.

That doesn't mean you don't have specialists on the team.

But it isn't some other team's job to keep your service up.

Just like it's not some other team's job to fix your bugs or make sure that your system doesn't have vulnerabilities.

We all have to own it. That is what a culture of reliability requires.

