Why I Built Shoreline Incident Automation

Introduction

tl;dr -- The increasing fleet size and complexity of production environments have created an explosion in on-call incidents. You can dramatically reduce on-call fatigue and improve availability using Shoreline’s incident automation platform.

In the beginning, there was…

Waterfall. Back in those days, I’d spend two months writing a design doc, four months coding, and maybe wait another six months for the release.

The Internet and SaaS changed all that. Suddenly, we were releasing once a week. Faster releases meant testing needed to be done faster and better, leading to QA automation using Agile and pipelines.

The cloud moved us to micro-services, 10x larger fleets, and 10x more deployments. This required automating configuration and deployment, using CI/CD, GitOps and Infrastructure-as-Code.

At AWS, where I ran transactional database and analytic services, on any given day, we’d be doing 6-12 production deployments of services to the hosts in our AZs and Regions. In 2014, Werner Vogels disclosed that Amazon was doing 50M deployments each year to development, testing, and production hosts. By now, everyone knows how to approach automating deployments.

So, why does production operations feel harder than ever?

Production operations is still manual

You can’t choose when a disk fills up, a JVM goes into a hard GC loop, a certificate expires, or any of the thousand issues that happen in production operations. Keeping the lights on is a 24x7 problem.

It’s really tough. At AWS, I’d see our fleets growing far faster than the service teams operating them. Without automation to squash tickets, on-call would grow longer and ticket queues deeper. It’s not just AWS - my friends at GCP and Azure describe the same thing. If they’re struggling with this, what chance do the rest of us have?

The problem Shoreline solves

There are many observability and incident management tools out there. They’re good at telling you what’s going on in your systems and helping you bring together the people to fix them. These are necessary parts of your production ops toolchain.

But I never got excited by one more dashboard to look at or one more process optimization tool telling me what to do next. I did get excited when someone told me that an issue we would see again and again had now been automated away.

To automate incidents, we need to solve two big problems. For new issues, we often lack telemetry, and SSHing into node after node to find the needle in the haystack takes time we don’t have. For repetitive issues, safely automating the repair can be a months-long dev project, and who has time for that when there are hundreds of such issues out there?

How we solve the problem

We start with the belief that operators know how to administer a single box - the challenge is extending that to diagnose and repair a large fleet. We created an elegant DSL called Op that provides a simple pipe-delimited syntax, integrating real-time resources and metrics with the ability to execute anything you can run at the Linux command prompt. Now, you can run simple one-liners to debug and fix your fleet in about the same time it takes on a single box. You tell us what to do, we figure out how to run it in parallel, distributed across your fleet.
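To make the "one box to whole fleet" idea concrete, here is a minimal Python sketch of fanning a single per-host check out in parallel and collecting partial results. It is not Op syntax and not how Shoreline is implemented; `fan_out`, `disk_used_percent`, and the fake host data are hypothetical names invented for illustration, and a real version would run the command on each host rather than read from a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(hosts, action, max_workers=32):
    """Run action(host) on every host in parallel.

    Returns (results, errors): hosts that answered, and hosts that failed
    along with the reason, so the caller still gets partial results.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(action, host) for host in hosts}
        for host, future in futures.items():
            try:
                results[host] = future.result()
            except Exception as exc:  # record the failure, keep going
                errors[host] = str(exc)
    return results, errors


# Hypothetical stand-in for running a real command on each host.
FAKE_DISK_USAGE = {"web-1": 42, "web-2": 97, "db-1": 63}


def disk_used_percent(host):
    if host not in FAKE_DISK_USAGE:
        raise ConnectionError(f"{host} unreachable")
    return FAKE_DISK_USAGE[host]
```

Calling `fan_out(["web-1", "web-2", "db-1", "db-2"], disk_used_percent)` returns the three readings plus an error entry for `db-2`, rather than failing the whole query because one host is down.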

We also made it easy to create remediation loops that check for issues, collect diagnostics, and apply repairs automatically in the background on your hosts - each and every second! These are defined using the same Op language used in incident debugging. This eliminates the difference between in-the-moment debugging and automation. That matters because, to make a meaningful dent in repetitive incidents, it can’t take longer to fix something once and for all than it takes to fix it once.
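The shape of such a loop (check, collect diagnostics, repair) can be sketched in a few lines of Python. This is an illustrative sketch under my own assumptions, not Shoreline's engine or the Op language: the `Remediation` class, `run_loop`, and the simulated disk are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Remediation:
    name: str
    check: Callable[[], bool]    # True when the bad condition is present
    diagnose: Callable[[], str]  # captures debugging info before the repair
    repair: Callable[[], None]   # the fix to apply
    history: List[str] = field(default_factory=list)

    def tick(self) -> bool:
        """One pass of the loop; returns True if a repair was applied."""
        if not self.check():
            return False
        self.history.append(self.diagnose())  # diagnose first, then repair
        self.repair()
        return True


def run_loop(remediation, ticks, interval_seconds=1.0, sleep=time.sleep):
    """Run the remediation once per interval and count how often it fired."""
    fired = 0
    for _ in range(ticks):
        if remediation.tick():
            fired += 1
        sleep(interval_seconds)
    return fired


# Simulated host state, standing in for a real metric and a real cleanup.
disk = {"used_percent": 95}


def cleanup():
    disk["used_percent"] = 40


disk_full = Remediation(
    name="disk-full",
    check=lambda: disk["used_percent"] > 90,
    diagnose=lambda: f"used={disk['used_percent']}%",
    repair=cleanup,
)
```

Capturing the diagnostics before the repair runs is the important ordering: the evidence is saved at the moment the bad condition is seen, even though the fix then removes it.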

Over time, you’ll build a collective memory for production ops since the resource queries, metric queries, and actions are all named and easily accessible right in the UI and CLI your operators use to debug, repair, and automate.

Why doesn’t it exist already?

When I describe what we’re doing here at Shoreline, people often ask me why it doesn’t exist already. I get it. Shoreline is the tool I wish I’d had at AWS - it would have saved endless hours of repetitive work.

It’s actually a really hard problem to solve. Probably the hardest I’ve worked on in my career! Let’s look at some of the problems you need to solve...

You need to trust the data. We do this by providing per-second metrics with no lag. We leverage the Prometheus exporter ecosystem but scrape metrics each second on-box so we can distribute queries to get real-time data and run local loops to automate repairs.
Data collection and storage can’t be expensive. We use JAX and XLA to reduce the CPU required to compute metrics and alarms. We use wavelets to obtain 30x compression on the input metric stream. And, we partition queries across historical data stored inexpensively in S3, recent data on our backend, and real-time data being collected at each node.
It has to be easy to define queries and to automate repairs. You need a fluent integration between resources, metrics, and Linux commands - the nouns, adjectives, and verbs of ops. Op is a familiar syntax for anyone who knows shell and you can bring in the scripts and one-liners you already use.
It needs to be fast. We wrote a planner that takes the simple Op statements operators write and turns them into a distributed execution graph that runs in parallel across your fleet.
It needs to work even when other things are broken. All Shoreline components have been designed to be fault-tolerant. Automation runs locally even when connectivity is lost to Shoreline backends. All processes have supervisors to restart when failing and re-establish connectivity. We provide partial results when portions of the system are unavailable and show exactly which hosts are not reachable.
It needs to be safe. Some of the largest outages I’ve seen happened when an operator took too widespread an action or an automated fix snowballed into a much larger problem than the one it was fixing. Shoreline provides controls to limit the scope of a manual operator action and the number of automated executions in a given time period.
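As one illustration of the safety point, the two kinds of control can be sketched in Python as a sliding-window cap on automated executions and a cap on how much of the fleet a single manual action may touch. This is a sketch under assumptions, not Shoreline's actual controls; `ExecutionLimiter`, `within_scope`, and their parameters are hypothetical names.

```python
import time
from collections import deque


class ExecutionLimiter:
    """Allow at most max_executions automated runs per sliding window."""

    def __init__(self, max_executions, window_seconds, clock=time.monotonic):
        self.max_executions = max_executions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._stamps = deque()

    def allow(self):
        now = self.clock()
        # Forget executions that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_executions:
            return False  # cap reached: refuse rather than snowball
        self._stamps.append(now)
        return True


def within_scope(targets, fleet_size, max_fraction):
    """True if a manual action touches no more than max_fraction of the fleet."""
    return len(targets) <= max_fraction * fleet_size
```

With `ExecutionLimiter(2, 60)`, a third automated repair inside the same minute is refused, which is the behavior you want when a fix starts firing far more often than expected.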

There’s a lot more inside Shoreline, but that should give you a sense of the platform.

Aren’t these just band-aids?

Automated remediation is no substitute for root cause analysis. But, you need to alleviate the immediate problem faced by your customers while you schedule root cause repair by your dev team. If you go to the ER with a heart attack, it’s the wrong time to hear about your diet or high cholesterol.

And, operator fatigue is a real problem. Repetitive work done by hand has a 1% error rate. Doing this work manually while waiting for a code fix carries meaningful risk. That’s why we automated testing, configuration, and deployment - and it’s why we also need to automate repetitive incidents.

Shoreline also makes root-causing transient issues much easier by capturing all the debugging information whenever a bad condition is observed.

The future of Shoreline

Our goal at Shoreline is to radically increase system availability and reduce operator toil through incident automation. Today, we’re launching publicly, but we’re just getting started. We’ll keep improving our automation engine and language. We have support for Kubernetes and AWS VMs and are currently developing Azure and GCP support - we’re eager to provide cross-cloud debugging and automation! We are developing out-of-the-box Op Packs to help with the commonplace issues seen by operators - we have several already and look forward to developing a hundred more so operators at one company benefit from what others have already seen.

Want to learn more or try it out? Reach out to me at anurag@shoreline.io.