Disk Resize/Disk Clean

Disk-full incidents can lead to widespread outages and data loss, damaging the customer experience and costing revenue.

The problem

When hard disks fill up, they can cause catastrophic application failures that take significant time and effort to recover from. The resulting outages and data loss can seriously damage the customer experience and lead to lost revenue.

When managing a large fleet, it can be tricky to keep track of every node and to identify those nodes that are reaching capacity. These types of issues can come up on a weekly basis for larger fleets, especially if the fleet is supporting a wide variety of customers, each with several nodes.

Recovering from a disk-full incident can also take hours. In some cases, the disk must be physically disconnected from the original server and attached to another server so that it can be repaired. In other cases, an SRE must perform a series of checks to understand why the disk is filling up. Because there are often multiple potential causes, the root-cause analysis alone can take several hours.

The solution

With Shoreline, it is easy to create an alarm that regularly checks disk capacity across every node in your fleet. Once a filling disk is detected, companies can select from two options to address the problem:

Resize the disk by adding a preset number of gigabytes, up to a preset limit. If the limit is reached, the on-call team is alerted.
Back up a set of files to a cloud storage service such as S3, and then delete the files.

In both scenarios, the disk is resized or cleaned automatically, without human intervention, and the entire event is tracked in Shoreline’s audit trails and dashboards. This lets customers see how often these situations occur and, when necessary, ask engineering to fix an underlying issue in the application.
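The resize option described above can be sketched as a small decision routine. This is an illustrative Python sketch, not Shoreline's own alarm or action syntax: the names `ResizePolicy`, `usage_pct`, and `decide`, as well as the 80% threshold, 10 GB step, and 200 GB cap, are assumptions chosen for the example. The backup-and-delete option is not shown.

```python
import shutil
from dataclasses import dataclass


@dataclass
class ResizePolicy:
    # All values below are hypothetical defaults chosen for illustration.
    threshold_pct: float = 80.0  # alarm fires at or above this usage
    step_gb: int = 10            # gigabytes added per automatic resize
    max_size_gb: int = 200       # hard cap; past this, alert the on-call team


def usage_pct(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def decide(used_pct: float, current_size_gb: int, policy: ResizePolicy) -> str:
    """Pick an action for one node: 'ok', 'resize', or 'alert_on_call'."""
    if used_pct < policy.threshold_pct:
        return "ok"
    # Resize only if one more step still fits under the preset limit.
    if current_size_gb + policy.step_gb <= policy.max_size_gb:
        return "resize"
    return "alert_on_call"


if __name__ == "__main__":
    policy = ResizePolicy()
    # The current disk size (100 GB) is a made-up input for this demo.
    print(decide(usage_pct("/"), 100, policy))
```

A real remediation would replace the returned strings with calls to the cloud provider's volume API and the paging system; both are left out here because they depend on the environment.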

Highlights

Customer experience impact: High (potential hours of downtime)
Occurrence frequency: High (weekly for fleets with many nodes)
Shoreline time to repair: Low (1-2 minutes to repair)
Time to diagnose manually
Security
Cost impact
Time to repair manually: 1-6 manual hours to repair

Give us two weeks and we'll show you how to eliminate 30% of your incidents.