Disk Resize/Disk Clean

Disk-full incidents can lead to widespread outages and data loss that damage customer experience and cost revenue.


The problem

When hard disks fill up, they can cause catastrophic application failures that take a lot of time and effort to recover from. Disk-full incidents can lead to widespread outages and data loss that seriously damage customer experience and lead to lost revenue.

When managing a large fleet, it can be tricky to keep track of every node and to identify those nodes that are reaching capacity. These types of issues can come up on a weekly basis for larger fleets, especially if the fleet is supporting a wide variety of customers, each with several nodes.

Recovering from a disk-full incident can also take hours. In some cases, the disk must be physically disconnected from the original server and attached to another server so that it can be repaired. In other cases, an SRE must perform a series of checks to understand why the disk is filling. There are often multiple potential causes, so the root cause analysis alone can take several hours.

The solution

With Shoreline, it is easy to create an alarm that regularly checks disk capacity across every node in your fleet. Once a filling disk is detected, companies can select from two options to address the problem:

  1. Resize the disk by adding a preset number of gigabytes, up to a preset limit. If the limit is reached, the on-call team is alerted.
  2. Back up a set of files to a cloud storage service like S3 and then delete the files.

In both of the above scenarios, disk space is recovered automatically, without human intervention, and the entire event is tracked in Shoreline’s audit trails and dashboards. This allows customers to see how often these situations occur and, when necessary, ask engineering to fix an underlying issue in the application.
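The logic behind the two options can be sketched in a few lines of Python. This is a minimal illustration only, not Shoreline's implementation or API: the threshold, step size, and limit are assumed values, and the names `plan_resize`, `backup_then_delete`, and the `upload` callable are hypothetical. The `upload` callable stands in for whatever client copies a file to cloud storage such as S3.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List

# Illustrative assumptions, not Shoreline defaults.
USAGE_THRESHOLD = 0.85   # act when the disk is at least 85% full
RESIZE_STEP_GB = 10      # gigabytes added per automatic resize
RESIZE_LIMIT_GB = 100    # largest size automation may grow a disk to


def disk_usage_fraction(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


@dataclass
class Decision:
    action: str        # "none", "resize", or "alert_on_call"
    new_size_gb: int   # disk size after the action is taken


def plan_resize(current_size_gb: int, used_fraction: float) -> Decision:
    """Option 1: grow by a preset step, or alert once the limit is reached."""
    if used_fraction < USAGE_THRESHOLD:
        return Decision("none", current_size_gb)
    if current_size_gb + RESIZE_STEP_GB > RESIZE_LIMIT_GB:
        # Growing further would exceed the limit, so hand off to a human.
        return Decision("alert_on_call", current_size_gb)
    return Decision("resize", current_size_gb + RESIZE_STEP_GB)


def backup_then_delete(files: Iterable[Path],
                       upload: Callable[[Path], bool]) -> List[Path]:
    """Option 2: delete each file only after `upload` reports success."""
    deleted = []
    for f in files:
        if upload(f):      # hypothetical uploader, e.g. a wrapper around an S3 client
            f.unlink()
            deleted.append(f)
    return deleted
```

Deleting a file only after its upload succeeds keeps a failed backup from turning a disk-space problem into a data-loss problem. In practice the check would run on a schedule on every node, with the decision and outcome recorded for auditing.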

Highlights

Customer experience impact: Potential hours of downtime (High)
Occurrence frequency: Weekly for fleets with many nodes (High)
Shoreline time to repair: 1-2 minutes to repair (Low)
Time to diagnose manually
Security
Cost impact
Time to repair manually: 1-6 manual hours to repair (High)
