When hard disks fill up, they can cause catastrophic application failures that take significant time and effort to recover from. Disk-full incidents can lead to widespread outages and data loss that seriously damage the customer experience and cost revenue.
When managing a large fleet, it can be tricky to keep track of every node and identify the ones approaching capacity. These issues can come up weekly in larger fleets, especially when the fleet supports a wide variety of customers, each with several nodes.
Recovering from a disk-full incident can also take hours. In some cases, the disk must be physically disconnected from the original server and attached to another server so that it can be repaired. In other cases, an SRE must perform a series of checks to understand why the disk is filling. Because there are often multiple potential causes, root cause analysis alone can take several hours.
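To give a sense of what those manual checks involve, here is a minimal Python sketch of two first steps an SRE might take on a single node: measuring how full a filesystem is, and finding which directories take up the most space. This is an illustration only and is not Shoreline's own tooling. The function names, the 80% threshold, and the `/var` path are all assumptions made for the example.

```python
import os
import shutil

# Hypothetical threshold for this example; pick a value that suits your fleet.
THRESHOLD_PERCENT = 80.0


def percent_used(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def dir_size(path):
    """Sum the sizes of all files under `path`, skipping unreadable entries."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                # lstat does not follow symlinks, so link targets are not counted.
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass
    return total


def largest_entries(root, top_n=5):
    """Return the `top_n` immediate children of `root` as (bytes, path), largest first."""
    sizes = []
    for entry in os.scandir(root):
        try:
            if entry.is_dir(follow_symlinks=False):
                size = dir_size(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
        except OSError:
            continue
        sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    used = percent_used("/")
    print(f"/ is {used:.1f}% full")
    if used >= THRESHOLD_PERCENT:
        # "/var" is only an example starting point for the investigation.
        for size, path in largest_entries("/var"):
            print(f"{size / 1e6:10.1f} MB  {path}")
```

Even this simple script has to be copied to, and run on, each suspect node, and it only narrows down where the space went, not why. That is part of the reason manual root cause analysis is slow.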
With Shoreline, you can create an alarm that regularly checks disk capacity across every node in your fleet. Once a filling disk is detected, you can choose between two options to address the problem: