Disk Failures in kern.log

Detect when a disk has errors or has entirely failed by inspecting the OS’s kern.log. Automatically capture these events and kick off fixes such as recycling the VM.

Major outage

The problem

When a disk starts to go bad, acting fast matters. We may have only a few reads left before it stops spinning or the chips fail. Even worse, if we don’t know it’s happening, the data could be irretrievably lost without warning.

It’s like looking for a needle in a fleet of haystacks. Collecting all the logs across a fleet of hosts is arduous. Then comes slogging through the mountain that is each log file, looking for the magic words that mean impending failure, all the while ignoring the irrelevant debug messages that worked their way in. And sure enough, as soon as we finish, the data is stale and we need to begin again. This is definitely no way to keep track of a fleet of online properties.

Automating this with tools like Ansible only goes so far.  The operator needs to know to look for an issue.  Most of the time, they’ll run the Ansible command only to have it come back clean.  “Should I really run this every day when I’ve gotten no value from it in months? There are so many more urgent tasks,” says the Ops team.  And suddenly it’s been months since we checked.  And then there’s the dreaded ticket from a user complaining of data loss.  Oops.

The solution

Shoreline has an agent on each host that watches kern.log and matches a specific regular expression against it to find problematic log statements. Are SMART errors starting to build up? Is write latency suddenly and dramatically increasing? Are we getting CRC errors in files? If a matching log line is found, an alert is raised in the Shoreline dashboard and, optionally, in an external ticketing system. If configured, Shoreline can immediately start copying critical business information to another network location.
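
To make the idea of regex matching against kern.log concrete, here is a minimal Python sketch. It is not Shoreline’s agent or its configuration. The names PATTERNS, DiskEvent, scan_lines, and scan_file are hypothetical, and the three expressions are only illustrative examples of disk-related kernel messages; a real deployment would tune them to the kernel version, drivers, and filesystems in use.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Illustrative patterns only (an assumption, not Shoreline's actual rules).
# Each regex captures the affected device name in group 1.
PATTERNS = [
    ("io_error", re.compile(r"I/O error, dev (\w+)")),
    ("buffer_io_error", re.compile(r"Buffer I/O error on dev(?:ice)? (\w+)")),
    ("fs_error", re.compile(r"EXT4-fs error \(device (\w+)\)")),
]


class DiskEvent(NamedTuple):
    kind: str    # which pattern matched
    device: str  # device name captured from the log line
    line: str    # the full log line, for the alert body


def scan_lines(lines: Iterable[str]) -> Iterator[DiskEvent]:
    """Yield one DiskEvent for every line that matches a pattern."""
    for line in lines:
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                yield DiskEvent(kind, match.group(1), line.rstrip("\n"))
                break  # report each line at most once


def scan_file(path: str = "/var/log/kern.log") -> Iterator[DiskEvent]:
    """Scan a log file on disk; the default path is the usual Debian/Ubuntu one."""
    with open(path, errors="replace") as handle:
        yield from scan_lines(handle)
```

Each DiskEvent could then be handed to whatever raises the alert or opens the ticket. Running a scan like this continuously on every host, rather than on demand, is what removes the need for an operator to remember to look.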

Highlights

Customer experience impact: High (potential hours of downtime)
Occurrence frequency: High (until the root cause is identified)
Shoreline time to repair: Low (1-2 minutes)
Time to diagnose manually: High (toil of collecting, aggregating, and filtering logs)
Security
Cost impact
Time to repair manually: High (1-2 manual hours)
