When a disk starts to go bad, it’s really important to act fast. We may only have a few reads left before the platters stop spinning or the chips fail. Even worse, if we don’t know it’s happening, the data could be irretrievably lost without warning.
It’s like looking for a needle in a fleet of haystacks. Collecting the logs from every host is arduous, and then comes slogging through the mountain of each log file, hunting for the magic words that mean impending failure while ignoring the irrelevant debug messages that worked their way in. And sure enough, as soon as we finish, the data is stale and we need to begin again. This is no way to keep track of a fleet of hosts.
Automating this with tools like Ansible only goes so far. The operator needs to know to look for an issue. Most of the time, they’ll run the Ansible command only to have it come back clean. “Should I really run this every day when I’ve gotten no value from it in months? There are so many more urgent tasks,” says the Ops team. And suddenly it’s been months since we checked. And then there’s the dreaded ticket from a user complaining of data loss. Oops.
Shoreline runs an agent on each host specifically looking through kern.log, matching regular expressions that flag problematic log statements. Are SMART errors starting to build up? Is write speed suddenly and dramatically dropping? Are we getting CRC errors in files? If a matching log line is found, an alert is raised in the Shoreline dashboard and, optionally, in an external ticketing system. If configured, Shoreline can immediately start copying critical business information to another network location.
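To make the idea concrete, here is a minimal sketch of what regex-based kern.log matching might look like. The patterns and the sample log lines below are hypothetical illustrations, not Shoreline’s actual rules:

```python
import re

# Hypothetical patterns approximating the kinds of failure signatures
# an agent might watch for in kern.log; Shoreline's real expressions
# are not shown here.
FAILURE_PATTERNS = [
    re.compile(r"SMART error", re.IGNORECASE),
    re.compile(r"CRC error", re.IGNORECASE),
    re.compile(r"I/O error", re.IGNORECASE),
]

def scan_kern_log(lines):
    """Return only the log lines matching a known failure pattern."""
    return [line for line in lines
            if any(p.search(line) for p in FAILURE_PATTERNS)]

# Illustrative sample: two failure lines mixed with routine noise.
sample = [
    "kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x0",
    "kernel: Buffer I/O error on dev sda1, logical block 0",
    "smartd[512]: Device: /dev/sda, SMART error count increased",
    "systemd[1]: Started Daily apt upgrade and clean activities.",
]

if __name__ == "__main__":
    for hit in scan_kern_log(sample):
        print(hit)
```

In a real deployment, a check like this would tail the live log and forward matches to an alerting endpoint rather than print them, but the core filtering step is this simple.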