How do we find the needle in the haystack in the logs? Even worse, how do we find the needle in a fleet of haystacks? It’s arduous to collect all the logs across a fleet of hosts, then slog through the mountain that is each log file looking for the magic words that mean impending failure, all the while ignoring the irrelevant debug messages that worked their way in. And sure enough, as soon as we finish, the data is stale and we need to begin again. This is definitely no way to keep track of a fleet of online properties.
Automating this with tools like Ansible only goes so far. The operator needs to know to look for an issue. Most of the time, they’ll run the Ansible command only to have it come back clean. “Should I really run this every day when I’ve gotten no value from it in months? There are so many more urgent tasks,” says the Ops team. And suddenly it’s been months since we checked. And then there’s the dreaded ticket from a user complaining of data loss. Oops.
This Shoreline Runbook lets you query all the logs across the entire fleet. Add your regular expression or log text, and it counts the errors on each node. Now you know exactly where the errors are, and how frequently they occur.
You can use this runbook as a template for your own creations, tailoring the query specifically to the problems you have difficulty finding. Or use this runbook to search for the cause of a failure or the blast radius of an outage or breach by looking through all the logs for the magic words at once.