When you experience downtime or a service interruption, it’s easy to see that something is broken, but it’s rarely obvious whether the broken element lives at the infrastructure level or the application level. Even though your observability and monitoring tools can alert you that an issue exists, their alarms typically aren’t specific enough to nail down the cause. Searching for that needle in a haystack can be a long and arduous process, causing expensive delays on the way to fixing the issue. This is especially true for any business that operates multiple Kubernetes clusters and administers systems built on top of microservices.
With so many things — in so many places — that could be broken, how do you efficiently search for the source of an issue and implement a solution?
For many, it’s an inefficient manual process. A problem occurs, and engineers spend hours searching Stack Overflow for answers or trying a series of random commands to diagnose and repair it. The worst part about this process? It’s an unreported waste of time. Most teams don’t account for the time spent searching for a solution when looking holistically at how long it took to resolve an issue.
Sure, manually running a command to fix an issue doesn’t take long. But recalling which commands to run — and in what order — to manually diagnose an issue is the silent killer of your team’s productivity.
Shoreline runs an agent on each host that watches kern.log, matching each line against a regular expression to catch problematic log statements. Are we getting CRC errors on network requests? Is network read or write speed suddenly and dramatically increasing on a particular interface? If a matching log line is found, an alert is raised in the Shoreline dashboard and, optionally, in an external ticketing system. If configured, Shoreline can also immediately start copying critical business information to another network location.
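To make the pattern concrete, here is a minimal sketch (not Shoreline’s actual agent code) of what a per-host log watcher like this does: follow kern.log and raise an alert whenever a line matches a regular expression. The log path, the example pattern, and the `raise_alert` function are illustrative assumptions.

```python
import re
import time

# Hypothetical pattern for problematic kernel log lines (CRC errors, link flaps,
# I/O errors). The actual expressions Shoreline matches are not shown here.
PROBLEM_PATTERN = re.compile(r"(CRC error|link (?:up|down)|I/O error)", re.IGNORECASE)


def tail_kern_log(path="/var/log/kern.log"):
    """Follow kern.log and yield new lines as they are appended, like `tail -f`."""
    with open(path, "r") as log:
        log.seek(0, 2)  # start at the end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line


def raise_alert(line):
    # Placeholder: a real agent would push this to a dashboard or ticketing system.
    print(f"ALERT: problematic kernel log line: {line.strip()}")


if __name__ == "__main__":
    for entry in tail_kern_log():
        if PROBLEM_PATTERN.search(entry):
            raise_alert(entry)
```

The point of the sketch is the division of labor: the agent does the cheap, continuous matching on every host, so an alert fires the moment a bad line appears instead of waiting for a human to go hunting through logs after the fact.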