Network Failures in kern.log

Detect when a network interface has errors or has entirely failed by inspecting the OS’s kern.log. Automatically capture these events and initiate fixes such as recycling the VM.

Major outage

The problem

When you experience downtime or a service interruption, it’s easy to see that something is broken. It is much harder to tell whether the broken element lives at the infrastructure level or the application level. Even though your observability and monitoring tools can alert you that an issue exists, their alarms typically aren’t specific enough to pin down the cause. Searching for that needle in a haystack can be a long and arduous process, causing expensive delays on the way to fixing the issue. This is especially true for any business that operates multiple Kubernetes clusters and administers systems built on top of microservices.

With so many things — in so many places — that could be broken, how do you efficiently search for the source of an issue and implement a solution?

For many, it’s an inefficient manual process. A problem occurs, and engineers then spend hours searching Stack Overflow for answers or trying a series of random commands to diagnose and repair it. The worst part about this process? It’s an unreported waste of time. Most teams don’t count the time spent searching for a solution when they measure how long it took to solve an issue.

Sure, manually running a command to fix an issue doesn’t take long. But recalling which commands to run — and in what order — to manually diagnose an issue is the silent killer of your team’s productivity.

The solution

Shoreline runs an agent on each host that scans kern.log and matches each line against a regular expression that identifies problematic log statements. Are we getting CRC errors on network requests? Is network read or write speed suddenly and dramatically increasing on a particular interface? If a matching log line is found, an alert is raised in the Shoreline dashboard and, optionally, in an external ticketing system. If configured, Shoreline can immediately start copying critical business information to another network location.
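To make the matching step concrete, here is a minimal sketch of scanning kern.log lines with regular expressions. It is not Shoreline's implementation. The pattern list, the `NetworkEvent` type, the `find_network_events` function, and the assumed syslog line layout are all hypothetical, and the CRC pattern in particular assumes a message format that will vary by driver.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Strips the syslog prefix ("... kernel: [uptime] ") so patterns see only the message.
# Assumption: lines follow the common "<date> <host> kernel: [<uptime>] <msg>" layout.
KERNEL_PREFIX = re.compile(r"kernel: (?:\[\s*\d+\.\d+\] )?(?P<msg>.*)")

# Hypothetical patterns for illustration; tune them to the drivers you actually run.
PATTERNS = [
    ("tx_timeout", re.compile(
        r"NETDEV WATCHDOG: (?P<iface>\S+) .*transmit queue \d+ timed out")),
    ("link_down", re.compile(
        r"(?P<iface>\S+?):? (?:NIC )?Link is Down", re.IGNORECASE)),
    ("crc_error", re.compile(
        r"(?P<iface>\S+): .*\bCRC errors?\b", re.IGNORECASE)),
]


class NetworkEvent(NamedTuple):
    kind: str     # which pattern matched
    iface: str    # interface name captured from the message
    message: str  # kernel message with the syslog prefix removed


def find_network_events(lines: Iterable[str]) -> Iterator[NetworkEvent]:
    """Yield one event for each line that matches a network-failure pattern."""
    for line in lines:
        prefix = KERNEL_PREFIX.search(line)
        message = prefix.group("msg") if prefix else line.strip()
        for kind, pattern in PATTERNS:
            match = pattern.search(message)
            if match:
                yield NetworkEvent(kind, match.group("iface"), message)
                break  # report the first matching pattern only
```

On a host that writes kernel messages to /var/log/kern.log, the function could be fed the open file object, and each yielded event could then be forwarded to whatever alerting or remediation hook you use.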

Highlights

Customer experience impact: High (potential hours of downtime)
Occurrence frequency: High (until the root cause is identified)
Shoreline time to repair: Low (1-2 minutes)
Time to diagnose manually: High (toil of collecting, aggregating, and filtering logs)
Security
Cost impact
Time to repair manually: High (5 manual hours)
