The problem

Fluentd is a great logging system that aggregates the logs from mission-critical systems into one data stream we can pipe anywhere. It’s a great source of application diagnostic data, but only if it’s running correctly.

If the logging platform slows down, anything that sends logs synchronously slows down too, which causes customer pain and can lead to system outages. Worse, the tools that depend on this diagnostic data cannot report a problem when the problem is the logging system itself.

The solution

This Shoreline automation monitors the liveness probe and health check on the Fluentd containers. If a pod is stuck or slow to respond, Shoreline automatically kills it so that Kubernetes creates a replacement pod and brings the system back online.
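To make the check-then-restart idea concrete, here is a minimal Python sketch of the same loop done by hand. It is not Shoreline's implementation. It assumes Fluentd's monitor_agent plugin is enabled on its default port 24220, that kubectl is configured for the cluster, and that the pod is owned by a controller such as a DaemonSet so that Kubernetes recreates it after deletion. The latency threshold and all function names are hypothetical.

```python
import subprocess
import time
import urllib.error
import urllib.request

# Assumptions (not from the original text): monitor_agent is enabled on its
# default port, and kubectl can reach the cluster.
HEALTH_PATH = "/api/plugins.json"
MAX_LATENCY_S = 2.0  # hypothetical threshold for "slow to respond"


def probe(base_url, timeout_s=5.0):
    """Run one HTTP health check and return (ok, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(base_url + HEALTH_PATH, timeout=timeout_s) as resp:
            ok = resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as unhealthy.
        ok = False
    return ok, time.monotonic() - start


def should_restart(ok, latency_s, max_latency_s=MAX_LATENCY_S):
    """A pod is restarted if the check failed or took too long."""
    return (not ok) or latency_s > max_latency_s


def delete_pod_command(pod, namespace):
    """Build the kubectl command that deletes one pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


def remediate(pod, namespace, base_url):
    """Probe one Fluentd pod and delete it if it is stuck or slow."""
    ok, latency = probe(base_url)
    if should_restart(ok, latency):
        subprocess.run(delete_pod_command(pod, namespace), check=True)
        return True
    return False
```

A call such as remediate("fluentd-abc12", "logging", "http://10.0.0.12:24220") would run one check against a single pod; the pod name, namespace, and address here are placeholders. Keeping the decision in should_restart, separate from the probe and the kubectl call, makes the threshold easy to tune and test without touching a live cluster.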


Customer experience impact: Everything slows down if the logging system is slow
Occurrence frequency: Until root cause is found
Shoreline time to repair: 1-2 Hours
Time to diagnose manually:
Cost impact:
Time to repair manually: 5-7 Manual Hours

