Pod Status Monitoring

Restart pods stuck in CrashLoopBackOff and other failed states, and alert if the restart doesn’t bring them back online

Kubernetes

The problem

When we first deploy a solution into Kubernetes, we’re likely watching the pods start up, manually testing and verifying the service endpoints, and validating that the system is functioning as expected. As time passes, automated DevOps pipelines start deploying new versions and security patches require updating to new image tags, and it becomes easy for a pod to get stuck without anyone noticing.

Perhaps Kubernetes can’t pull the specified image tag, or perhaps the software crashes on startup. Perhaps missing configuration makes the pod crash quickly. Or maybe the pod is frequently evicted.

Whatever the cause, this container is now offline. If all the containers in this microservice are offline, the system is offline until the failure is discovered and the root cause is identified and corrected.
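
The failure modes above usually show up in the pod's status as reported by the Kubernetes API (for example in the output of `kubectl get pods -o json`). As a rough illustration, the sketch below inspects one pod object, already parsed into a Python dict, and reports why it looks stuck. The function name `stuck_reason` and the set of reasons it checks are our own assumptions for this example, not Shoreline's actual detection rules.

```python
from typing import Optional

# Container "waiting" reasons that typically indicate a stuck pod.
# This list is an illustrative assumption, not an exhaustive or official set.
STUCK_WAITING_REASONS = {
    "CrashLoopBackOff",            # software crashes on startup
    "ImagePullBackOff",            # image tag can't be pulled
    "ErrImagePull",
    "CreateContainerConfigError",  # e.g. a missing ConfigMap or Secret
}


def stuck_reason(pod: dict) -> Optional[str]:
    """Return a short reason if the pod looks stuck, or None if it looks healthy."""
    status = pod.get("status", {})

    # Evicted pods are reported with phase "Failed" and reason "Evicted".
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        return "Evicted"

    # Otherwise look for a container that is waiting for a known bad reason.
    for container in status.get("containerStatuses") or []:
        waiting = container.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in STUCK_WAITING_REASONS:
            return waiting["reason"]

    return None
```

Running this over every item in a pod list gives a simple inventory of pods that need attention, along with the reason each one was flagged.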

The solution

Shoreline automatically discovers pods in CrashLoopBackOff, crashed, and other undesirable states. When such a pod is discovered, it’s immediately evicted so Kubernetes can bring a new pod online. If this doesn’t bring the container back online, an alert is generated in Shoreline and, optionally, in an external system like OpsGenie or PagerDuty. When the user clicks the link in the ticket, a Shoreline runbook facilitates changing the image version, adding tolerations or taints to schedule the pod onto different hosts, or restoring a previously known good YAML configuration.
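
The restart-then-escalate flow described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Shoreline's implementation: `remediate`, `restart`, `is_healthy`, and `alert` are hypothetical names, and the three callables are injected so the caller decides how a pod is actually evicted, checked, and reported (for example through a Kubernetes client and a paging integration).

```python
import time
from typing import Callable


def remediate(
    pod_name: str,
    restart: Callable[[str], None],      # hypothetical: evicts/deletes the pod
    is_healthy: Callable[[str], bool],   # hypothetical: checks the replacement pod
    alert: Callable[[str, str], None],   # hypothetical: raises a ticket or page
    max_restarts: int = 1,               # assumption: a single restart before alerting
    wait_seconds: float = 30.0,          # assumption: grace period for the new pod
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart a stuck pod; alert if it is still unhealthy afterwards.

    Returns True if the pod recovered, False if an alert was raised.
    """
    for _ in range(max_restarts):
        restart(pod_name)
        sleep(wait_seconds)  # give Kubernetes time to schedule a replacement
        if is_healthy(pod_name):
            return True

    alert(pod_name, f"still unhealthy after {max_restarts} restart attempt(s)")
    return False
```

Keeping the restart budget small matters: if the root cause is a bad image tag or missing configuration, repeated restarts won't help, and it is better to hand off to a person quickly.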

Highlights

Customer experience impact
Potential hours of downtime
HIGH
Occurrence frequency
Until the root cause is identified
HIGH
Shoreline time to repair
1-2 minutes
Low
Time to diagnose manually
1-4 hours
HIGH
Security
Cost impact
Time to repair manually
1-2 manual hours
HIGH
