When we first deploy a solution into Kubernetes, we’re likely watching the pods start up, manually testing and verifying the service endpoints, and validating that the system is functioning as expected. As time passes, automated DevOps pipelines start deploying new versions and security patches require updating to new image tags, and it becomes easy for a pod to get stuck without anyone noticing.
Perhaps Kubernetes can’t pull the specified image tag, or perhaps the software crashes on startup. Perhaps missing configuration makes the pod crash quickly. Or maybe the pod is frequently evicted.
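Each of these failure modes shows up in the pod's status as reported by Kubernetes, so they can be recognized mechanically. The following is a minimal sketch, not Shoreline's implementation: it assumes a pod object shaped like the JSON from `kubectl get pod -o json`, and the function name `stuck_reason` and the exact set of reasons checked are illustrative choices.

```python
from typing import Optional

# Waiting-state reasons that usually mean a container cannot start on its own.
STUCK_WAITING_REASONS = {
    "ErrImagePull",                # the image or tag could not be pulled
    "ImagePullBackOff",            # Kubernetes is backing off after failed pulls
    "CrashLoopBackOff",            # the container keeps crashing after start
    "CreateContainerConfigError",  # e.g. a referenced ConfigMap or Secret is missing
}


def stuck_reason(pod: dict) -> Optional[str]:
    """Return a short reason if the pod looks stuck, otherwise None."""
    status = pod.get("status") or {}

    # Evicted pods are reported at the pod level: phase Failed, reason Evicted.
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        return "Evicted"

    # Other failures are reported per container, in the waiting state.
    for container in status.get("containerStatuses") or []:
        waiting = (container.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in STUCK_WAITING_REASONS:
            return waiting["reason"]
    return None
```

Running a check like this over every pod in a namespace would give a list of candidates to investigate, one reason per pod.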
Whatever the cause, this container is now offline. If all the containers in this microservice are offline, the system is offline until the failure is discovered and the root cause is identified and corrected.
Shoreline automatically discovers pods in CrashLoopBackOff, crashed, and other undesirable states. When such a pod is discovered, it’s immediately evicted so Kubernetes can bring a new pod online. If this doesn’t bring the container back online, an alert is generated in Shoreline and, optionally, in an external system like OpsGenie or PagerDuty. When the user clicks the link in the ticket, a Shoreline runbook facilitates changing the image version, adding tolerations or taints so the pod schedules onto different hosts, or restoring a previously known good YAML configuration.
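The evict-then-alert flow described above can be sketched in a few lines. This is an illustration of the control flow only, under stated assumptions, and not Shoreline's API: `remediate` and the three callables passed into it (`evict_pod`, `replacement_is_healthy`, `send_alert`) are hypothetical names standing in for whatever evicts the pod, checks on its replacement, and notifies the alerting system.

```python
from typing import Callable


def remediate(
    pod_name: str,
    evict_pod: Callable[[str], None],
    replacement_is_healthy: Callable[[str], bool],
    send_alert: Callable[[str], None],
) -> str:
    """Evict a stuck pod and alert if the replacement is still unhealthy."""
    # Step 1: evict the pod so its controller can schedule a replacement.
    evict_pod(pod_name)

    # Step 2: if the replacement came up, nothing more is needed.
    if replacement_is_healthy(pod_name):
        return "recovered"

    # Step 3: eviction did not help, so hand the problem to a human.
    send_alert(f"pod {pod_name} is still unhealthy after eviction")
    return "alerted"
```

Passing the three actions in as callables keeps the decision logic separate from any particular cluster client or paging service, which also makes it easy to exercise with stand-in functions.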