IP Exhaustion

Clear away failed jobs or pods that are consuming too many IP addresses.

Kubernetes

The problem

Kubernetes jobs are the perfect way to handle menial, automated tasks like ETL, database backups, and nightly reports. If the built-in Job controller is insufficient, you can step up to Argo or other job-processing technologies that spin up pods on demand to accomplish the next step in a process or run a shell script.
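
For a sense of what the built-in path looks like, here is a minimal sketch using the official Kubernetes Python client; the job name, image, and command are placeholders and not part of any Shoreline tooling.

```python
from kubernetes import client, config

config.load_kube_config()  # inside a cluster, use config.load_incluster_config() instead

# Hypothetical nightly-report job: one pod that runs a shell command and never restarts.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="nightly-report"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="report",
                        image="busybox:1.36",
                        command=["sh", "-c", "echo generating report && sleep 5"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```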

Automation like this is a great way to add business value to an existing web property or integrate two systems. As each container spins up, it grabs a hostname and an IP address and gets to work. But what happens after the job runs?

Whether a job finishes successfully or errors out, its pod is left behind in a completed or errored state. To facilitate debugging and diagnosis, Kubernetes keeps these pod remnants around so you can harvest the pod’s events or the container’s logs.
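
One way to see those remnants is to list Job-owned pods that have reached a terminal phase. The sketch below uses the official Kubernetes Python client; the selectors are illustrative, not part of the Op Pack.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Pods in a terminal phase; Job-owned pods show up here once the job finishes or errors out.
for phase in ("Succeeded", "Failed"):
    pods = core.list_pod_for_all_namespaces(field_selector=f"status.phase={phase}")
    for pod in pods.items:
        owners = [ref.kind for ref in (pod.metadata.owner_references or [])]
        if "Job" in owners:
            print(pod.metadata.namespace, pod.metadata.name, phase)
```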

After a few dozen or a few hundred runs, these leftover pods can quickly exhaust the cluster’s pool of available pod IP addresses. When the IPs are gone, new pods can’t get scheduled.
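
To gauge how close a cluster is getting, one rough check is to tally those leftover pods per node and compare the count against each node’s allocatable pod capacity. This sketch assumes the official Python client and does not model any particular CNI’s IP accounting:

```python
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Count terminated pods still hanging around on each node.
leftovers = Counter()
for phase in ("Succeeded", "Failed"):
    for pod in core.list_pod_for_all_namespaces(field_selector=f"status.phase={phase}").items:
        if pod.spec.node_name:
            leftovers[pod.spec.node_name] += 1

# Compare against each node's allocatable pod count to spot nodes nearing exhaustion.
for node in core.list_node().items:
    allocatable = node.status.allocatable.get("pods", "unknown")
    name = node.metadata.name
    print(f"{name}: {leftovers[name]} leftover pods, allocatable pods={allocatable}")
```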

The solution

This Shoreline automation scans for terminated Kubernetes jobs and their corresponding completed or errored pods. When it finds one, it harvests the diagnostic details needed for more permanent remediation, then terminates the pod. This frees the pod’s IP address and makes it possible to schedule new jobs. As part of this action, the Shoreline automation can create a ticket in an external system such as Jira or PagerDuty with the associated diagnostic information so developers can adjust the code and make the job complete successfully.
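
The Op Pack itself is configured in Shoreline, but the underlying flow can be sketched with the Kubernetes Python client; create_ticket below is a hypothetical stand-in for a Jira or PagerDuty integration.

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()


def create_ticket(summary: str, details: str) -> None:
    """Hypothetical stub: replace with a Jira or PagerDuty API call."""
    print("TICKET:", summary)


for phase in ("Succeeded", "Failed"):
    for pod in core.list_pod_for_all_namespaces(field_selector=f"status.phase={phase}").items:
        ns, name = pod.metadata.namespace, pod.metadata.name
        if not any(ref.kind == "Job" for ref in (pod.metadata.owner_references or [])):
            continue  # only clean up Job-owned pods

        # Harvest diagnostics before deleting anything.
        try:
            logs = core.read_namespaced_pod_log(name, ns, tail_lines=50)
        except client.exceptions.ApiException:
            logs = "<logs unavailable>"
        events = core.list_namespaced_event(ns, field_selector=f"involvedObject.name={name}")
        event_lines = "\n".join(f"{e.reason}: {e.message}" for e in events.items)

        create_ticket(f"Cleaned up {phase.lower()} pod {ns}/{name}", logs + "\n" + event_lines)

        # Deleting the pod releases its IP so new pods can be scheduled.
        core.delete_namespaced_pod(name, ns)
```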

Check out this blog post for more information on the IP Exhaustion Op Pack.

Highlights

Customer experience impact: High (potential hours of downtime)
Occurrence frequency: High (often after many jobs have run)
Shoreline time to repair: Low (1-2 minutes)
Time to repair manually: High (1-2 manual hours)
