Pod Out of Memory (OOM)


The problem

Many different types of application errors can lead to out-of-memory errors (OOMs) in Kubernetes. When a pod is close to running out of memory, an operator needs to capture diagnostic and debugging information. Any processes running at the moment the memory limit is reached will be killed. The pod will then be restarted, and if the operator hasn’t captured the diagnostic data before the OOM, that data will be lost. Any attempt to capture diagnostic data afterwards would be pointless because the issue is no longer happening.

The challenge is that the operator wants to capture diagnostic data before the OOM, but as close to it as possible, in order to have the full picture of the incident. This is particularly difficult if the monitoring data is only aggregated every 5 minutes.

Below are some of the most common causes of OOM:

  • Memory leaks
  • Loading more data into memory than anticipated
  • Running more processes than anticipated
  • Another process on the machine is using more memory than expected
  • A specific execution/query consumes more resources than anticipated
  • Increased load / usage
  • Deploying a new version of the software with a different memory footprint

Any application that consumes memory can suffer from this type of issue. When an OOM occurs, it hurts customer experience. It is also hard to capture the diagnostic data. Even with the diagnostic data, the underlying cause can be hard to diagnose and hard to fix.
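To make the timing problem concrete, the sketch below estimates how long a pod has left before it reaches its memory limit, using a simple linear extrapolation from two usage samples. The function name and the sample numbers are hypothetical and chosen only for illustration; real memory growth is rarely perfectly linear.

```python
from typing import Optional


def seconds_until_limit(
    prev_bytes: int,
    curr_bytes: int,
    interval_s: float,
    limit_bytes: int,
) -> Optional[float]:
    """Estimate the seconds left until usage reaches the limit.

    Uses a straight-line extrapolation from two samples taken
    `interval_s` seconds apart. Returns None if usage is not growing.
    """
    rate = (curr_bytes - prev_bytes) / interval_s  # bytes per second
    if rate <= 0:
        return None
    remaining = max(limit_bytes - curr_bytes, 0)
    return remaining / rate


MIB = 1024 * 1024

# Hypothetical pod: 1024 MiB limit, usage grew from 800 MiB to 900 MiB in 60 s.
eta = seconds_until_limit(800 * MIB, 900 * MIB, 60, 1024 * MIB)
```

With these assumed numbers, the pod would hit its limit in roughly 74 seconds, well inside a single 5-minute aggregation window, so a monitor that only sees 5-minute aggregates could miss the run-up entirely.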

The solution

This Op Pack includes a monitor that watches for memory usage reaching a set threshold. It then captures diagnostic data, pushes that data to a cloud storage service such as S3, and then appends the data to a ticket or sends a Slack message to a selected channel.
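The flow described above (threshold check, capture, upload, notify) can be sketched as follows. This is not Shoreline's implementation: `MemorySample`, `handle_sample`, and the three callables are hypothetical names, and the capture, upload, and notification steps are passed in as plain functions so that no particular storage or chat API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MemorySample:
    """One memory reading for a pod (hypothetical structure)."""
    pod: str
    used_bytes: int
    limit_bytes: int

    @property
    def ratio(self) -> float:
        # Fraction of the memory limit currently in use.
        return self.used_bytes / self.limit_bytes


def handle_sample(
    sample: MemorySample,
    threshold: float,
    capture: Callable[[str], bytes],
    upload: Callable[[str, bytes], str],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Run capture -> upload -> notify once usage crosses the threshold.

    Returns the location reported by `upload`, or None when the sample
    is below the threshold and nothing was done.
    """
    if sample.ratio < threshold:
        return None
    data = capture(sample.pod)                            # e.g. a heap dump
    location = upload(f"{sample.pod}-diagnostics", data)  # e.g. object storage
    notify(
        f"{sample.pod} at {sample.ratio:.0%} of its memory limit; "
        f"diagnostics: {location}"
    )
    return location


# Usage with stand-in functions in place of real integrations.
messages = []
result = handle_sample(
    MemorySample("web-1", used_bytes=950, limit_bytes=1000),
    threshold=0.9,
    capture=lambda pod: b"fake-dump",
    upload=lambda key, data: "s3://example-bucket/" + key,
    notify=messages.append,
)
```

Passing the three steps in as functions keeps the threshold logic testable on its own; a real deployment would replace the stand-ins with its storage client and its ticketing or Slack integration.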

Shoreline monitors the total bytes consumed versus the memory limit. For certain applications, the Op Pack can also monitor garbage collection activity. The Op Pack can also increase the memory limit, which can fix the issue in the moment or at least buy operators time to find other fixes.
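As a rough illustration of comparing bytes consumed against the limit, the sketch below parses the values a container exposes under cgroup v2 (`memory.current` and `memory.max`) and proposes a raised limit. How Shoreline collects these numbers is not stated here, so treat this as an assumption; the helper names, the 1.25 growth factor, and the optional ceiling are all hypothetical.

```python
from typing import Optional


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup v2 memory.max value; the literal "max" means no limit."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def usage_ratio(current_raw: str, max_raw: str) -> Optional[float]:
    """Return used/limit, or None when no finite limit is set."""
    limit = parse_limit(max_raw)
    if limit is None or limit <= 0:
        return None
    return int(current_raw.strip()) / limit


def proposed_limit(
    limit_bytes: int,
    factor: float = 1.25,                 # assumed growth factor
    ceiling_bytes: Optional[int] = None,  # assumed upper bound, if any
) -> int:
    """Suggest a larger memory limit, never lower than the current one."""
    new_limit = int(limit_bytes * factor)
    if ceiling_bytes is not None:
        new_limit = min(new_limit, ceiling_bytes)
    return max(new_limit, limit_bytes)


# Example values as they might appear in the cgroup files.
ratio = usage_ratio("524288000\n", "1073741824\n")  # 500 MiB used of 1 GiB
```

Inside a container on a cgroup v2 host, the two raw strings would typically be read from `/sys/fs/cgroup/memory.current` and `/sys/fs/cgroup/memory.max`. Applying the proposed limit is a separate step (for example, updating the pod's resource limits) and is deliberately left out of this sketch.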

Highlights

Customer experience impact: High (causes degraded service)
Occurrence frequency: High
Shoreline time to repair
Time to diagnose manually
Security
Cost impact
Time to repair manually
2-3 hours
High
