Java virtual machines (JVMs) often run into memory issues, usually because certain requests, payloads, or jobs consume more memory than anticipated. The Java garbage collector is robust, so the situation eventually resolves itself, but while it lasts, garbage collection takes priority and latency often spikes, leading to poor customer experiences. Permanently fixing this type of issue often requires heap dumps and garbage collection statistics that are only available while the issue is occurring. What makes the situation harder is that few people understand how garbage collection works, which makes the problem tricky to diagnose. SREs are frequently asked to capture this debug data, which can mean hours of SSH-ing into box after box trying to catch a JVM in the middle of the memory issue.
With Shoreline, customers can set an alarm that fires when heap size exceeds a chosen threshold. Once the alarm fires, a script runs jcmd, jstack, jstat, and jmap to capture a heap dump, thread dump, GC stats, and heap stats. The collected data is pushed to a cloud storage service, and then the JVM is restarted. All of this happens in seconds, keeping the impact on the customer experience as small as possible. It also saves the SRE hours of exploratory work and ensures that engineering has everything it needs to fix the root cause of the issue.
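
The collection step described above could look roughly like the following Python sketch. It is not Shoreline's own alarm or action syntax. The function names, file names, the 90% threshold, and the specific flags chosen (`jstack -l`, `jstat -gcutil`, `jmap -histo`) are illustrative assumptions, and the upload and restart steps are left out because they depend on the storage service and process manager in use.

```python
import subprocess
from pathlib import Path


def heap_exceeds_threshold(used_bytes, max_bytes, threshold=0.9):
    """Return True when heap usage is at or above `threshold` (a fraction of max heap).

    The 0.9 default is an arbitrary example value, not a recommendation.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return used_bytes / max_bytes >= threshold


def build_diagnostic_commands(pid, out_dir):
    """Map each (hypothetical) output file name to the JDK command that fills it."""
    # Use an absolute path: the target JVM writes the heap dump itself, so a
    # relative path would be resolved against that JVM's working directory.
    heap_dump = Path(out_dir).resolve() / f"heap-{pid}.hprof"
    return {
        # jcmd writes the .hprof file; its stdout is only a status message.
        f"jcmd-{pid}.txt": ["jcmd", str(pid), "GC.heap_dump", str(heap_dump)],
        # Thread dump, including lock information.
        f"threads-{pid}.txt": ["jstack", "-l", str(pid)],
        # One snapshot of garbage collection utilization.
        f"gc-{pid}.txt": ["jstat", "-gcutil", str(pid)],
        # Per-class histogram of objects on the heap.
        f"histo-{pid}.txt": ["jmap", "-histo", str(pid)],
    }


def collect(pid, out_dir):
    """Run each tool, saving its stdout to a file, and return the files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, cmd in build_diagnostic_commands(pid, out_dir).items():
        target = out_dir / name
        with target.open("w") as fh:
            # check=False: keep collecting even if one tool fails.
            subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT,
                           check=False, timeout=120)
        written.append(target)
    return written
```

Calling `collect(pid, "/var/tmp/jvm-diag")` with the process ID of the affected JVM (the directory is a placeholder) would leave the four text files and the heap dump in that directory, ready to be uploaded before the JVM is restarted. The tools generally need to run as the same user that owns the JVM process.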