DNS Lag

Trigger rolling restarts of the DNS servers when they are responding slowly and causing widespread system issues.

Major outage

The problem

Networking is complex with Kubernetes and often the most common problems and outages in a Kubernetes cluster come from DNS issues. CoreDNS, the default Kubernetes DNS service, can degrade in performance with too many calls to it causing massive latency. Once latency between the pod and CoreDNS reaches one second or more, it impacts both the customer and ultimately their SLA. However, most organizations merely monitor CoreDNS and continue to manually address the issue, causing unacceptable delays and potentially system outages. This issue is sometimes hard to diagnose because DNS issues have broad impact, and the underlying cause is often unclear. Services may be running fine, but can't communicate with each other.

The solution

The Shoreline CoreDNS Op Pack monitors metrics and automatically triggers an Action that restarts the CoreDNS pod once latency exceeds a configurable threshold. Shoreline can gather CoreDNS metrics as frequently as once per second, using multiple data points and different percentiles when deciding if CoreDNS resolves slowly or not. Shoreline focuses on DNS latency rather than overall measured latency because it helps disambiguate latency from the network versus latency from DNS resolution. This practice prevents false positives when clusters experience high network latency. The Alarm threshold is configured in milliseconds so that you can tightly control the latency tolerance. Once an issue is identified, Shoreline triggers rolling restarts of CoreDNS pods to prevent service outages.

In addition, the CoreDNS Op Pack offers the ability to automatically create a PagerDuty incident or Slack message when an Alarm or Action is triggered. This capability eliminates another manual step by automating the notification process for further root cause analysis.

Highlights

Customer experience impact
Can bring down an entire cluster
High
Occurrence frequency
Depends on # and size of clusters single digit clusters ~ quarterly double digit clusters ~ monthly
Medium
Shoreline time to repair
Shoreline fixes this with zero downtime
Low
Time to diagnose manually
1-4 hours
Medium
Security
Cost impact
Time to repair manually

Related Solutions