DNS Troubleshooting

Runbook to detect root causes of DNS Lag

Kubernetes

The problem

Networking in Kubernetes is complex, and many of the most common problems and outages in a cluster stem from DNS issues. CoreDNS, the default Kubernetes DNS service, can degrade under a heavy query load, causing severe latency. Once latency between a pod and CoreDNS reaches one second or more, it affects customers and ultimately the SLA. However, most organizations merely monitor CoreDNS and address the issue manually, which causes unacceptable delays and potentially system outages. The problem can be hard to diagnose because DNS issues have broad impact and the underlying cause is often unclear: services may be running fine yet unable to communicate with each other.
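To make the one-second figure concrete, here is a minimal sketch of how a lookup could be timed from inside a pod and compared against that threshold. It is not part of the Shoreline runbook; the names `time_lookup`, `is_degraded`, and `LATENCY_THRESHOLD_S` are hypothetical, and the sketch assumes the pod's resolver is pointed at CoreDNS, so that the standard library's `socket.getaddrinfo` exercises it.

```python
import socket
import time

# Assumption: the threshold comes from the text above (one second or more
# is customer-impacting). Tune it to your own SLA.
LATENCY_THRESHOLD_S = 1.0


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Return the wall-clock seconds one lookup took, or None if it failed.

    `resolver` is injectable so the timing logic can be exercised without
    a live DNS server; by default it is the system resolver.
    """
    start = time.perf_counter()
    try:
        resolver(hostname, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return time.perf_counter() - start


def is_degraded(latency_s, threshold_s=LATENCY_THRESHOLD_S):
    """Treat a failed lookup (None) or a slow lookup as degraded."""
    return latency_s is None or latency_s >= threshold_s
```

Calling `is_degraded(time_lookup("kubernetes.default.svc.cluster.local"))` from a pod would give a rough yes/no signal. A single sample is noisy, so in practice several lookups would be taken before drawing a conclusion.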

The solution

This Shoreline runbook picks up where the DNS Lag runbook leaves off. Click from the PagerDuty alert into this runbook to begin interactive diagnostics. If automatically restarting the Kubernetes CoreDNS pods doesn't bring DNS back online, this runbook helps you debug the issue and identify the root cause. Is it network saturation? Too many pods on the host? IP exhaustion? Or something else? Click through each cell in the runbook to find out. Once the root cause is identified, take action to rebalance the cluster by adding taints or tolerations, or by adjusting the host's capacity.
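The diagnostic questions above can be sketched as a simple checklist over per-host metrics. This is an illustration only, not the logic of the Shoreline runbook: the `NodeSnapshot` fields, the `suspected_causes` function, and the 90% saturation ratio are all assumptions made for the example, and how each metric is collected is left out.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical per-host metrics; every field name is an assumption."""
    pod_count: int
    max_pods: int                    # the host's configured pod limit
    ips_in_use: int
    ips_total: int                   # addresses available to pods on this host
    net_bytes_per_s: float
    net_capacity_bytes_per_s: float


def suspected_causes(node, saturation_ratio=0.9):
    """Return the candidate root causes that the snapshot points to.

    saturation_ratio is an illustrative cutoff, not a documented value.
    """
    causes = []
    if node.net_bytes_per_s >= saturation_ratio * node.net_capacity_bytes_per_s:
        causes.append("network saturation")
    if node.pod_count >= node.max_pods:
        causes.append("too many pods on the host")
    if node.ips_in_use >= node.ips_total:
        causes.append("IP exhaustion")
    # Nothing matched: the cause lies outside these three checks.
    return causes or ["something else"]
```

A host that has hit its pod limit but has spare addresses and bandwidth would yield only "too many pods on the host", which points toward rebalancing with taints or tolerations rather than adding network capacity.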

Highlights

Customer experience impact: High. Can bring down an entire cluster.
Occurrence frequency: Medium. Depends on the number and size of clusters: roughly quarterly with single-digit cluster counts, roughly monthly with double-digit cluster counts.
Shoreline time to repair: Low. Shoreline fixes this with zero downtime.
Time to diagnose manually: Medium. 1-4 hours.
Security
Cost impact
Time to repair manually
