DNS Troubleshooting

Runbook to detect root causes of DNS Lag

The problem

Kubernetes networking is complex, and DNS issues are among the most common causes of problems and outages in a Kubernetes cluster. CoreDNS, the default Kubernetes DNS service, can degrade under too many requests, causing severe latency. Once latency between a pod and CoreDNS reaches one second or more, customers feel the impact and SLAs are ultimately at risk. Yet most organizations merely monitor CoreDNS and address the issue manually, which causes unacceptable delays and potentially system outages. The issue can be hard to diagnose because DNS problems have broad impact and the underlying cause is often unclear: services may be running fine but unable to communicate with each other.
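The one-second threshold mentioned above can be checked with a small probe. The following is a minimal sketch, not part of the Shoreline runbook: the names time_lookup and is_lagging are hypothetical, and it assumes that timing a standard resolver call from inside a pod is a reasonable proxy for pod-to-CoreDNS latency.

```python
import socket
import time

# Threshold taken from the text: one second or more counts as DNS lag.
LATENCY_THRESHOLD_SECONDS = 1.0


def time_lookup(hostname, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Time a single DNS lookup.

    Returns (elapsed_seconds, succeeded). The resolver and clock are
    injectable so the function can be exercised without a network.
    """
    start = clock()
    try:
        resolver(hostname, None)
        succeeded = True
    except socket.gaierror:
        # Resolution failed; the elapsed time is still worth reporting.
        succeeded = False
    return clock() - start, succeeded


def is_lagging(latency_seconds, threshold=LATENCY_THRESHOLD_SECONDS):
    """Return True when the measured latency meets or exceeds the threshold."""
    return latency_seconds >= threshold
```

Run from inside a pod, time_lookup("kubernetes.default.svc.cluster.local") would time a lookup of the in-cluster API service name, and is_lagging applied to the first element of the result would flag the condition described above. A single sample is noisy, so a real check would likely repeat the probe several times.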

The solution

This Shoreline runbook picks up where the DNS Lag runbook leaves off. Click from the PagerDuty alert into this runbook to begin interactive diagnostics. If automatically restarting the Kubernetes CoreDNS pods doesn't bring DNS back online, this runbook helps you debug the issue and identify the root cause. Is it network saturation? Too many pods on the host? IP exhaustion? Or something else? Click through each cell in the runbook to find out. Once the root cause is identified, take action to rebalance the cluster by adding taints or tolerations, or by adjusting the host's capacity.
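The triage questions above can be pictured as a simple classification over per-node metrics. This is an illustrative sketch only, not how the Shoreline runbook is implemented: NodeSnapshot, its fields, and every threshold are hypothetical assumptions, and real values would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """Hypothetical per-node metrics; field names are illustrative."""
    name: str
    pod_count: int
    pod_capacity: int
    free_ips: int
    network_utilization: float  # fraction of link bandwidth in use, 0.0 to 1.0


def suspected_causes(node, max_network_utilization=0.9, max_pod_fill=0.9, min_free_ips=1):
    """Return the candidate root causes that the snapshot points to.

    All thresholds are assumed defaults, chosen only for illustration.
    """
    causes = []
    if node.network_utilization >= max_network_utilization:
        causes.append("network saturation")
    if node.pod_capacity > 0 and node.pod_count / node.pod_capacity >= max_pod_fill:
        causes.append("too many pods on the host")
    if node.free_ips < min_free_ips:
        causes.append("IP exhaustion")
    # Nothing matched: the cause lies outside the three checks above.
    return causes or ["something else"]
```

A node that is full and has no free addresses would be reported with both "too many pods on the host" and "IP exhaustion", which suggests rebalancing pods away from it or adjusting its capacity.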

Highlights

Customer experience impact

Can bring down an entire cluster

High

Occurrence frequency

Depends on the number and size of clusters: single-digit clusters ~ quarterly; double-digit clusters ~ monthly

Medium

Shoreline time to repair

Shoreline fixes this with zero downtime

Low

Time to diagnose manually

1-4 hours

Medium

Security

Cost impact

Time to repair manually
