Networking Issues

Network-related issues are often hard to diagnose and can lead to a very bad experience for customers.

The problem

Some network-related issues are very hard to diagnose because they don’t occur consistently across the entire network. In many situations, basic checks make it look as if 99% of the fleet is performing normally and there is only mild variability in network connectivity. In reality, a small number of nodes may no longer be able to connect to the network at all.

This can lead to a very bad experience for a small number of customers. These incidents are hard to diagnose because finding the few affected nodes is like searching for a needle in a haystack.
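To illustrate how a fleet-wide average can hide a handful of dead nodes, here is a minimal sketch. The fleet size, node names, and success rates are made-up numbers for illustration only, not measurements from any real system.

```python
from statistics import mean

# Hypothetical per-node request success rates for a 1,000-node fleet:
# 995 healthy nodes and 5 nodes that have lost connectivity entirely.
success_rate = {f"node-{i}": 0.999 for i in range(995)}
success_rate.update({f"node-{i}": 0.0 for i in range(995, 1000)})

# The aggregate view: a single number for the whole fleet.
fleet_average = mean(success_rate.values())

# The per-node view: list every node that is mostly failing.
failing = sorted(node for node, rate in success_rate.items() if rate < 0.5)

print(f"fleet average success rate: {fleet_average:.3f}")
print(f"nodes below 50% success: {failing}")
```

The average stays above 99% even though five nodes serve no successful requests at all, which is why a per-node breakdown is needed to find them.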

The larger the fleet, the more likely companies are to experience this type of incident.
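One way to see why fleet size matters is a simple back-of-the-envelope model. The sketch below assumes each node independently develops a network fault with the same probability in a given period. Both the independence assumption and the 0.1% figure are illustrative and do not come from Shoreline.

```python
def p_any_node_affected(n_nodes, p_node):
    """Probability that at least one of n_nodes has a fault,
    assuming independent faults with per-node probability p_node."""
    return 1.0 - (1.0 - p_node) ** n_nodes


# Hypothetical per-node fault probability of 0.1% per period.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} nodes: {p_any_node_affected(n, 0.001):.1%}")
```

Under these assumptions, a fault that is rare on any single node becomes close to certain somewhere in a fleet of thousands.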

The solution

Typically, Shoreline does not trigger an automated repair for this type of incident. Instead, Shoreline provides a series of diagnostics that help on-call teams pinpoint the specific network issue and the affected nodes more quickly. These diagnostics eliminate hours that operators would otherwise spend trying to uncover the issue manually. Here are the diagnostics run by Shoreline:

  1. Curl an HTTP endpoint in parallel across the fleet and return the status code. This checks whether each instance of your application can connect and authenticate to the services your system depends on.
  2. DNS lookup. This checks whether each instance of your application resolves domains to the same IP addresses. Sometimes a portion of the fleet has stale entries, which can cause requests to fail.
  3. Ping. This checks connectivity at the network layer, for example whether nodes in one region or availability zone can reach nodes in another. Sometimes there is no connectivity, sometimes latency is high, and other times packet loss is high.
  4. Measure the number of outbound requests to a specific port. This can help you detect connections to unexpected APIs or ports. Sometimes too many processes are running and generate an unexpected number of connections.
  5. Measure the number of inbound requests to a specific port. This can help you detect an unexpected number of connections or API calls from external sources.
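The first two diagnostics can be sketched in plain Python. This is not Shoreline's implementation or API. It is a minimal illustration that assumes each node can run a small probe and report the result to a central place. The function names `http_status`, `resolve`, and `minority_nodes`, the node names, and the IP addresses are all hypothetical.

```python
import socket
import urllib.error
import urllib.request
from collections import defaultdict


def http_status(url, timeout=5.0):
    """Diagnostic 1, run on a node: HTTP status code for url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status (e.g. 401, 503)
    except (urllib.error.URLError, OSError):
        return None  # no answer: DNS failure, refused connection, or timeout


def resolve(hostname):
    """Diagnostic 2, run on a node: sorted IPv4 addresses that hostname resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        return ()  # resolution failed on this node
    return tuple(sorted({info[4][0] for info in infos}))


def minority_nodes(results):
    """Group nodes by probe result and return (majority result, {other result: nodes})."""
    groups = defaultdict(list)
    for node, value in results.items():
        groups[value].append(node)
    majority = max(groups, key=lambda value: len(groups[value]))
    outliers = {v: sorted(nodes) for v, nodes in groups.items() if v != majority}
    return majority, outliers


# Hypothetical DNS results collected from four nodes; node-3 has a stale entry.
dns_results = {
    "node-1": ("10.0.0.5",),
    "node-2": ("10.0.0.5",),
    "node-3": ("10.0.0.9",),
    "node-4": ("10.0.0.5",),
}
majority, outliers = minority_nodes(dns_results)
print("majority answer:", majority)
print("nodes that disagree:", outliers)
```

Grouping nodes by their answer, rather than averaging, surfaces the few nodes that disagree with the rest of the fleet. The same grouping works for the status codes returned by `http_status`.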

Highlights

Customer experience impact: Potential hours of downtime (High)
Occurrence frequency: Weekly for fleets with many nodes (High)
Shoreline time to repair: 1-2 minutes to repair (Low)
Time to diagnose manually
Security
Cost impact
Time to repair manually: 1-6 manual hours to repair (High)
