Elastic Sharding Replica Management

Determine when your elastic search clusters have too few replicas per shard, and automatically kick off healing.

Major outage

The problem

When we first deploy ElasticSearch, it’s easy to craft the indexes we need and ensure they’re nicely balanced across all the nodes in the cluster. But as we ingest more data, Elastic can start to slow down. In time, the Elastic cluster is crawling along, and dependent applications may start going offline.

How did we get here? How do we get out? Should we wait until catastrophic failure before we take action?

With all the other Ops tickets in play, it’s easy to not even discover there’s a problem until it’s too late: the application is offline or users have moved on to competitors’ tools. Oops.

The solution

This Shoreline runbook scans the ElasticSearch indexes for low-performing queries. When slow indexes are identified, Shoreline automatically adjusts the replica count for those indices, allowing them to self-heal, and bringing the index back up to peak performance.

Highlights

Customer experience impact
Potential hours of downtime
High
Occurrence frequency
Until the root cause is identified
High
Shoreline time to repair
1-2 minutes
Low
Time to diagnose manually
Security
Cost impact
Time to repair manually
1-2 manual hours
High

Related Solutions