Fixing Clusters Marked as Health Down in Highly Loaded Environments

Overview

This article explains how to avoid unnecessary failovers and dataservers being wrongly marked as down when a cluster is under very high load.

 

Affected Versions

All ScaleArc versions, with MySQL clusters whose dataservers are very highly loaded

 

Requirements

Access to ScaleArc UI

 

Root Cause

In very highly loaded environments, dataservers that are part of a cluster may stop responding to "SHOW SLAVE STATUS" queries in a timely fashion. ScaleArc runs these queries continuously to validate replication status and health. When a dataserver fails to respond to them, ScaleArc considers it down/unhealthy, with the usual consequences: it stops sending traffic to that server if it is a Standby-Read server, or it may even trigger a failover if the server is the master Read+Write server.
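
To get a feel for how slowly a loaded dataserver answers, the query can be timed from a client. The following is a minimal sketch in Python, not ScaleArc's own health check: `measure_latencies` and `run_query` are hypothetical names, and `run_query` stands for any callable that executes "SHOW SLAVE STATUS" against the dataserver through whatever MySQL client library is in use.

```python
import time
from typing import Callable, List


def measure_latencies(run_query: Callable[[], object],
                      samples: int = 5,
                      pause_s: float = 0.0) -> List[float]:
    """Call run_query `samples` times and return the elapsed seconds of each call.

    run_query is an assumption: supply a callable that runs
    SHOW SLAVE STATUS on the dataserver you want to observe, e.g.
    lambda: cursor.execute("SHOW SLAVE STATUS") with a cursor from
    your MySQL client library.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")
    latencies = []
    for _ in range(samples):
        start = time.monotonic()          # monotonic clock is safe for intervals
        run_query()
        latencies.append(time.monotonic() - start)
        if pause_s > 0:
            time.sleep(pause_s)           # optional gap between samples
    return latencies
```

Sampling during peak load is the most useful, since that is when the health queries are most likely to be slow.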

 

Resolution

  1. Log in to the ScaleArc UI.
  2. In the Clusters section, click the Cluster Settings button for the cluster you want to modify.
  3. On the Server tab, change Health Check interval from 3 to 10 or 20, depending on the observed response time to "SHOW SLAVE STATUS" queries, as shown below. This slows down failover detection in favor of stable operation. However, ScaleArc still detects a server that is really down, or replication that is really broken, and acts accordingly.

 

Health_Check_interval.png
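
The choice between 3, 10, and 20 can be sketched as a small helper. This is only an illustration under stated assumptions: the article does not give the unit of the Health Check interval or an exact rule, so the code assumes the interval and the measured response times are both in seconds and picks the smallest candidate value that is larger than the slowest observed response. The name `suggest_interval` is hypothetical.

```python
from typing import Iterable, Sequence


def suggest_interval(latencies_s: Iterable[float],
                     candidates: Sequence[int] = (3, 10, 20)) -> int:
    """Suggest a Health Check interval from observed response times.

    Assumptions (not from ScaleArc documentation): both values are in
    seconds, and the interval should exceed the slowest observed
    response to SHOW SLAVE STATUS. Falls back to the largest candidate
    when even that one is exceeded.
    """
    observed = list(latencies_s)
    if not observed:
        raise ValueError("at least one latency sample is required")
    worst = max(observed)
    for value in sorted(candidates):
        if worst < value:
            return value
    return max(candidates)


# Example: responses of up to 4 seconds suggest moving from 3 to 10.
print(suggest_interval([0.8, 2.5, 4.0]))
```

If the slowest response exceeds even the largest value, the load on the dataserver itself probably needs attention, since a longer interval alone may not prevent false health-down events.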

 

Validation

If the change is successful, users should no longer get health-down alerts or failovers caused by slow health check responses. At a minimum, the number of such incidents should decrease.

 

Content Author: Miguel Molina
