Unsuccessful Cluster Failover due to 'No valid standby server present'

Overview

The HA service can fail, resulting in the lack of active resource sync between the two HA nodes, configured virtual IPs (VIPs) disappearing, and ultimately unsuccessful failover attempts with the following errors:

Failover for Cluster <cluster name> failed, Error verifying failover request: No valid standby server present

Navigate to SETTINGS > HA Settings on the ScaleArc dashboard to confirm that the HA Service is down.

HA_down.png

The following events will also be logged in the ScaleArc event logs on the UI as well as the /data/logs/services/alert_engine.log (Alert Engine log) file:

ScaleArc failed over to the standby instances with the following errors. <primary node hostname>
2020-10-12 09:05:42 ScaleArc recently encountered an issue that requires analysis. Please collect the relevant logs and send it to ScaleArc support.
2020-10-12 09:05:03 Peer node is standby. Potential disk issue. Please check <secondary node hostname>
2020-10-12 09:03:32 ScaleArc recently encountered an issue that requires analysis. Please collect the relevant logs and send it to ScaleArc support.
2020-10-09 13:30:29 Failover for Cluster bmirschp01my failed, Error verifying failover request: No valid standby server present
...

Further details on the root cause for the core dump alerts should be obtained by running the idblog_collector script and looking out for similar messages in the debug logs like the ones shown below from the /var/log/messages logfile:

Oct 12 09:01:13 cmysqlp03 cib[11187]:  error: couldn't open file /dev/shm/qb-cib_ro-request-11187-6621-14-header: Permission denied (13)
Oct 12 09:01:13 cmysqlp03 cib[11187]:  error: couldn't open file /var/run/qb-cib_ro-request-11187-6621-14-header: Permission denied (13)
Oct 12 09:01:13 cmysqlp03 cib[11187]:  error: couldn't create file for mmap
Oct 12 09:01:13 cmysqlp03 cib[11187]:  error: qb_rb_open:cib_ro-request-11187-6621-14: Permission denied (13)
Oct 12 09:01:13 cmysqlp03 cib[11187]:  error: shm connection FAILED: Permission denied (13)
Oct 12 09:01:13 cmysqlp03 cib[11187]:  error: Invalid IPC credentials (11187-6621-14).
Oct 12 09:01:13 cmysqlp03 python2.6: Found witness type: cluster
Oct 12 09:01:14 cmysqlp03 cib[11187]:  error: couldn't open file /dev/shm/qb-cib_ro-request-11187-6808-14-header: Permission denied (13)
Oct 12 09:01:14 cmysqlp03 cib[11187]:  error: couldn't open file /var/run/qb-cib_ro-request-11187-6808-14-header: Permission denied (13)
Oct 12 09:01:14 cmysqlp03 cib[11187]:  error: couldn't create file for mmap
Oct 12 09:01:14 cmysqlp03 cib[11187]:  error: qb_rb_open:cib_ro-request-11187-6808-14: Permission denied (13)
Oct 12 09:01:14 cmysqlp03 cib[11187]:  error: shm connection FAILED: Permission denied (13)
Oct 12 09:01:14 cmysqlp03 cib[11187]:  error: Invalid IPC credentials (11187-6808-14).

The list of configured Virtual IPs (VIPs) will be empty when you navigate to SETTINGS > Network Settings:

Empty_virtual_IPs.png

Solution

ScaleArc High Availability (HA) deployment provides uninterrupted operation that would otherwise occur when ScaleArc is running as standalone and became unavailable.

One of the appliances in the HA pair functions as the primary node while the other as the secondary node ready to take over in case the primary is unable to accept connections.

The primary ScaleArc node in a HA configuration will begin core dumping when it is unable to successfully failover to the secondary node anytime a failover attempt is triggered, either automatically or manually by an administrator.

/dev/shm/* permission errors in the above debug logs and the missing VIPs are a strong indicator that a 3rd party application or an external process is altering the default permissions for this directory which is 777 to more restrictive permissions in an attempt to tighten the security of the Linux system. Typical 3rd party applications that are known to alter file system permissions include security solutions, data protection, and database audit tools such as Imperva Database Security Agent.

Changing the permissions of this directory to anything other than 777 causes the HA service to fail with the reported alerts and an HA failover attempt to occur.

Some of the files in /dev/shm are owned by hacluster:haclient user and group and the HA processes will not have access to them if the permissions are changed from 777 hence the HA failover errors.

Rebooting the appliance restores the /dev/shm permissions back to 777 which allows the HA services to run properly but this may only be a temporary fix if the application that interferes with the directory permissions is not identified and stopped or configured to exclude /dev/shm from its actions.

The recommended permanent fix is to identify the application or service that is changing the permissions and apply exclusions for the /dev/shm directory.

Testing

Once the directory permissions are corrected, navigate to SETTINGS > HA Settings on the ScaleArc dashboard on both the primary and secondary nodes to verify that the HA Service is running and any configured VIPs are now present under SETTINGS > Network Settings.

HA Settings on Primary node:

HA_running_-_Primary.png

HA Settings on Secondary node:

HA_running_-_Secondary.png

Manual failover between both nodes should now succeed without the errors encountered before.

Back to top

Comments

0 comments

Please sign in to leave a comment.