Recovering Distributed DDNS Data Nodes and resetting the cluster - Adaptive Applications - BlueCat Gateway - 21.3

BlueCat Distributed DDNS Administration Guide


The following section outlines scenarios that might result in issues with the database cluster.

Single-node failure

In certain scenarios, a node might fail and lose membership with the cluster's Primary Component. This can occur in the event of hardware failure, software failure, a loss of network connectivity, or failure of state transfer.

To verify whether a node is part of the Primary Component, execute the following command on the node using any MySQL client:
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';

If the value returned is Primary, this indicates that the node is part of the Primary Component. If any other value is returned, the node is not part of the Primary Component and is non-operational.

In this scenario, the other nodes will continue to operate in the database cluster. When the non-operational node comes back online and can communicate with the other nodes within the cluster, it rejoins the cluster automatically and synchronizes data.
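When the check is scripted rather than run by hand, the interpretation of the result can be sketched as follows. This is a minimal illustration, not part of the product: the function name is hypothetical, and it assumes that your MySQL client returns the query result as (Variable_name, Value) pairs.

```python
def is_in_primary_component(status_rows):
    """Return True if the rows report wsrep_cluster_status = Primary."""
    # status_rows: (Variable_name, Value) pairs, the shape assumed here for
    # the result of SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'.
    status = dict(status_rows)
    # Any value other than "Primary" (or a missing row) means the node is
    # not part of the Primary Component and is non-operational.
    return status.get("wsrep_cluster_status") == "Primary"


# Hypothetical rows for illustration only.
print(is_in_primary_component([("wsrep_cluster_status", "Primary")]))      # True
print(is_in_primary_component([("wsrep_cluster_status", "non-Primary")]))  # False
```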

Multi-node failure

If at least two nodes remain in the cluster, the Primary Component is preserved and the database service continues to operate. If a multi-node failure leaves only a single node in the cluster, the cluster might lose its Primary Component, leaving all nodes in the cluster non-operational.
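The behavior described above is consistent with a majority rule: a group of nodes keeps the Primary Component only if it contains more than half of the nodes that were in the previous Primary Component. The sketch below models that rule under the assumption that all nodes carry equal weight. The function name is hypothetical and the model is a simplification, not the cluster's actual implementation.

```python
def keeps_primary_component(reachable_nodes, previous_component_size):
    """Simplified majority check, assuming every node has equal weight."""
    # A strict majority is required: exactly half is not enough.
    return 2 * reachable_nodes > previous_component_size


print(keeps_primary_component(2, 3))  # True: two of three nodes remain
print(keeps_primary_component(1, 3))  # False: one of three nodes remains
print(keeps_primary_component(1, 2))  # False: a two-node cluster that splits
```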

Loss of network connectivity

In the event of a loss of network connectivity, only one component in the cluster is chosen to be the Primary Component. The nodes in the Primary Component continue to operate as normal, and the other nodes become non-operational because they are no longer part of the Primary Component. In this scenario, verify your network connectivity and ensure that the nodes in the non-Primary Component can connect to the nodes in the Primary Component. Once the connection is restored, the nodes synchronize data and rejoin the Primary Component.

If you have a two-node cluster configured and one node loses connection to the other, both nodes become non-operational. Verify your network connectivity and ensure that the nodes can communicate. Once the connection is restored, the nodes synchronize data and the two nodes operate under the Primary Component.
Note: When repairing nodes in a multi-node cluster failure, the order in which nodes must be brought back online depends on the order in which connectivity was lost. For example, you have three nodes in a cluster: node 1, node 2, and node 3. If node 3 loses connectivity first, followed by node 2 and then node 1, you must restore the connection between node 1 and node 2 before attempting to connect node 3 to the other nodes in the cluster. Do not attempt to connect node 3 to any other node first, as this will not restore the Primary Component. Once node 1 and node 2 restore communication and synchronize data, you can then restore the connection between node 3 and the other two nodes in the cluster.
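The ordering rule in the note amounts to reconnecting nodes in the reverse of the order in which they lost connectivity. A minimal sketch, using a hypothetical helper name and the node labels from the example:

```python
def reconnect_order(loss_order):
    """Return nodes in the order to reconnect them.

    loss_order lists nodes from first to lose connectivity to last.
    Nodes that lost connectivity last are reconnected first.
    """
    return list(reversed(loss_order))


# Node 3 lost connectivity first, then node 2, then node 1.
print(reconnect_order(["node 3", "node 2", "node 1"]))
# ['node 1', 'node 2', 'node 3']
```

In the example, node 1 and node 2 are at the front of the list, so their connection is restored first, and node 3 is reconnected last.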

System crash or sudden shutdown

In the event of a system crash or sudden shutdown of nodes, as long as two nodes are still operational in the cluster, the database cluster operates as normal. You can restart the container on the failed nodes and upon restart, the nodes connect to the cluster. If multiple nodes shut down and a single node operates in the cluster, the single node becomes non-operational.

To restore the cluster, verify whether the remaining node is part of the Primary Component by executing the following command using any MySQL client:
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';

If the value returned is Primary, the node is part of the Primary Component. If any other value is returned, the node is not part of the Primary Component and is non-operational.

If the node is part of the Primary Component, you can continue to restart the container on the failed nodes. Upon restart, the nodes reconnect to the cluster and no further action is required.

If the node is not part of the Primary Component, you must bootstrap the node to form a new Primary Component. On the node, execute the following command using any MySQL client:
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
Note: You might require privileged permissions to execute this command.

Once you have executed the command, the node starts a new Primary Component. Once the containers on the failed nodes restart, the nodes connect to the new Primary Component and initiate a state snapshot transfer.

Note: If the failed nodes are restarted before the functional node bootstraps a new Primary Component, perform a docker restart on those nodes so that they join the new cluster.
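The recovery procedure in this section can be summarized as a small decision function. This is a sketch for illustration only: the function and constant names are hypothetical, the steps are returned as descriptive text rather than executed, and the status value is assumed to come from the wsrep_cluster_status query on the remaining node.

```python
# Statement that bootstraps a new Primary Component on the remaining node.
BOOTSTRAP_SQL = "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"


def recovery_steps(cluster_status):
    """Return the ordered recovery steps for the remaining node's status."""
    restart = "restart the container on each failed node"
    if cluster_status == "Primary":
        # The remaining node still holds the Primary Component, so the
        # failed nodes reconnect on restart and nothing else is needed.
        return [restart]
    # Otherwise the remaining node must form a new Primary Component
    # before the failed nodes are (re)started.
    return ["run on the remaining node: " + BOOTSTRAP_SQL, restart]


for step in recovery_steps("non-Primary"):
    print(step)
```

Listing the bootstrap step before the restart step reflects the note above: nodes that were restarted before the bootstrap need another restart to join the new cluster.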