The following section outlines scenarios that might result in issues with the database cluster.
Single-node failure
In certain scenarios, a node might fail and lose membership in the cluster's Primary Component. This can occur because of hardware failure, software failure, loss of network connectivity, or a failed state transfer. To check whether a node is part of the Primary Component, run the following query on that node:
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
If the value returned is Primary, this indicates that the node is part of the Primary Component. If any other value is returned, the node is not part of the Primary Component and is non-operational.
In this scenario, the other nodes will continue to operate in the database cluster. When the non-operational node comes back online and can communicate with the other nodes within the cluster, it rejoins the cluster automatically and synchronizes data.
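The status check described above can be scripted. The following is a minimal sketch, assuming the query output is captured as tab-separated text with a header row (the format the mysql client produces in batch mode). The function names are hypothetical, and the sample value "non-Primary" only illustrates a value other than Primary.

```python
def parse_status_output(output: str) -> dict:
    """Parse tab-separated 'Variable_name<TAB>Value' lines into a dict."""
    result = {}
    for line in output.splitlines():
        parts = line.strip().split("\t")
        # Skip the header row and any line that is not a name/value pair.
        if len(parts) == 2 and parts[0] != "Variable_name":
            result[parts[0]] = parts[1]
    return result


def is_in_primary_component(output: str) -> bool:
    """Return True only if wsrep_cluster_status is reported as Primary."""
    status = parse_status_output(output)
    return status.get("wsrep_cluster_status") == "Primary"


if __name__ == "__main__":
    sample = "Variable_name\tValue\nwsrep_cluster_status\tnon-Primary\n"
    if is_in_primary_component(sample):
        print("Node is part of the Primary Component.")
    else:
        print("Node is non-operational.")
```

Any value other than Primary, including missing output, is treated as non-operational, which mirrors the rule stated above.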
Multi-node failure
If at least two nodes remain operational in the cluster, the Primary Component is preserved and the database service can continue to operate. If a multi-node failure leaves only a single node in the cluster, the cluster might lose its Primary Component, leaving all nodes in the cluster non-operational.
Loss of network connectivity
In the event of a loss of network connectivity that splits the cluster, only one component is chosen to be the Primary Component. The nodes in that component continue to operate as normal, and the other nodes are non-operational because they are no longer part of the Primary Component. In this scenario, verify your network connectivity and ensure that the nodes in the non-Primary Component can connect to the nodes in the Primary Component. Once the connection is restored, the nodes synchronize data and rejoin the Primary Component.
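One way to verify connectivity from a non-operational node is a plain TCP connection test. The sketch below is illustrative only: the function name is hypothetical, the address in the usage example is a placeholder, and port 4567 is the default Galera replication port, which your deployment might not use.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address of a node in the Primary Component (assumption).
    primary_node = "10.0.0.11"
    # 4567 is the default Galera replication port (assumption for this setup).
    print(can_reach(primary_node, 4567))
```

A successful connection only shows that the port is reachable. It does not confirm that the node has rejoined the Primary Component, so the wsrep_cluster_status query remains the authoritative check.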
System crash or sudden shutdown
In the event of a system crash or sudden shutdown of nodes, the database cluster operates as normal as long as two nodes are still operational in the cluster. You can restart the container on the failed nodes and, upon restart, the nodes connect to the cluster. If multiple nodes shut down and only a single node remains in the cluster, that node might become non-operational. To check the state of the remaining node, run the following query on it:
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
If the value returned is Primary, the node is part of the Primary Component. If any other value is returned, the node is not part of the Primary Component and is non-operational.
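If the node is part of the Primary Component, you can proceed to restart the container on the failed nodes. Upon restart, the nodes reconnect to the cluster and no further action is required. If the node is not part of the Primary Component, run the following statement on it to bootstrap a new Primary Component: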
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
Once you have executed the command, the node starts a new Primary Component. Once the containers on the failed nodes restart, the nodes connect to the new Primary Component and initiate a state snapshot transfer.
Run docker restart on the container of each failed node for the nodes to join the new cluster.
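The recovery procedure in this section can be summarized as a small decision routine. The following is a sketch, not a definitive implementation: it only builds the list of commands to run and does not execute anything. The function name and the container names are hypothetical, and the status string is assumed to be the value returned by the wsrep_cluster_status query on the surviving node.

```python
from typing import List

# Statement from this section that starts a new Primary Component.
BOOTSTRAP_SQL = "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"


def recovery_plan(cluster_status: str, failed_containers: List[str]) -> List[str]:
    """Return the ordered commands to recover the cluster.

    cluster_status: value of wsrep_cluster_status on the surviving node.
    failed_containers: hypothetical names of the containers to restart.
    """
    steps = []
    # Bootstrap only when the surviving node is outside the Primary Component.
    if cluster_status != "Primary":
        steps.append(BOOTSTRAP_SQL)
    # In both cases, restart the container on each failed node.
    steps.extend(f"docker restart {name}" for name in failed_containers)
    return steps


if __name__ == "__main__":
    for step in recovery_plan("non-Primary", ["db-node-2", "db-node-3"]):
        print(step)
```

Returning the plan as text, rather than running it, lets an operator review the steps first. The SQL statement must be run on the surviving node, and the docker restart commands on the hosts of the failed nodes.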