Restarting the database cluster when some nodes fail - Adaptive Applications - BlueCat Gateway - 23.2.4

BlueCat Distributed DDNS Administration Guide


The following procedure describes how to bootstrap (restart) a Distributed DDNS database cluster when some nodes in the cluster stop functioning, but at least one node remains operational and is responding to client requests.

The bootstrapping process depends on whether the failed nodes stopped gracefully (such as by manually stopping them), or whether they involuntarily disappeared from the cluster.

Note: If only one node fails, it will rejoin the cluster when it is restarted.

Some nodes are gracefully stopped

Because the active nodes are still responding to client requests, you can start the stopped nodes one at a time. To do so, run the following Docker command on each stopped node:

docker start <node-container-name>

Wait for each node to start fully and sync with the rest of the cluster before you start the next node.
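The start-and-wait step can be sketched as a small shell helper. This is only an illustration, not part of the product: it assumes that the mariadb client inside the container can log in without extra options (as in the docker exec command used elsewhere in this procedure), and the function names, attempt count, and delay are hypothetical. It polls the standard Galera status variable wsrep_local_state_comment until the node reports Synced.

```shell
# Print the node's wsrep_local_state_comment value, for example
# "Synced" or "Donor/Desynced". $1 is the container name.
node_state() {
  docker exec "$1" mariadb -N -B \
    -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" 2>/dev/null |
    awk '{print $2}'
}

# Start the container, then poll until the node reports Synced.
# $1 = container name, $2 = max attempts (default 60),
# $3 = seconds between attempts (default 5). All defaults are arbitrary.
start_and_wait() {
  sw_container="$1"
  sw_attempts="${2:-60}"
  sw_delay="${3:-5}"
  docker start "$sw_container" >/dev/null || return 1
  sw_i=0
  while [ "$sw_i" -lt "$sw_attempts" ]; do
    if [ "$(node_state "$sw_container")" = "Synced" ]; then
      echo "$sw_container is Synced"
      return 0
    fi
    sleep "$sw_delay"
    sw_i=$((sw_i + 1))
  done
  echo "$sw_container did not reach Synced" >&2
  return 1
}
```

For example, running start_and_wait with the container name of a stopped node, and moving on to the next node only when it returns successfully, keeps the restarts sequential.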

Note: When a node joins the cluster, one of the existing nodes changes to the Donor/Desynced state, because it must provide the state transfer to at least the first joining node. The donor node can still be read from and written to, but its responses might be slower depending on the amount of data sent during the state transfer. If your system uses load balancers, make sure they do not flag the donor node as "inoperative"; if they have already done so, return the node to the pool.
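If your load balancer supports script-based health checks, the check could treat a donor node as healthy. The following is a minimal sketch under that assumption; the function names are hypothetical, and it again assumes the mariadb client in the container can log in without extra options.

```shell
# Succeed if the given wsrep_local_state_comment value should count as
# healthy. A donor node still serves reads and writes, so it is accepted.
state_is_healthy() {
  case "$1" in
    Synced|Donor/Desynced) return 0 ;;
    *) return 1 ;;
  esac
}

# Query the node in container $1 and apply the rule above.
node_is_healthy() {
  nh_state="$(docker exec "$1" mariadb -N -B \
    -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" 2>/dev/null |
    awk '{print $2}')"
  state_is_healthy "$nh_state"
}
```

The exit status of node_is_healthy (0 for healthy, non-zero otherwise) is what a health check would consume.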

Some nodes disappeared from the cluster

In this case, multiple nodes have failed or experienced power outages, and the remaining node or nodes are too few to form a quorum for the cluster.

When this happens, the cluster switches to non-primary mode. In this mode, MySQL refuses to serve SQL queries. The mysqld process on the remaining nodes continues to run and accept connections, but any statement that touches data fails, often with an error such as "ERROR 1047 (08S01): WSREP has not yet prepared node for application use".
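To confirm that a remaining node is in non-primary mode, you can inspect the standard Galera status variable wsrep_cluster_status, which reads Primary only when the node belongs to the primary component. The helper below is a sketch with hypothetical function names, assuming the mariadb client in the container can log in without extra options.

```shell
# Print the wsrep_cluster_status value reported by the node in container $1.
cluster_status() {
  docker exec "$1" mariadb -N -B \
    -e "SHOW STATUS LIKE 'wsrep_cluster_status'" 2>/dev/null |
    awk '{print $2}'
}

# Succeed only when the node belongs to the primary component.
# Any other value (or no answer at all) is treated as non-primary.
in_primary_component() {
  [ "$(cluster_status "$1")" = "Primary" ]
}
```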

Tip: Until the remaining nodes detect that they cannot reach the failed nodes, it might still be possible to read data from the remaining nodes. However, write requests are blocked.

There are two possible approaches:
  • Address the issues with the failed nodes, then restart those nodes. When the restarted nodes become available, the remaining nodes detect them and automatically re-form the cluster. As soon as there are enough nodes to form a quorum, the cluster responds to requests again.

  • If the failed nodes cannot be restarted immediately but you still want to restore service on the remaining nodes, you must manually bootstrap the primary component on those nodes. To do so, open the MariaDB client in the node's container, then run the SET GLOBAL statement at the client prompt:
    docker exec -it <node-container-name> mariadb
    SET GLOBAL wsrep_provider_options='pc.bootstrap=true';
    Note: This approach works only if the other nodes are not running. If you bootstrap while the other nodes are still running as an active cluster, you might end up with two separate clusters whose data increasingly diverges.
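The manual bootstrap can also be issued non-interactively. The sketch below is an illustration only: the function name is hypothetical, it assumes the mariadb client in the container can log in without extra options, and it cannot verify that the other nodes are really down, so the caution in the note above still applies. It skips the bootstrap when the node already reports Primary.

```shell
# Bootstrap the primary component on the node in container $1, but only
# if the node currently reports a cluster status other than Primary.
bootstrap_if_non_primary() {
  bp_status="$(docker exec "$1" mariadb -N -B \
    -e "SHOW STATUS LIKE 'wsrep_cluster_status'" 2>/dev/null |
    awk '{print $2}')"
  if [ "$bp_status" = "Primary" ]; then
    echo "$1 is already Primary; not bootstrapping"
    return 0
  fi
  # Same statement as in the procedure, passed with -e instead of typed
  # at the client prompt.
  docker exec "$1" mariadb \
    -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=true';"
}
```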