# Changing the configuration of the RabbitMQ nodes
## Problem

Invasive changes to the RabbitMQ cluster should be implemented via rolling upgrades.
## Approach

First, an additional node with the changed configuration is added to the cluster. Then the existing nodes are reconfigured one by one. Finally, the additional node is removed again.
Test-running the changed configuration on an additional node will prevent certain problems from being visible to production clients because the additional node is not used by them:
- configuration changes that prevent the RabbitMQ service from coming up will not result in a degraded cluster
- configuration changes that prevent the RabbitMQ node from joining the cluster will not result in lost messages
- configuration changes that prevent clients from connecting to the RabbitMQ node can be detected in isolation
## Alternative approaches

The additional node is not strictly necessary; any changes can also be performed directly on the production nodes. In that case, keep the following aspects in mind.
Any configuration change that prevents the RabbitMQ service from restarting will result in that node being offline for RabbitMQ clients. By default, clients round-robin through all configured nodes until they succeed, so in practice this should not be a problem.
While a node is offline, the cluster runs in a degraded state. As automatic updates of the other nodes are still enabled, the cluster could degrade further, e.g. because of nodes rebooting to install offline updates.
Any configuration change that results in a running RabbitMQ node that is not joined to the cluster (a split cluster) will let consuming clients connect successfully without being able to consume messages. More importantly, any messages published to that node will not be synchronized to the cluster and will effectively be lost.
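A split-off node can be spotted by comparing each node's view of the cluster. A minimal sketch with made-up node names, assuming the JSON formatter of recent `rabbitmqctl` versions:

```shell
# Hypothetical cluster views as reported by two nodes; in practice, capture
# `sudo rabbitmqctl cluster_status --formatter json` on every node instead.
view_a='{"running_nodes": ["rabbit@a", "rabbit@b", "rabbit@c"]}'
view_b='{"running_nodes": ["rabbit@b"]}'

# A node that reports fewer running nodes than the expected cluster size
# has split off from the cluster.
for view in "$view_a" "$view_b"; do
    echo "$view" | python3 -c '
import json, sys
running = json.load(sys.stdin)["running_nodes"]
print("ok" if len(running) >= 3 else "split cluster: " + ",".join(sorted(running)))
'
done
# prints "ok" for the first view and "split cluster: rabbit@b" for the second
```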
## Steps

1. Create a new RabbitMQ node in `us-east-1d` and join it to the cluster via

   ```shell
   PLAYBOOK_NAME=aws-arr-rabbitmq-instance \
       ./ansible_deploy.sh \
       --extra-vars '{"rabbitmq_instances": ["us-east-1d"]}' \
       --limit arr-cki-prod-rabbitmq-us-east-1d,localhost \
       --skip-tags qualys
   ```

   The Qualys cloud agent installation can be skipped via `--skip-tags qualys` if `dnf` is not available locally.

2. Log into the RabbitMQ management console. Ensure that there are now four nodes in the cluster. Determine the status of the newly joined node by searching for the node with the lowest uptime.

3. One by one, implement the changes on the RabbitMQ nodes. After restarting a changed node, ensure that it is healthy and completely synced to the cluster before continuing with the next node.

4. To remove the additional node in `us-east-1d`, drain it first via

   ```shell
   ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1d \
       sudo rabbitmq-upgrade drain
   ```

   In the AWS console, disable termination protection of the EC2 instance via `actions` -> `instance settings` -> `change termination protection`. Then terminate it via `instance state` -> `terminate instance`.

   To remove the node from the cluster, get the lists of cluster nodes via

   ```shell
   ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
       sudo rabbitmqctl cluster_status
   ```

   Compare the `Disk Nodes` and `Running Nodes` lists to find the name of the terminated additional node, and remove it from the cluster via

   ```shell
   ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
       sudo rabbitmqctl forget_cluster_node rabbit@ip-123-45-67-89.ec2.internal
   ```

5. In the RabbitMQ management console, ensure that only three nodes are shown.
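The manual comparison of the `Disk Nodes` and `Running Nodes` lists during node removal can also be scripted. A minimal sketch with made-up node names, assuming the JSON output of `rabbitmqctl cluster_status --formatter json`:

```shell
# Hypothetical `rabbitmqctl cluster_status --formatter json` output; in
# practice, pipe the real command via ansible_ssh.sh instead.
status='{"disk_nodes": ["rabbit@a", "rabbit@b", "rabbit@c", "rabbit@d"],
         "running_nodes": ["rabbit@a", "rabbit@b", "rabbit@c"]}'

# Disk nodes that are no longer running are the candidates for forget_cluster_node.
echo "$status" | python3 -c '
import json, sys
status = json.load(sys.stdin)
for node in sorted(set(status["disk_nodes"]) - set(status["running_nodes"])):
    print(node)
'
# prints "rabbit@d"
```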