# Changing the configuration of the RabbitMQ nodes
## Problem

Invasive changes to the RabbitMQ cluster should be implemented via rolling upgrades.
## Approach

First, an additional node with the changed configuration is added to the cluster. Then the existing nodes are reconfigured one by one. Finally, the additional node is removed again.
Test-running the changed configuration on an additional node will prevent certain problems from being visible to production clients because the additional node is not used by them:
- configuration changes that prevent the RabbitMQ service from coming up will not result in a degraded cluster
- configuration changes that prevent the RabbitMQ node from joining the cluster will not result in lost messages
- configuration changes that prevent clients from connecting to the RabbitMQ node can be detected in isolation
## Alternative approaches

The additional node is not strictly necessary; any changes can also be performed directly on the production nodes. In that case, keep the following aspects in mind.
Any configuration change that prevents the RabbitMQ service from restarting will result in that node being offline for RabbitMQ clients. By default, clients round-robin through all configured nodes until they succeed, so in practice this should not be a problem.
While a node is offline, the cluster runs in a degraded state. As automatic updates of the other nodes are still enabled, the cluster could degrade further, e.g. because of nodes rebooting to install offline updates.
Any configuration change that results in a running RabbitMQ node that is not joined to the cluster (a split cluster) will let consuming clients connect successfully without being able to consume messages. More importantly, any messages published to that node will not be synchronized to the cluster and will effectively be lost.
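A split-off node can be spotted by comparing each node's view of the cluster. A minimal sketch with made-up node names, assuming the JSON formatter of recent `rabbitmqctl` versions:

```shell
# Hypothetical cluster views as reported by two nodes; in practice, capture
# `sudo rabbitmqctl cluster_status --formatter json` on every node instead.
view_a='{"running_nodes": ["rabbit@a", "rabbit@b", "rabbit@c"]}'
view_b='{"running_nodes": ["rabbit@b"]}'

# A node that reports fewer running nodes than the expected cluster size
# has split off from the cluster.
for view in "$view_a" "$view_b"; do
    echo "$view" | python3 -c '
import json, sys
running = json.load(sys.stdin)["running_nodes"]
print("ok" if len(running) >= 3 else "split cluster: " + ",".join(sorted(running)))
'
done
# prints "ok" for the first view and "split cluster: rabbit@b" for the second
```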
## Steps

1. Create a new RabbitMQ node in `us-east-1d` and join it to the cluster via

   ```shell
   PLAYBOOK_NAME=aws-arr-rabbitmq-instance \
       ./ansible_deploy.sh \
       --extra-vars '{"rabbitmq_instances": ["us-east-1d"]}' \
       --limit arr-cki-prod-rabbitmq-us-east-1d,localhost \
       --skip-tags qualys
   ```

   The Qualys cloud agent installation can be skipped via `--skip-tags qualys` if `dnf` is not available locally.

2. Log into the RabbitMQ management console. Ensure that there are now four nodes in the cluster. Determine the status of the newly joined node by searching for the node with the lowest uptime.

3. One by one, implement the changes on the RabbitMQ nodes. After restarting a changed node, ensure that it is healthy and completely synced to the cluster before continuing with the next node.

4. To remove the additional node in `us-east-1d`, drain it first via

   ```shell
   ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1d \
       sudo rabbitmq-upgrade drain
   ```

   In the AWS console, disable termination protection of the EC2 instance via `actions` -> `instance settings` -> `change termination protection`. Then terminate it via `instance state` -> `terminate instance`.

   To remove the node from the cluster, get the lists of cluster nodes via

   ```shell
   ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
       sudo rabbitmqctl cluster_status
   ```

   Compare the `Disk Nodes` and `Running Nodes` lists to find the name of the terminated additional node, and remove it from the cluster via

   ```shell
   ansible_ssh.sh arr-cki-prod-rabbitmq-us-east-1a \
       sudo rabbitmqctl forget_cluster_node rabbit@ip-123-45-67-89.ec2.internal
   ```

5. In the RabbitMQ management console, ensure that only three nodes are shown.
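The manual comparison of the `Disk Nodes` and `Running Nodes` lists during node removal can also be scripted. A minimal sketch with made-up node names, assuming the JSON output of `rabbitmqctl cluster_status --formatter json`:

```shell
# Hypothetical `rabbitmqctl cluster_status --formatter json` output; in
# practice, pipe the real command via ansible_ssh.sh instead.
status='{"disk_nodes": ["rabbit@a", "rabbit@b", "rabbit@c", "rabbit@d"],
         "running_nodes": ["rabbit@a", "rabbit@b", "rabbit@c"]}'

# Disk nodes that are no longer running are the candidates for forget_cluster_node.
echo "$status" | python3 -c '
import json, sys
status = json.load(sys.stdin)
for node in sorted(set(status["disk_nodes"]) - set(status["running_nodes"])):
    print(node)
'
# prints "rabbit@d"
```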