RabbitMQ Mirrored Queues Gotchas
RabbitMQ Classic Mirrored Queues (also known as HA queues) are a feature of RabbitMQ that allows the replication of queue contents across multiple nodes within a RabbitMQ cluster. This replication ensures that if the node hosting the leader queue fails, one of the followers can take over, providing high availability.
Although we now have a better solution with Quorum Queues, and RabbitMQ Classic Mirrored Queues are deprecated, many systems still have them in use for various reasons. Thus, we would like to explore some unexpected behaviours of HA queues to help us better understand their usage and management.
Setup
In order to maintain a clear flow, we have placed the detailed RabbitMQ cluster setup in a separate section at the end of this blog post. If you are interested in testing the scenarios presented, feel free to check it out.
In brief, we have a cluster of 2 RabbitMQ 3.12.0 nodes running on Docker with the following specifications:
- rmq1: AMQP at amqp://localhost:5672, Management UI at http://localhost:15672
- rmq2: AMQP at amqp://localhost:5673, Management UI at http://localhost:15673
The default username and password for the Management UI are guest/guest.
We will also use RabbitMQ PerfTest 2.21.0, a load-testing tool for RabbitMQ.
HA queues are enabled by setting up a policy. This can be done via the Management UI or the command line as follows:
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic","ha-sync-batch-size": 2}'
The command sets:
- A policy named "ha-all" on all queues and exchanges (the ".*" is a regular expression that matches all names).
- The parameters of the policy: "ha-mode" is set to all, which means that all nodes in the cluster will keep a copy of the messages; "ha-sync-mode" is set to automatic, which means that the leader will automatically synchronise data to all followers; and "ha-sync-batch-size" is set to 2, which means that synchronisation will happen in batches of 2 messages.
With Docker:
docker exec rmq1 sh -c "rabbitmqctl set_policy ha-all \".*\" '{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\",\"ha-sync-batch-size\":2}'"
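To double-check that the policy has been applied, we can list the policies on either node; the output should show the ha-all policy with the pattern and definition we just set:
docker exec rmq1 rabbitmqctl list_policies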
Now, let’s test out potential issues that may arise with these HA queues.
HA queues with automatic synchronisation
If we take a closer look at our policy, it specifies the queue synchronisation as automatic: "ha-sync-mode": "automatic"
With automatic synchronisation, whenever a new follower is added, it automatically synchronises messages from the queue leader. Sounds convenient!
But watch out! Queue synchronisation is a blocking operation, meaning all queue operations are temporarily stopped. In simpler terms, messages cannot be published to (“routed” to, to be exact) or consumed from that queue. The queue appears to “freeze” until the synchronisation finishes.
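As a side note, you can check at any time whether a queue’s followers are fully synchronised by listing the relevant queue info items (slave_pids and synchronised_slave_pids are the fields classic mirrored queues expose for this):
docker exec rmq1 rabbitmqctl list_queues name messages slave_pids synchronised_slave_pids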
Let’s have a look at an example in which we publish 1M messages to an HA queue, then attach a slow consumer to it and restart one node.
We are going to have 10 producers, each sending 100,000 messages, resulting in a total of 1,000,000 messages. First, publish messages by:
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5672 \
--producers 10 \
--consumers 0 \
--queue haq \
--pmessages 100000 \
--auto-delete false \
--rate 10000
Check the queue is filled up at: http://localhost:15672/#/queues
When it’s ready, we can start the slow consumer which will process 10 msgs/s with a prefetch count of 10:
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5672 \
--producers 0 \
--consumers 1 \
--queue haq \
--auto-delete false \
--qos 10 \
--consumer-rate 10
Then restart rmq2 by:
docker exec rmq2 sh -c "rabbitmqctl stop_app && rabbitmqctl start_app"
Now, let’s observe the consumer log. We can see that there was an interruption of around 10 seconds in message consumption.
id: test-231551-035, time 24.002 s, received: 8.0 msg/s, min/median/75th/95th/99th consumer latency: 0/0/0/0/0 µs
==> id: test-231551-035, time 34.003 s, received: 2.0 msg/s, min/median/75th/95th/99th consumer latency: 0/0/0/0/0 µs
id: test-231551-035, time 35.005 s, received: 10.0 msg/s, min/median/75th/95th/99th consumer latency: 0/0/0/0/0 µs
Take a look at the RabbitMQ log by:
docker logs rmq1 --tail 50
We can see the following:
2024-05-29 16:16:08.383967+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 999085 messages to synchronise
2024-05-29 16:16:08.384229+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: batch size: 2
2024-05-29 16:16:08.384507+00:00 [info] <0.2856.0> Mirrored queue 'haq' in vhost '/': Synchronising: all followers already synced
2024-05-29 16:16:08.384842+00:00 [info] <0.2146.0> Mirrored queue 'haq' in vhost '/': Primary replica of queue <rabbit@rmq1.1716998554.2143.0> detected replica <rabbit@rmq2.1716998554.1842.0> to be down
2024-05-29 16:16:13.292562+00:00 [info] <0.458.0> rabbit on node rabbit@rmq2 down
2024-05-29 16:16:13.985677+00:00 [info] <0.458.0> rabbit on node rabbit@rmq2 up
2024-05-29 16:16:13.987708+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 999029 messages to synchronise
2024-05-29 16:16:13.987750+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: batch size: 2
2024-05-29 16:16:13.987965+00:00 [info] <0.2920.0> Mirrored queue 'haq' in vhost '/': Synchronising: followers [rabbit@rmq2] to sync
2024-05-29 16:16:14.988235+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 92716 messages
2024-05-29 16:16:15.990297+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 184090 messages
2024-05-29 16:16:16.990478+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 280350 messages
2024-05-29 16:16:18.002438+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 377604 messages
2024-05-29 16:16:19.002617+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 466818 messages
2024-05-29 16:16:20.003608+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 557604 messages
2024-05-29 16:16:21.003751+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 662630 messages
2024-05-29 16:16:22.003952+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 765894 messages
2024-05-29 16:16:23.004448+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 874010 messages
2024-05-29 16:16:24.013900+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: 967204 messages
2024-05-29 16:16:24.303840+00:00 [info] <0.2143.0> Mirrored queue 'haq' in vhost '/': Synchronising: complete
According to the log, RabbitMQ synchronised the newly restarted follower from 16:16:13 to 16:16:24, roughly 10 seconds. This matches the window during which the queue appeared to be frozen from the consumer’s perspective.
This is a simple demonstration to help us understand the automatic synchronisation mechanism and how it would affect the queue operation.
In production, of course, we should not set the "ha-sync-batch-size" that low.
"ha-sync-batch-size": 2
This setting determines the number of messages to be synchronised at a time, and its default is 4096. If we had not lowered this value, the queue synchronisation would not have taken that long. However, we need to consider that in a live system there could be hundreds, thousands, or even millions of mirrored queues holding large numbers of messages. This can lead to heavy synchronisation traffic and prolonged synchronisation, leaving queues blocked for extended periods.
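For a more production-like configuration, we could simply leave "ha-sync-batch-size" at its default (or set it explicitly, as in the sketch below) and tune it based on message size and network capacity:
docker exec rmq1 sh -c "rabbitmqctl set_policy ha-all \".*\" '{\"ha-mode\":\"all\",\"ha-sync-mode\":\"automatic\",\"ha-sync-batch-size\":4096}'"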
Automatic synchronisation is still our recommended setting for HA queues. We should also be aware of queue blocking during synchronisation so that we can account for it when designing our system and consumers. A key to a happy Rabbit is to keep queues short.
Quorum queues, on the other hand, can synchronise only the changes in the queue state, called the delta, across the nodes in the RabbitMQ cluster. This synchronisation happens asynchronously, improving the efficiency and reliability of data replication without impacting queue availability.
Auto-delete property for an HA queue
In RabbitMQ, the auto-delete property is a feature that allows queues to be automatically removed when their last consumer disconnects.
Mirroring an auto-delete queue can lead to an unexpected behaviour that actually breaks the auto-delete feature.
First, let’s remove our old queue by:
docker exec rmq2 sh -c "rabbitmqctl delete_queue haq"
Now we start a new producer which connects to rmq1 and publishes messages to an auto-delete queue by:
# producer at rmq1
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5672 \
--producers 1 \
--consumers 0 \
--rate 1 \
--queue haq \
--auto-delete true \
--disable-connection-recovery true
Check the queue is running at: http://localhost:15672/#/queues
Now, we start a consumer which connects to rmq2 and consumes messages from the above queue.
# consumer at rmq2
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5673 \
--producers 0 \
--consumers 1 \
--queue haq \
--disable-connection-recovery true
Then, it’s time to have some network disruption.
docker network disconnect rabbit rmq1
sleep 70
docker network connect rabbit rmq1
Take a look at the UI of both RabbitMQ nodes; we can see the following.
On rmq1, http://localhost:15672 shows that rmq2 is disconnected. No queue is left.

On rmq2, http://localhost:15673 shows that rmq1 is disconnected. Our queue is still running.

The reason is that when the network disruption occurred, from the perspective of rmq1, the consumer disconnected from the queue and the queue was deleted. However, from the perspective of rmq2, the consumer was still there and the queue remained active.
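To observe this split from the command line as well, we can check each node’s view of the cluster while the network is disconnected; each node’s list of running peers will differ during the disruption:
docker exec rmq1 rabbitmqctl cluster_status
docker exec rmq2 rabbitmqctl cluster_status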
But remember, the queue leader was initially located on rmq1, so the queue remaining on rmq2 is actually a follower that has been promoted. We can see this in the rmq2 log.
2024-05-29 16:55:03.183127+00:00 [info] <0.691.0> Mirrored queue 'haq' in vhost '/': Secondary replica of queue <rabbit@rmq2.1717001554.691.0> detected replica <rabbit@rmq1.1717001554.707.0> to be down
2024-05-29 16:55:03.183230+00:00 [info] <0.691.0> Mirrored queue 'haq' in vhost '/': Promoting mirror <rabbit@rmq2.1717001554.691.0> to leader
The consequence of this is that messages which had not been replicated before the network disconnection were lost (unless publisher confirms were in use).
If we switch our producer to rmq2, we can see the message flow back to normal.
# producer at rmq1 switched to rmq2
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5673 \
--producers 1 \
--consumers 0 \
--rate 1 \
--queue haq \
--auto-delete true \
--disable-connection-recovery true
In general, we do not recommend mirroring non-durable, exclusive, or auto-delete queues, given the nature of HA queues: HA queues aim to provide high availability, while these queue types usually exist to handle temporary workloads.
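If you want to audit which of your existing queues carry these properties before applying a mirroring policy, a quick check could look like this (adjust the info items to your needs):
docker exec rmq1 rabbitmqctl list_queues name durable auto_delete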
In RabbitMQ, quorum queues are designed with a focus on data safety, and therefore they do not support non-durable, exclusive, or auto-delete queues. This is a significant departure from classic mirrored queues, and it is done to ensure the safety of all messages. All quorum queues are durable, meaning messages survive broker restarts, while exclusive and auto-delete queues remain available only as non-mirrored classic queues. This design choice enhances reliability in RabbitMQ systems.
The consumer is notified about the fail-over
Let’s ask a question about the previous case: what if the consumer knew about the queue leader failure?
There is a feature which allows consumers to be notified when a queue leader fails over. This can be enabled by setting the x-cancel-on-ha-failover consumer argument to true.
Consumers will then receive a "basic.cancel" AMQP command from RabbitMQ, notifying them that the leader has failed. As a result, the queue will not be deleted, and consumption can resume once the new leader is promoted.
Before running the following test, let’s restart the cluster with:
docker compose restart
As usual, start a producer on rmq1:
# producer at rmq1
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5672 \
--producers 1 \
--consumers 0 \
--rate 1 \
--queue haq \
--auto-delete true \
--disable-connection-recovery true
Next, connect our new consumer to rmq2:
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5673 \
--producers 0 \
--consumers 1 \
--queue haq \
--consumer-args x-cancel-on-ha-failover=true
When the messages start flowing, disconnect rmq1 as we did in the previous example:
docker network disconnect rabbit rmq1
sleep 70
docker network connect rabbit rmq1
And move our producer to rmq2:
# producer at rmq1 switched to rmq2
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5673 \
--producers 1 \
--consumers 0 \
--rate 1 \
--queue haq \
--auto-delete true \
--disable-connection-recovery true
Observing our consumer log, we can see that the consumer stalled for a while (roughly 100 seconds in this run). Then, it received a cancel command from the broker. After that, it consumed messages again from the “switched” producer.
id: test-223035-661, time 1363.005 s, received: 1.00 msg/s, min/median/75th/95th/99th consumer latency: 0/0/0/0/0 µs
==> Consumer cancelled by broker for tag: amq.ctag-Hq2MkHDKh9sKVKEVCYIm-Q
==> id: test-223035-661, time 1467.005 s, received: 0.01 msg/s, min/median/75th/95th/99th consumer latency: 0/0/0/0/0 µs
id: test-223035-661, time 1468.001 s, received: 1.0 msg/s, min/median/75th/95th/99th consumer latency: 0/0/0/0/0 µs
In simple terms, the "x-cancel-on-ha-failover" feature provides additional information about a failover occurrence, allowing for appropriate actions to be taken, such as automatic subscription to the new leader.
The “x-cancel-on-ha-failover” argument is specific to classic mirrored queues. Quorum queues do not support this feature.
Delayed publisher confirms
Publisher confirms are a RabbitMQ feature designed to ensure reliable publishing. When publisher confirms are enabled on a channel, messages that the client publishes are confirmed asynchronously by the broker, meaning the broker has accepted them on the server side. For a mirrored (HA) queue, the confirm is sent only once the message has been accepted by all the queue followers.
But it comes with a trade-off: if RabbitMQ is under high load or the network is slow, publisher confirms will take longer to arrive.
The following test can help us measure the publisher confirm latency.
Let’s restart the cluster with:
docker compose restart
Now start a producer with publisher confirms enabled and a confirm window of one message:
# producer at rmq1
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5672 \
--producers 1 \
--consumers 0 \
--confirm 1 \
--rate 100 \
--queue haq \
--auto-delete false
Start a consumer so that messages do not overflow RabbitMQ:
# consumer at rmq2
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5673 \
--producers 0 \
--consumers 1 \
--queue haq \
--auto-delete false
Since the default net tick timeout is 60 seconds, we can create a network disruption shorter than this value and observe the publisher confirm latency.
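As an aside, if you want to confirm the tick time on your own cluster, it can be inspected on a node as a quick diagnostic using rabbitmqctl eval:
docker exec rmq1 rabbitmqctl eval 'net_kernel:get_net_ticktime().'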
docker network disconnect rabbit rmq2
sleep 10
docker network connect rabbit rmq2
From the producer’s output, we can see that there was a delay of more than 10 seconds while the producer was waiting for the confirm.
id: test-123043-769, time 4.002 s, sent: 92 msg/s, confirmed: 91 msg/s, nacked: 0 msg/s, min/median/75th/95th/99th confirm latency: 714/2051/2418/3872/6911 µs
==> id: test-123043-769, time 17.000 s, sent: 3.0 msg/s, confirmed: 3.0 msg/s, nacked: 0 msg/s, min/median/75th/95th/99th confirm latency: 305/436/1106/3100/13041623 µs
id: test-123043-769, time 18.005 s, sent: 166 msg/s, confirmed: 167 msg/s, nacked: 0 msg/s, min/median/75th/95th/99th confirm latency: 319/725/1468/2319/3405 µs
If we increase the network disconnection to ~30s, the producer even stopped because it had been waiting for confirms for too long.
id: test-135347-579, time 4.001 s, sent: 100 msg/s, confirmed: 100 msg/s, nacked: 0 msg/s, min/median/75th/95th/99th confirm latency: 550/2305/2879/3709/8026 µs
id: test-135347-579, time 5.005 s, sent: 1.00 msg/s, confirmed: 0 msg/s, nacked: 0 msg/s, min/median/75th/95th/99th confirm latency: 0/0/0/0/0 µs
==> test stopped (Waiting for publisher confirms for too long)
id: test-135347-579, sending rate avg: 11 msg/s
id: test-135347-579, receiving rate avg: 0 msg/s
id: test-135347-579, confirm latency min/median/75th/95th/99th 412/2238/2746/3731/6060 µs
Still, we highly recommend publisher confirms for message reliability. The confirm window does not need to be one message; it can vary depending on the producer’s requirements. Thorough testing is needed to find the optimal publisher confirm window for your system.
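For example, to let the producer keep up to 100 messages in flight before blocking on confirms, the earlier PerfTest run can be repeated with a larger confirm window (the value 100 is only an illustration, not a recommendation):
# producer at rmq1, confirm window of 100
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5672 \
--producers 1 \
--consumers 0 \
--confirm 100 \
--rate 100 \
--queue haq \
--auto-delete false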
When using quorum queues in a three-node setup, the system can tolerate network issues affecting a single node without causing any noticeable delay from the publisher’s perspective. This is because quorum queues use a distributed consensus algorithm to ensure that a majority of nodes (in this case, two out of three) agree on the state of the queue before confirming the receipt of a message to the publisher.
So, if one node experiences a network problem, the other two nodes can still reach a consensus and continue processing messages as usual. This results in no visible delay for the publisher, as their messages are still being accepted and processed by the remaining nodes. This makes RabbitMQ’s three-node setup with quorum queues a highly resilient choice for systems where maintaining a steady flow of messages is critical, even in the face of network issues.
Mirrored queue best practices
While RabbitMQ Classic Mirrored Queues provide high availability, they have limitations and are now deprecated in favour of Quorum Queues. Understanding their behaviour and proper configuration can help manage existing systems effectively. Transitioning to Quorum Queues should be the long-term goal for better performance and reliability.
If using mirrored queues is still necessary for your system, we recommend implementing the following settings for mirrored queues:
- Use the pause_minority cluster partition setting with an odd number of nodes in the cluster.
- Only cluster durable, non-exclusive, non-auto-delete queues.
- Mirror to all nodes.
- Use automatic synchronisation to minimise the chance of ending up with unsynchronised followers.
- Set the HA promotion to "always", promoting a follower even if there are no synchronised followers.
- Prepare to handle lost messages in the application logic.
We recommend the following mirroring policy for the above:
{"ha-mode":"all",
"ha-promote-on-failure":"always",
"ha-promote-on-shutdown":"always",
"ha-sync-mode":"automatic"}
Quorum queues
Quorum queues are now the recommended solution for high availability. They are resilient to network partitions and use the Raft protocol for leader election and message distribution, ensuring better performance and consistency.
Here are some key differences between quorum queues and mirrored queues:
- Reliability: Quorum queues are more reliable and predictable. New followers are replicated asynchronously in the background, so the queue remains available throughout.
- Maintenance: Quorum queues require less maintenance. They are designed to be safer and provide simpler, well-defined failure-handling semantics.
- Limitations: Quorum queues have some limitations and behavioural differences compared to mirrored queues. For instance, quorum queues do not support exclusive queues or message priority, and they handle network partitions differently.
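If you want to repeat any of the earlier experiments against a quorum queue for comparison, PerfTest can declare one for you. A minimal sketch, assuming the --quorum-queue flag available in recent PerfTest versions (it declares the test queue as a durable quorum queue; the queue name qq is just an example):
# producer and consumer on a quorum queue
java -jar perf-test-2.21.0.jar \
--uri amqp://localhost:5672 \
--producers 1 \
--consumers 1 \
--rate 100 \
--queue qq \
--quorum-queue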
Mirrored queues were deprecated in RabbitMQ version 3.9, with a formal announcement posted on August 21, 2021. They will be removed entirely in version 4.0.
RabbitMQ Deprecation Announcements for 4.0 | RabbitMQ
Quorum queues are a superior replacement for mirrored queues. They are safer, achieve higher throughput, and are more reliable and predictable. However, they are not 100% feature-compatible with classic mirrored queues, though they come close.
For more in-depth information on quorum queues, please visit
RabbitMQ Quorum Queues Explained – what you need to know (seventhstate.io)
Cluster setup
Below is the Docker compose file for the RabbitMQ cluster setup used in this blog.
# docker-compose.yml
version: "3.6"
services:
  rmq1: &rabbitmq
    image: rabbitmq:3.12.0-management
    hostname: rmq1
    container_name: 'rmq1'
    environment:
      RABBITMQ_ERLANG_COOKIE: rabbitmq
    ports:
      - 5672:5672
      - 15672:15672
    networks:
      - rabbit
  rmq2:
    <<: *rabbitmq
    hostname: rmq2
    container_name: 'rmq2'
    ports:
      - 5673:5672
      - 15673:15672
networks:
  rabbit:
    driver: bridge
    name: rabbit
To start the cluster, use:
docker compose up -d
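Note that the compose file only gives both nodes the same Erlang cookie; it does not join them into a cluster by itself. If your setup does not handle this another way, a minimal sketch of the join step (run once after both containers are up) is:
docker exec rmq2 sh -c "rabbitmqctl stop_app && rabbitmqctl join_cluster rabbit@rmq1 && rabbitmqctl start_app"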
To stop the cluster, use:
docker compose down
To restart the cluster, use:
docker compose restart

Migrating from mirrored queues to quorum queues can seem like a daunting task. If you’d like some guidance, talk to us about our RabbitMQ consultancy services.
Anh Nguyen
RabbitMQ Support Expert – Seventh State



