RepliQ: keep RabbitMQ steady across zones
Seventh State RepliQ keeps your cluster sane. It spreads queue copies across zones and, when needed, grows or trims them to a simple target: three copies, one per zone. Once that’s safe, it nudges leaders so no single AZ gets hammered. It keeps doing this as things move around. It works with any number of zones. (And no, you don’t need to babysit it.)
Background: Regions and Zones
Cloud providers organise compute into Regions (geographies). Each Region has multiple Availability Zones (AZs): separate data centres on independent power and network, linked with low‑latency, high‑bandwidth links. Treat a zone as a failure domain. Edge locations are CDN/POP sites; they aren’t where you run RabbitMQ.
Topology patterns
- Single‑region, multi‑zone — Default for most fleets; all nodes live in one Region, spread across zones.
- East/Central/West sites — Separate clusters per geography for latency; connect with federation/shovels.
- On‑prem racks/campuses — Treat racks or rooms as zones; same rules apply.

Note: zones have separate power and network, and are joined by low‑latency links.
Introduction
RabbitMQ across zones only works if queue copies are spread and stay that way. RepliQ keeps each quorum queue at three copies, one per zone. It grows a missing copy when needed, trims extras when they appear, and maintains that state as nodes and queues change. Once coverage is right, it balances leaders so no zone or node becomes a hotspot.
Who it’s for
Teams running RabbitMQ across multiple zones or racks who want zone‑loss tolerance and steady latency without manual rebalancing. Typical users: SRE/platform teams and systems like payments, IoT fan‑in, and chat backends where queues come and go daily.
What RepliQ guarantees
Spread: Three copies, one per zone. If a queue lands unevenly, RepliQ adds the missing copy, waits for it to sync, then removes extras. Works with any number of zones.
Safe moves: If the broker rejects a change, RepliQ backs off and retries later. Operations are idempotent; partial repairs are fine.
Leaders after coverage: Once copies are in the right zones, RepliQ redistributes leaders to remove hotspots. Replicas stay put.
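The spread rule above can be sketched in a few lines of Python. This is an illustration only, not RepliQ's code: the function name `plan_membership`, its inputs, and the tie-breaking by node name are all assumptions, and it assumes the cluster has at least three zones with a free node in each.

```python
from typing import Dict, List, Tuple


def plan_membership(
    members: List[str], node_zone: Dict[str, str], target: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (adds, removes) that bring one queue to one copy per zone.

    Hypothetical helper: adds are meant to be applied (and synced) first,
    removes only afterwards.
    """
    keep: List[str] = []
    covered = set()
    # Keep the first existing member found in each zone, up to the target.
    for node in members:
        zone = node_zone[node]
        if zone not in covered and len(keep) < target:
            keep.append(node)
            covered.add(zone)

    # Grow: pick one non-member node in each zone that has no copy yet.
    adds: List[str] = []
    for node in sorted(node_zone):
        if len(keep) + len(adds) >= target:
            break
        zone = node_zone[node]
        if node not in members and zone not in covered:
            adds.append(node)
            covered.add(zone)

    # Trim: anything that is a second copy in an already covered zone.
    removes = [node for node in members if node not in keep]
    return adds, removes


zones = {"a1": "za", "a2": "za", "b1": "zb", "c1": "zc"}
print(plan_membership(["a1", "a2", "b1"], zones))  # (['c1'], ['a2'])
```

Returning adds and removes separately mirrors the order described above: the missing copy goes in first, and the extra one comes out only after the new member has synced.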
Why multi‑AZ placement matters
When copies bunch up in one zone you’re one power cut away from trouble. New queues, autoscaling, and maintenance all shift the deck over time. RepliQ fixes the spread first—grow where you’re missing, trim where you have extras—then tidies leaders so load is even. The goal stays simple: one copy per zone, leaders not all in the same place.




Concrete patterns
Single-region, multi-AZ
Example: 9 nodes across us-east-1a/b/c. Set tolerated AZ failures to T=1, which gives a minimum replication factor of RF_min=3. RepliQ enforces per-AZ caps for replicas, so a full-AZ outage still leaves quorum. After coverage is healthy, it balances leaders.
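The arithmetic behind that example, as a small sketch. It assumes RF_min follows the usual majority-quorum relation RF_min = 2T + 1, where T is the number of AZ failures to tolerate; that matches T=1 giving 3, but the formula is an assumption here, not something RepliQ documents. The function names are made up for illustration.

```python
from typing import Dict


def rf_min(tolerated_az_failures: int) -> int:
    """Smallest replica count that keeps a majority after T zone losses
    (assumed relation: 2T + 1, with at most one copy per zone)."""
    return 2 * tolerated_az_failures + 1


def survives_az_loss(replicas_per_az: Dict[str, int], tolerated: int = 1) -> bool:
    """True if a majority of copies remains after losing the fullest zones."""
    total = sum(replicas_per_az.values())
    quorum = total // 2 + 1
    # Worst case: the zones holding the most copies are the ones that fail.
    worst_loss = sum(sorted(replicas_per_az.values(), reverse=True)[:tolerated])
    return total - worst_loss >= quorum


print(rf_min(1))  # 3
print(survives_az_loss({"us-east-1a": 1, "us-east-1b": 1, "us-east-1c": 1}))  # True
print(survives_az_loss({"us-east-1a": 2, "us-east-1b": 1}))  # False
```

The last call shows why a per-AZ cap matters: two of three copies in one zone means losing that zone loses quorum.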

East / Central / West footprint
Operate separate RabbitMQ clusters per site for locality. Link with federation or shovels. Run RepliQ inside each cluster to keep placement AZ-resilient locally.

On-prem racks or campuses
Treat racks or rooms as AZ labels (r1, r2, r3). RepliQ spreads replicas across those labels and chooses the least-crowded node within each AZ.
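A rough sketch of "least-crowded node within each AZ", using the rack labels above. How RepliQ measures crowding is not stated, so this assumes it means the number of replicas a node already hosts; `pick_nodes` and the tie-break by node name are illustrative choices.

```python
from collections import defaultdict
from typing import Dict


def pick_nodes(node_zone: Dict[str, str], replica_count: Dict[str, int]) -> Dict[str, str]:
    """For each zone label, pick the node hosting the fewest replicas.

    Ties are broken by node name so the result is deterministic.
    """
    by_zone = defaultdict(list)
    for node, zone in node_zone.items():
        by_zone[zone].append(node)
    return {
        zone: min(nodes, key=lambda n: (replica_count.get(n, 0), n))
        for zone, nodes in sorted(by_zone.items())
    }


racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3", "n5": "r3"}
load = {"n1": 5, "n2": 2, "n3": 1, "n4": 4, "n5": 4}
print(pick_nodes(racks, load))  # {'r1': 'n2', 'r2': 'n3', 'r3': 'n4'}
```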

How it works
RepliQ listens for changes and queues up work. On a short timer it works through batches: decide the right set of members, add, wait to sync, then remove extras. Default target is three copies, one per zone. Only after that looks good does it move leaders. If the broker says “not permitted,” the item goes back on the list and tries again. Zone names come from each node’s environment; you set the zone once per node and forget it.
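The add, sync, remove, retry loop can be sketched like this. Everything here is a stand-in: `FakeBroker`, `NotPermitted`, and `run_batch` are hypothetical names, the broker is an in-memory toy that rejects the first change it sees, and the sync wait is reduced to a comment.

```python
from collections import deque


class NotPermitted(Exception):
    """Stand-in for the broker rejecting a membership change."""


class FakeBroker:
    """In-memory toy broker; rejects the first change to simulate back-off."""

    def __init__(self, members):
        self.members = {queue: list(nodes) for queue, nodes in members.items()}
        self.rejections_left = 1

    def change(self, queue, op, node):
        if self.rejections_left > 0:
            self.rejections_left -= 1
            raise NotPermitted(f"{op} {node} on {queue}")
        # Idempotent: repeating an add or remove that already happened is a no-op.
        if op == "add" and node not in self.members[queue]:
            self.members[queue].append(node)
        elif op == "remove" and node in self.members[queue]:
            self.members[queue].remove(node)


def run_batch(broker, work, max_rounds=10):
    """Process (queue, adds, removes) items; rejected items go back on the list."""
    pending = deque(work)
    for _ in range(max_rounds):
        if not pending:
            break
        retry = deque()
        while pending:
            queue, adds, removes = pending.popleft()
            try:
                for node in adds:
                    broker.change(queue, "add", node)
                # A real run would wait here until the new member has synced.
                for node in removes:
                    broker.change(queue, "remove", node)
            except NotPermitted:
                retry.append((queue, adds, removes))
        pending = retry
    return list(pending)  # items still unresolved after max_rounds


broker = FakeBroker({"q1": ["a1", "a2", "b1"]})
leftover = run_batch(broker, [("q1", ["c1"], ["a2"])])
print(leftover, broker.members["q1"])  # [] ['a1', 'b1', 'c1']
```

Because each change is a no-op when already applied, a retried item can safely replay its whole add and remove list, which is what makes partial repairs harmless.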
Processing pipeline

Creating Two Quorum Queues


Two queues landed unevenly. RepliQ first adds the missing copy in the empty zone. You’ll briefly see four copies while it syncs.

Sync completes, then the extra copy is trimmed. We’re back to three copies, one per zone—no risk window.

Coverage is correct, so leader moves are enabled. One leader starts moving off the busy zone.

Move finishes. Leaders per zone are even. Copies didn’t change; clients keep flowing.
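The leader step in this walkthrough can be approximated with a greedy sketch: while the busiest zone has at least two more leaders than some zone where a queue already has a copy, hand that queue's leadership over. This is an assumed strategy with made-up names (`plan_leader_moves`), not RepliQ's actual algorithm. Note that it only changes who leads; the member lists are never touched.

```python
from collections import Counter
from typing import Dict, List, Tuple


def plan_leader_moves(
    leaders: Dict[str, str], members: Dict[str, List[str]], node_zone: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (queue, from_node, to_node) leader moves that even out zones."""
    leaders = dict(leaders)  # work on a copy
    zones = sorted(set(node_zone.values()))
    moves = []
    while True:
        load = Counter({zone: 0 for zone in zones})
        load.update(node_zone[node] for node in leaders.values())
        busiest = max(zones, key=lambda z: load[z])
        move = None
        for queue in sorted(leaders):
            if node_zone[leaders[queue]] != busiest:
                continue
            # Only move to an existing member whose zone stays lighter afterwards.
            options = [
                node for node in members[queue]
                if load[node_zone[node]] + 1 < load[busiest]
            ]
            if options:
                target = min(options, key=lambda n: (load[node_zone[n]], n))
                move = (queue, leaders[queue], target)
                break
        if move is None:
            return moves
        moves.append(move)
        leaders[move[0]] = move[2]


zones = {"a1": "za", "b1": "zb", "c1": "zc"}
members = {q: ["a1", "b1", "c1"] for q in ("q1", "q2", "q3")}
leaders = {"q1": "a1", "q2": "a1", "q3": "a1"}
print(plan_leader_moves(leaders, members, zones))
# [('q1', 'a1', 'b1'), ('q2', 'a1', 'c1')]
```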



Install in a minute
rabbitmq-plugins enable seventh_state_repliq
# Per node (set appropriately)
export RABBITMQ_LOCAL_AZ=us-east-1a
# Health
rabbitmqctl repliq_get_az_status
rabbitmqctl repliq_get_placement_health
# Balance leaders after coverage is healthy
rabbitmqctl repliq_balance_leaders --dry-run
rabbitmqctl repliq_balance_leaders
What to watch
Use two simple charts: replicas per AZ and leaders per AZ. After changes, both converge quickly to flat lines. Keep label cardinality low.
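If you build those two series yourself, the aggregation is tiny. A sketch, assuming you already have each queue's member list and leader from your own tooling; `per_az_counts` is a hypothetical helper, not a RepliQ API. Labelling by AZ only keeps cardinality at the number of zones rather than the number of queues or nodes.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_az_counts(
    members: Dict[str, List[str]], leaders: Dict[str, str], node_zone: Dict[str, str]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (replicas per AZ, leaders per AZ), keyed by zone label only."""
    replicas = Counter(node_zone[n] for nodes in members.values() for n in nodes)
    leads = Counter(node_zone[n] for n in leaders.values())
    return dict(replicas), dict(leads)


zones = {"a1": "za", "b1": "zb", "c1": "zc"}
members = {"q1": ["a1", "b1", "c1"], "q2": ["a1", "b1", "c1"]}
leaders = {"q1": "a1", "q2": "b1"}
print(per_az_counts(members, leaders, zones))
# ({'za': 2, 'zb': 2, 'zc': 2}, {'za': 1, 'zb': 1})
```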
Scope
Quorum queues only. Defaults target RF=3 and one‑per‑AZ coverage; you can adjust tolerated‑AZ‑failures if your policy requires it. Unlimited AZ count by design.
Frequently Asked Questions
How many AZs can it handle?
Unlimited by design. AZs are labels; placement rules and safety bounds generalize to any N.
What if RabbitMQ denies a change?
RepliQ respects safety errors, retries later, or performs partial repairs; operations are idempotent.
Does it add overhead?
Little. Work is batched and churn is bounded, with targets set to keep CPU and memory overhead low in large clusters.
“RepliQ maintains AZ‑resilient quorum placement and removes leader hotspots with minimal operator effort. Enable it, set AZs, check health, and let it work.
For a demo or install help, contact Seventh State.”
Thomas Bhatia | RabbitMQ Consultant




