DRS: Disaster Recovery & Data Salvage for RabbitMQ

When downtime isn’t an option, you need more than RabbitMQ clustering.

RabbitMQ is trusted to keep critical systems moving, but traditional clustering wasn’t designed for full data-centre loss or complex disaster recovery. If your platform goes dark, the real cost isn’t just technical. It’s:

lost revenue
lost customer trust
long nights trying to bring systems back.

That’s why we built DRS: Disaster Recovery & Data Salvage for RabbitMQ.

DRS keeps you online or gets you back fast, even if an entire site goes down and you’re left with a single surviving node.

It’s the safety net for teams that can’t compromise on uptime.

✨ DRS is available EXCLUSIVELY for Seventh State Support Customers ✨


Here’s how it works…

In today’s distributed systems landscape, ensuring high availability and robust disaster recovery is paramount. RabbitMQ, a popular and reliable messaging broker, has long been a cornerstone for applications requiring dependable message delivery. However, traditional RabbitMQ clustering has its limitations, especially when it comes to multi-data centre deployments and disaster recovery.

Enter the DRS: Disaster Recovery & Data Salvage RabbitMQ Plugin, a game-changing solution designed to extend RabbitMQ’s capabilities and enable safe clustering across two data centres.

Before we dive into the specifics of the DRS RabbitMQ Plugin, let’s clarify some key concepts that are fundamental to understanding system resilience.

Understanding Disaster Recovery and High Availability

Disaster recovery (DR) refers to the strategies and processes an organisation implements to quickly resume critical business functions after a disruptive event. This could be anything from a natural disaster to a cyber attack. The goal of DR is to minimise downtime and data loss.

High availability (HA) focuses on ensuring that a system remains operational and accessible for a high percentage of time, typically measured in “nines” (e.g., 99.999% uptime). HA systems are designed to eliminate single points of failure through redundancy and failover mechanisms. The goal is to minimise or eliminate planned and unplanned downtime.
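To make the “nines” concrete, the short Python sketch below converts an availability percentage into the downtime it permits per year. It assumes a 365-day year, and the function name is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

At 99.999%, the budget works out to a little over five minutes per year.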

Now that we’ve established these foundational concepts, let’s explore how the DRS RabbitMQ Plugin addresses these crucial aspects of system resilience.

Real-World Scenario: Solving Single Data Centre Outages with the DRS Plugin

Imagine you’re running a critical application that relies on a 3-node RabbitMQ cluster, all hosted in a single data centre. One day, disaster strikes: your data centre experiences a complete outage, rendering your entire RabbitMQ cluster inaccessible. This leads to significant downtime for your application, potentially causing data loss and impacting your business operations.

The Challenge

In this scenario, the primary issues are:

Lack of geographic redundancy

No immediate failover option

Potential for data loss during the outage

Enter the DRS RabbitMQ Plugin

The DRS RabbitMQ Plugin is designed precisely for situations like this. It extends RabbitMQ’s capabilities to support safe clustering across two data centres, providing a robust solution for recovering from catastrophic failures and allowing seamless transitions between data centres.

How the Plugin Solves the Problem

Here’s how you can use the DRS plugin to address this vulnerability:

Expand Your Infrastructure

Set up a secondary data centre with three additional RabbitMQ nodes. This gives you a total of six nodes across two geographically separated locations.

Configure the DRS Plugin

– Designate the three nodes in your original data centre as “active” nodes.

– Configure the three nodes in the secondary data centre as “passive” nodes.

– Enable data mirroring from active to passive nodes.

Normal Operation

During regular operation, your original 3-node cluster in the first data centre will handle all traffic, while the passive nodes in the secondary data centre act as hot standbys.

In Case of a Data Centre Outage

If your primary data centre becomes inaccessible, you can initiate the simple failover process:

– Promote the passive nodes in the secondary data centre to active status using the plugin’s promotion command.

– Redirect traffic to the newly activated nodes.
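The “redirect traffic” step usually comes down to pointing clients at a different set of broker addresses. The Python sketch below shows one way an application might select its endpoints from a single site setting. The hostnames, the site labels, and the `endpoints_for` helper are all hypothetical and are not part of the DRS plugin; promotion of the passive nodes still has to be carried out by an operator first.

```python
from typing import Dict, List

# Hypothetical broker addresses for each site (placeholders, not real hosts).
ENDPOINTS_BY_SITE: Dict[str, List[str]] = {
    "primary": [
        "rabbit1.dc1.example.com",
        "rabbit2.dc1.example.com",
        "rabbit3.dc1.example.com",
    ],
    "secondary": [
        "rabbit1.dc2.example.com",
        "rabbit2.dc2.example.com",
        "rabbit3.dc2.example.com",
    ],
}


def endpoints_for(serving_site: str) -> List[str]:
    """Return the broker addresses clients should use for the serving site."""
    if serving_site not in ENDPOINTS_BY_SITE:
        raise ValueError(f"unknown site: {serving_site!r}")
    return list(ENDPOINTS_BY_SITE[serving_site])


# Once the secondary site has been promoted, clients are switched over to it.
print(endpoints_for("secondary"))
```

Keeping the serving site in one setting means the switch, and any later fail-back, is a single configuration change rather than an edit in every client.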

After Resolving the Outage

Once the original data centre is back online, you can:

– Rebuild the cluster with the original nodes.

– Reconfigure them as passive nodes.

– Gradually shift traffic back if desired.

Key Benefits of This Approach

Disaster Recovery: You gain the ability to recover from a complete data centre outage.

Data Preservation: The passive nodes maintain a copy of your data, minimising potential data loss.

Controlled Failover: The manual promotion process allows for careful orchestration during critical events.

Flexibility: You can choose to keep the secondary site active or fail back to the original site once it’s restored.

Introducing the DRS RabbitMQ Plugin

The DRS RabbitMQ Plugin is an innovative solution designed to extend RabbitMQ’s capabilities, enabling safe clustering across two data centres. This plugin aims to provide a robust solution for recovering from catastrophic failures and allows operators to transition seamlessly between data centres in case of application failure.

Key Features and Functionalities

The plugin introduces several innovative features that set it apart from traditional RabbitMQ setups:

Active and Passive Nodes

The core concept revolves around designating nodes as either active or passive.

Active nodes operate similarly to traditional RabbitMQ nodes, handling queue leaders and traffic. They retain the same operational semantics as the open-source version, with queue leaders located on active nodes and cluster availability dependent on the majority of active nodes being available.

Passive nodes, on the other hand, serve as hot standby nodes. They mirror data from active nodes, ensuring data availability in case of failure. This setup allows for a more nuanced approach to high availability, where passive nodes can quickly take over if active nodes fail.

Manual Failover Process

One of the key design decisions in this plugin is the manual nature of failover. Promoting passive nodes to active ones is a deliberate process, requiring operator intervention. This approach avoids false positives and ensures data consistency, which is crucial in distributed systems where network partitions can be misinterpreted as node failures.

Flexible Deployment Scenarios

The plugin supports various configurations to suit different needs and environments. Common setups include 3 active + 3 passive nodes (recommended for optimal replication performance), 1 active + 1 passive, 2 active + 1 passive, and 3 active + 2 passive. This flexibility allows organisations to tailor their RabbitMQ setup to their specific requirements and infrastructure constraints.
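Since cluster availability depends on a majority of the active nodes, the number of active-node failures each layout can absorb follows from simple arithmetic. The Python sketch below works this out for the configurations listed above, assuming “majority” means a strict majority (more than half) of the active nodes. The function name is illustrative.

```python
def tolerated_active_failures(active_nodes: int) -> int:
    """Active nodes that can fail while a strict majority of them stays up."""
    majority = active_nodes // 2 + 1
    return active_nodes - majority


# (active, passive) pairs taken from the configurations named in the text.
TOPOLOGIES = [(3, 3), (1, 1), (2, 1), (3, 2)]

for active, passive in TOPOLOGIES:
    spare = tolerated_active_failures(active)
    print(f"{active} active + {passive} passive: tolerates {spare} active-node failure(s)")
```

Under that assumption, layouts with one or two active nodes cannot lose an active node without losing the majority, while three active nodes can lose one.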

Custom Partition Handling

The plugin implements a sophisticated “pause minority” logic to handle network partitions. This approach ensures that the active side remains operational while passive nodes restart and synchronise, providing a balance between availability and data consistency.
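The plugin’s actual partition-handling logic is not spelled out here, so the following Python sketch only illustrates the general idea under a stated assumption: the side of a partition that still holds a strict majority of the active nodes keeps running, and every other side pauses. The node names and the `surviving_side` helper are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def surviving_side(roles: Dict[str, str], sides: Iterable[Set[str]]) -> Optional[Set[str]]:
    """Return the partition side holding a strict majority of active nodes, if any."""
    active = {node for node, role in roles.items() if role == "active"}
    for side in sides:
        # Assumption: only active nodes count towards the majority.
        if len(side & active) * 2 > len(active):
            return side
    return None  # no side has a majority, so every side would pause


# Hypothetical 3 active + 3 passive cluster, split along the site boundary.
roles = {"a1": "active", "a2": "active", "a3": "active",
         "p1": "passive", "p2": "passive", "p3": "passive"}
site_split = [{"a1", "a2", "a3"}, {"p1", "p2", "p3"}]

print(sorted(surviving_side(roles, site_split)))
```

In this model a clean split between the two sites leaves the active site serving traffic, which matches the behaviour described above.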

Publish Confirms

For applications requiring high resiliency, the plugin requires the use of publish confirms. With confirms enabled, a message is acknowledged to the publisher only once it has been safely replicated. A confirm does not, however, guarantee that the message has reached the second site.
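On the publishing side, the practical consequence is that a message should be treated as undelivered until the broker confirms it. The Python sketch below shows a client-agnostic retry loop built on that rule. `publish_until_confirmed` and the `publish_once` callable are hypothetical names; in a real application `publish_once` would wrap the client library’s confirm-aware publish call (for example, a publish made with the `pika` client on a channel after `confirm_delivery()` has been called) and return `True` only on a positive confirm.

```python
import time
from typing import Callable


def publish_until_confirmed(
    publish_once: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 0.5,
) -> bool:
    """Call publish_once until it reports a broker confirm or attempts run out."""
    for attempt in range(1, attempts + 1):
        if publish_once():
            return True  # broker confirmed the message
        if attempt < attempts:
            time.sleep(delay_seconds)  # brief pause before retrying
    return False  # never confirmed: treat the message as not safely stored


# Usage with a stand-in publisher that is only confirmed on the third try.
outcomes = iter([False, False, True])
print(publish_until_confirmed(lambda: next(outcomes), delay_seconds=0))
```

Retrying an unconfirmed publish can produce duplicates, so consumers may need to tolerate receiving the same message more than once.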

Queue Management

The plugin recommends pre-creating queues on active nodes to avoid timeouts and undefined behaviour on passive nodes. This proactive approach to queue management helps maintain system stability and predictability.

Disaster Recovery and Business Continuity with the DRS Plugin

The DRS Plugin enhances RabbitMQ’s disaster recovery capabilities by providing:

Cross-Data Centre Resilience

By enabling clustering across two data centres, the plugin significantly improves the system’s ability to recover from site-wide failures.

Minimal Data Loss

The hot standby nature of passive nodes ensures that data is continuously mirrored, minimising potential data loss during failover.

Controlled Failover

The manual failover process allows for careful consideration and orchestration during critical events, reducing the risk of unnecessary or premature failovers.

Flexible Recovery Options

The plugin supports various recovery scenarios, from single node failures to complete data centre outages, providing adaptability in different disaster situations.

High Availability Aspects

While primarily focused on disaster recovery, the DRS Plugin also contributes to high availability by:

Eliminating Single Points of Failure

The active-passive node setup ensures that there’s always a backup ready to take over.

Maintaining Data Consistency

Continuous data mirroring between active and passive nodes helps maintain data integrity and consistency.

Supporting Business Continuity

By enabling quick failover between data centres, the plugin helps maintain service availability even during significant disruptions.

Monitoring and Troubleshooting

Effective monitoring is crucial for maintaining a healthy distributed system. The DRS RabbitMQ Plugin provides several tools for this purpose:

Node Status Monitoring

Operators can use RabbitMQ CLI commands to check the status of nodes, including their active/passive roles and clustering connections.

Queue Status Verification

The plugin offers a dedicated interface for verifying queue statuses, ensuring that queues are correctly configured across the cluster.

Configuration Validation

Built-in validation commands help operators ensure that the roles of nodes match the expected configuration, preventing misconfigurations that could lead to instability.
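The validation commands themselves are not shown here. As an illustration of the kind of check involved, the Python sketch below compares an expected role map with the roles actually observed on the cluster. The node names, the role strings, and the `role_mismatches` helper are hypothetical, and how the observed roles are collected is left out.

```python
from typing import Dict, Optional, Tuple


def role_mismatches(
    expected: Dict[str, str],
    observed: Dict[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each misconfigured node to its (expected role, observed role)."""
    mismatches = {}
    for node, role in expected.items():
        actual = observed.get(node)  # None when the node was not seen at all
        if actual != role:
            mismatches[node] = (role, actual)
    return mismatches


# Hypothetical cluster: one node reports the wrong role and one is missing.
expected = {"rabbit@dc1-a": "active", "rabbit@dc1-b": "active", "rabbit@dc2-a": "passive"}
observed = {"rabbit@dc1-a": "active", "rabbit@dc1-b": "passive"}

for node, (want, got) in role_mismatches(expected, observed).items():
    print(f"{node}: expected {want}, observed {got}")
```

Running a check like this after every configuration change, and before any planned failover, catches role drift while it is still cheap to fix.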

Conclusion

The DRS RabbitMQ Plugin represents a significant advancement in RabbitMQ’s disaster recovery and high availability capabilities. By introducing the concept of active and passive nodes across data centres, it provides a robust solution for organisations requiring high levels of data safety and operational continuity.

While the plugin introduces some complexity in terms of management and failover procedures, the benefits in terms of reliability and disaster recovery capabilities make it a compelling choice for mission-critical messaging systems. As with any advanced distributed system component, proper planning, testing, and operational procedures are essential for leveraging the full potential of this plugin.

Thomas Bhatia - RabbitMQ Consultant, Seventh State

“By incorporating this plugin into a comprehensive business continuity plan, organisations can significantly enhance their ability to respond to and recover from various disruptive events, ensuring that their RabbitMQ infrastructure remains resilient and reliable even in the face of significant challenges.

The DRS Plugin is exclusively available to our RabbitMQ Support customers.”
