I have the following situation:
Two times a day for about 1h we receive a huge inflow in messages which are currently running through RabbitMQ. The current Rabbit cluster with 3 nodes can't handle the spikes, otherwise runs smoothly. It's currently setup on pure EC2 instances. The instance type is currenty t3.medium, which is very low, unless for the other 22h per day, where we receive ~5 msg/s. It's also setup currently has ha-mode=all.
After a rather lengthy and revealing read in the rabbitmq docs, I decided to just try and setup an ECS EC2 Cluster and scale out when cpu load rises. So, create a service on it and add that service to the service discovery. For example
discovery.rabbitmq. If there are three instances then all of them would run on the same name, but it would resolve to all three IPs. Joining the cluster would work based on this:
That would be the rabbitmq.conf part:
cluster_formation.peer_discovery_backend = dns # the backend can also be specified using its module name # cluster_formation.peer_discovery_backend = rabbit_peer_discovery_dns cluster_formation.dns.hostname = discovery.rabbitmq
I use a policy ha_mode=exact with 2 replicas.
Our exchanges and queues are created manually upfront for reasons I cannot discuss any further, but that's a given. They can't be removed and they won't be re-created on the fly. We have 3 exchanges with each 4 queues.
So, the idea: during times of high load - add more instances, during times of no load, run with three instances (or even less).
The setup with scale-out/in works fine, until I started to use the benchmarking tool and discovered that queues are always created on one single node which becomes the queue master. Which is fine considering the benchmarking tool is connected to one single node. Problem is, after scale-in/out, also our manually created queues are not moved to other nodes. This is also in line with what I read on the rabbit 3.8 release page:
One of the pain points of performing a rolling upgrade to the servers of a RabbitMQ cluster was that queue masters would end up concentrated on one or two servers. The new rebalance command will automatically rebalance masters across the cluster.
Here's the problems I ran into, I'm seeking some advise:
If I interpret the docs correctly, scaling out wouldn't help at all, because those nodes would sit there idling until someone would manually call
rabbitmq-queues rebalance all.
What would be the preferred way of scaling out?