Allow a previously reset node to rejoin its original cluster (backport #13643) (backport #13667) #13669

mergify · 2025-04-01T17:23:40Z

If a cluster member has its local state wiped for any reason, it has a hard time rejoining the cluster: when Mnesia is used, the remaining cluster members still consider the node a member and reject the join request.

Proposed Changes

Mnesia: on failure due to 'already a member', ask to leave the cluster first, then retry the join.
Khepri: no-op. Khepri is already less strict, and rabbit_khepri:can_join would accept a join request from a node that is already a member.
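
The two behaviours above can be illustrated with a toy model. This is a minimal sketch, not RabbitMQ's implementation: `ToyCluster`, `AlreadyMember`, `strict`, and `join_with_rejoin` are hypothetical names, and the `strict` flag merely stands in for the Mnesia versus Khepri difference described above.

```python
class AlreadyMember(Exception):
    """Raised by a strict cluster when a known node asks to join again."""


class ToyCluster:
    """Hypothetical stand-in for the membership view held by the remaining nodes."""

    def __init__(self, members, strict):
        self.members = set(members)
        self.strict = strict  # True models the strict (Mnesia-like) behaviour

    def join(self, node):
        # A strict cluster rejects a node it already lists as a member.
        if self.strict and node in self.members:
            raise AlreadyMember(node)
        self.members.add(node)

    def forget(self, node):
        # Models the node asking to leave the cluster.
        self.members.discard(node)


def join_with_rejoin(cluster, node):
    """Join; on 'already a member', leave first and retry once."""
    try:
        cluster.join(node)
        return "joined"
    except AlreadyMember:
        cluster.forget(node)
        cluster.join(node)
        return "rejoined"
```

In this model a wiped node `"c"` that the strict cluster still lists as a member ends up rejoined after one leave-and-retry, while the lenient cluster accepts the first request unchanged.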

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

Bug fix (non-breaking change which fixes issue #NNNN)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause an observable behavior change in existing systems)
Documentation improvements (corrections, new content, etc)
Cosmetic change (whitespace, formatting, etc)
Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

I have read the CONTRIBUTING.md document
I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
I have added tests that prove my fix is effective or that my feature works
All tests pass locally with my changes
If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

I would like early feedback here on whether this naive approach is even OK, whether there should be a limited number of retries, and whether the logic should live in rabbit_mnesia or in rabbit_db_cluster.
It feels a bit wonky that a function called can_join_cluster would also try to leave a cluster and try again, so perhaps it would be better if rabbit_db_cluster:join initiated the leave and retry request instead?
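
One way to picture the alternative raised here is to keep the check free of side effects and let the join step own a bounded leave-and-retry loop. The sketch below only models the question: `can_join`, `join`, and `max_retries` are hypothetical names, not the functions in rabbit_mnesia or rabbit_db_cluster.

```python
def can_join(members, node):
    """Pure check: reports the problem, never changes membership."""
    return "already_member" if node in members else "ok"


def join(members, node, max_retries=1):
    """Join step that owns the side effects: leave, then retry, at most max_retries times."""
    for attempt in range(max_retries + 1):
        if can_join(members, node) == "ok":
            members.add(node)
            return ("ok", attempt)
        if attempt < max_retries:
            # Stand-in for asking the cluster to forget the node before retrying.
            members.discard(node)
    return ("error", "already_member")
```

With `max_retries=0` the function degenerates to the strict behaviour and returns the error, which makes the retry limit an explicit choice at the call site rather than something hidden inside the check.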

This is an automatic backport of pull request #13643 done by [Mergify](https://mergify.com).

This is an automatic backport of pull request #13667 done by [Mergify](https://mergify.com).

…onsider node a member. Khepri: no-op. Khepri is less strict already, and rabbit_khepri:can_join would accept a join request from a node that is already a member (cherry picked from commit dd49cbe) (cherry picked from commit 6d464f9)

SimonUnge and others added 6 commits April 1, 2025 17:23

Fix dialyzer issue.

cf7807b

(cherry picked from commit 9ba545c) (cherry picked from commit 6592ebd)

Return the exception

844fb7b

(cherry picked from commit e1f2865) (cherry picked from commit 73742e4)

Dont handle the exception just let it out there

42c16de

(cherry picked from commit cdeabe2) (cherry picked from commit ae171b5)

Update spec, noconnection is also a possible error

4834e75

(cherry picked from commit 36eb6ca) (cherry picked from commit d809dff)

Naming #13643

6d45ee8

(cherry picked from commit e6bc6a4) (cherry picked from commit b0eaa57)

michaelklishin added this to the 4.0.8 milestone Apr 1, 2025

michaelklishin merged commit 8e998c4 into v4.0.x Apr 1, 2025
270 checks passed

michaelklishin deleted the mergify/bp/v4.0.x/pr-13667 branch April 1, 2025 19:23
