[FIXED] Server peer re-add after peer-remove #6815
Conversation
In general I think this is ok. Trying to think through: if a peer keeps trying to come back and we do not want it to, the time threshold may be too low?
server/raft.go
Outdated
 }
-n.removed[peer] = struct{}{}
+n.removed[peer] = time.Now().UnixNano()
Could use new access time here.
That feels like a pre-optimization, and I don't really think it's needed. This only gets run once when a peer is removed, which shouldn't be happening often enough to need the new getAccessTime(), in my opinion?
Not pre, since we know this can affect the system (UnixNano). You could make the map take a time.Time vs an int64; those are cheap. Also, we cannot control how many times peer-remove can be called per second at the moment.
Updated to store time.Time instead.
server/raft.go
Outdated
 }
-n.removed[peer] = struct{}{}
+n.removed[peer] = time.Now().UnixNano()
Same here.
A peer should only be removed if the server is shut down and we want to indicate it isn't coming back. So the peer shouldn't be trying to come back in that situation?
Force-pushed from 2a3f2f7 to 80e6d36.
Updated to use … EDIT: updated to 5 minutes.
Why not like 5 minutes? Meaning, what are we basing this on? Should it be configurable?
It can be any value, but it should be short enough to be usable. It can be short if you shut down a server first and then peer-remove it. Could set it to 5 minutes, but I don't think it should be much higher. I don't think it should be made configurable; that would give a false sense of security, since you'd expect a peer-removed server can't come back before that time. But you can restart all servers to be able to re-admit it (since …). TLDR; does 5 minutes non-configurable sound okay?
Signed-off-by: Maurice van Veen <[email protected]>
Force-pushed from 80e6d36 to 8875c26.
Could be 2 or 5. We could also add a method to clear the removed state from the system before we bring up the server again?
Have changed it to 5 minutes.
LGTM
Includes the following: * #6815 * #6825 * #6827 Signed-off-by: Neil Twigg <[email protected]>
A server can be shut down and then peer-removed from the cluster to have its streams/consumers moved to a different server (if replicated), since peer-remove indicates the server will not be coming back.
However, imagine the following failover scenario/setup:
Although it sounds straightforward enough, you'd currently need to restart ALL servers in the (super) cluster to allow server X to come back like that. This would be a hidden requirement if server X only came back after some months, by which time the servers would likely have been upgraded/restarted already.
To make it more obvious/simple what happens:
Signed-off-by: Maurice van Veen [email protected]