Raft deadlock during member removal #1379
-
Hey, as long as I am not losing my state, everything is fine and stable. Node restarts, full cluster shutdowns and restarts, all working well. But what I am currently investigating with regard to disaster recovery is the situation where you lose a node for whatever reason, and it cannot come up again and needs to be replaced, e.g. because of a broken volume. Sometimes I just restart the broken node with an empty volume. I have tried different ways of doing this in an automated way for a few days now, but I always end up in an unstable situation where the whole cluster basically locks itself up, and I need to kill it and do a clean restart.
And the issue is, it gets stuck so badly that the whole application stops working. It seems like it's actually blocking the tokio runtime; is that possible? The weirdest part about it is that the other healthy Follower then stops working as well and gets into a locked state, which is the reason why the whole cluster just dies, after it logs:
Now my question: am I doing something obviously wrong, missing something, or is it actually possible for such a situation to come up, and if so, how can I avoid it? Edit: As long as I did not mess up somewhere else, I can confirm that it completely locks the whole tokio runtime on all nodes. I enabled …
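As an aside, one common way to confirm that the tokio runtime itself is blocked is tokio-console. The snippet below is only a minimal, hypothetical setup sketch (it assumes the `console-subscriber` crate and a build with `RUSTFLAGS="--cfg tokio_unstable"`), not the instrumentation actually used in this thread.

```rust
// Minimal, hypothetical tokio-console setup for spotting tasks/threads that
// never yield; not the setup used in this thread.
#[tokio::main]
async fn main() {
    // Starts the instrumentation layer that the `tokio-console` CLI attaches to.
    console_subscriber::init();

    // ... start the node / raft here as usual ...
    tokio::time::sleep(std::time::Duration::from_secs(60)).await;
}
```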
-
I tried digging a bit deeper into the cause, but unfortunately it's not that easy to see what is going on if you're unfamiliar with the internals. From knowing when this happens every time, it must be related to removing the "broken" and offline cluster member. The console in the screenshot is connected to the process of the current cluster Leader. Did I mess up badly, or is this a known issue (and perhaps already fixed)?
Edit: I also noticed that the Leader, after removing the node from the members and removing and re-starting all replication, also creates a new replication client for the just-removed node, which is something I would not expect after a removal. In the example logs below, for instance, I simulated a lost Raft state volume on Node 2, which is first removed from the cluster members, but then a new client stream for that node is opened anyway? The connection of course fails as expected, because the node is offline and should be removed from the cluster.
Edit 2: Just to make sure that my cluster leave logic works in general, I did the cluster leave automatically before shutting down the raft. This works just fine, as expected, and finishes basically immediately. When this node comes up again (with an in-memory logs store to simulate volume loss), it re-joins, syncs Snapshot + Logs, and everything is just fine. However, when I call …
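To make the ordering described in Edit 2 concrete, here is a rough sketch of a "leave, then shut down" flow. This is not hiqlite's actual leave logic; it assumes an openraft 0.9-style `Raft` handle, and the exact `ChangeMembers` variants, `retain` semantics, and error types may differ between versions.

```rust
use std::collections::BTreeSet;

use openraft::{ChangeMembers, Raft, RaftTypeConfig};

// Hypothetical helper: remove this node from the voters first, and only stop
// the local Raft once that membership change has gone through.
async fn leave_then_shutdown<C: RaftTypeConfig>(raft: Raft<C>, my_id: C::NodeId) {
    let mut ids = BTreeSet::new();
    ids.insert(my_id);

    // `retain = false`: do not keep the removed voter around as a learner.
    if let Err(err) = raft
        .change_membership(ChangeMembers::RemoveVoters(ids), false)
        .await
    {
        eprintln!("cluster leave failed: {err}");
        return;
    }

    // Shut the local Raft down only after the leave has been committed.
    let _ = raft.shutdown().await;
}
```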
-
It is the latest 0.9 release, right? The latest is 0.9.18. I'm going to have a look to see if it's possible to have such an issue.
-
First of all, I don't yet have an idea what kind of situation would lead to such an issue. I need your help to find out what went wrong. First, please enable debug-level logging. It will print a lot of the other actions that happened during execution. And please add some logging to your trait implementations, such as outputting a log just after entering them. And let me confirm with you how to reproduce this issue with the problematic node:
And I assume you were running a 3-node cluster with nodes 1, 2, 3, right?
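To illustrate the suggestion above, here is a minimal sketch of debug-level logging plus a log line on method entry. It uses plain `tracing`/`tracing-subscriber` (with the `env-filter` feature); `MyLogStore` and `append` are placeholder names, not the real trait implementations.

```rust
use tracing::debug;
use tracing_subscriber::EnvFilter;

// Placeholder standing in for a real storage/network trait implementation;
// the point is only the `debug!` line at the top of each method.
struct MyLogStore;

impl MyLogStore {
    async fn append(&mut self, entries: Vec<String>) {
        debug!(count = entries.len(), "entering append");
        // ... actual storage logic ...
    }
}

#[tokio::main]
async fn main() {
    // Equivalent to running with RUST_LOG=debug.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::new("debug"))
        .init();

    let mut store = MyLogStore;
    store.append(vec!["entry-1".into()]).await;
}
```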
-
Now I can run the cluster: with all 3 nodes online, I ctrl-c to shut down node-3, then restart it. What should I expect to see? It looks like node-1 and node-2 work fine on my laptop. They keep outputting logs on the terminal. Does that mean no deadlock happened?
-
I used your example to start all 3 nodes, but it still looks like there are two raft instances running inside each process? There are some logs with this text:
node-3 log:
And node-3 cannot connect to the cluster:
I'm going to test it on another computer tomorrow to see if it just does not work on my laptop 🤔
-
Finally, I can reproduce the deadlock issue.
-
I noticed there is a blocking send in the `drop` method: https://github.com/sebadob/hiqlite/blob/60337939de8bde26ab09c06a3f43122b735fdb80/hiqlite/src/network/raft_client.rs#L511-L518

It blocks when node-3 re-joins the cluster. The reason it blocks may be that the receiving end has already been destroyed, or something similar. Changing it to `try_send()`, a non-blocking send, lets node-3 re-join the cluster smoothly:

```rust
impl Drop for NetworkConnectionStreaming {
    fn drop(&mut self) {
        let _ = self.sender.try_send(RaftRequest::Shutdown);
        if let Some(task) = self.task.take() {
            task.abort();
        }
    }
}
```

I tried updating the channel buffer size from 1 to 100, but it does not change anything 🤔

```rust
impl RaftNetworkFactory<TypeConfigSqlite> for NetworkStreaming {
    type Network = NetworkConnectionStreaming;

    async fn new_client(&mut self, _target: NodeId, node: &Node) -> Self::Network {
        let (sender, rx) = flume::bounded(1); // may block if the receiving end does not consume the message
        // ...
    }
}
```
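For completeness, here is a minimal, self-contained sketch of the failure mode described above. It is not hiqlite code; the struct and channel payload are made up, and it only illustrates why a blocking `send` inside `Drop` can park a tokio worker thread while `try_send` cannot.

```rust
use std::time::Duration;

// Stand-in for a connection type whose Drop impl signals a background task.
struct Conn {
    sender: flume::Sender<&'static str>,
}

impl Drop for Conn {
    fn drop(&mut self) {
        // Blocking variant: would park this (tokio worker) thread until the
        // channel has capacity, which never happens in the scenario below.
        // let _ = self.sender.send("shutdown");

        // Non-blocking variant: returns Err(Full/Disconnected) immediately.
        let _ = self.sender.try_send("shutdown");
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = flume::bounded(1);
    tx.send("fill the single slot").unwrap(); // channel is now full
    let _rx = rx; // receiver stays alive but never calls recv()

    drop(Conn { sender: tx }); // with try_send() this returns immediately

    tokio::time::sleep(Duration::from_millis(10)).await;
    println!("runtime still responsive");
}
```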