-
If you don't have a cluster majority, you cannot use consistent queries. Arguably if this happens
-
I looked at the code and it seems like it should work fine. However, this is what I see when running the reproduction test:
I'll investigate.
-
Out of interest, I tried the reproduction steps and got a result similar to @orre's.
The output of the ra1 shell is: https://gist.github.com/shino/d8091d0a3ece8156974c99efd021c599 .
-
@lukebakken There is a bug when re-declaring a new cluster with the same name while persisted data from a previous session still exists, and that is what you're hitting. This PR should address that bug: #339 (basically clobbering the old cluster when a new one is declared). I can also see a bug with consistent_query where the query indexes get out of sync after certain election events. I'm not sure how to fix it yet, but I am working on it.
-
OK, with the latest commits to #339 I left the test program running while I went to grab some lunch, and it was still running when I came back. @orre, give it a try. I will need to spend some more time reviewing the consistent query code to make sure it works as expected before we can merge this change, but I think I know what the problem was and should have at least solved the liveness issue.
-
OK @kjnilsson, I've been trying to reproduce the problem all morning, but it seems to be gone with your PR in place! Thanks to all who helped out.
-
OK, the fix is now in #340.
-
[I'm using RA v2.4.0]
We have a lot of "loss of majority" situations in our RA clusters due to "natural causes" that we cannot affect.
I've observed a problem where ra:consistent_query seems to time out or hang in situations where it should not hang at all, according to my (possibly limited) knowledge of Raft/RA. It goes so far that it hangs forever (or at least for a very long time) even when the cluster has a perfectly good majority and an elected leader.
When this occurs, ra:leader_query always works perfectly fine AND it is possible to commit to the log. It is only ra:consistent_query that mysteriously hangs.
I have created a "reproducer" repo here
NB: hostname (and possibly also your domain) must be adapted in src/ra_kv_store.erl (function: servers()) before running.
Steps to reproduce:
Start 3 Erlang nodes
Start cluster from ra2
Run test cycle
Repeating the test cycle will eventually block for a long time.
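To make the comparison between the two query types concrete, here is a minimal sketch (not the code from the reproducer repo) that issues both queries against the same member with a bounded timeout and reports which one returned. The module name query_probe, the function names, and the example server id are hypothetical. The return shapes matched in classify/1 are what I understand ra:consistent_query/3 and ra:leader_query/3 to return, so a catch-all clause is included in case they differ.

```erlang
-module(query_probe).
-export([probe/2, classify/1]).

%% ServerId is a {Name, Node} tuple for any member of the cluster,
%% e.g. {ra_kv1, 'ra1@myhost'} (hypothetical name and host).
%% TimeoutMs bounds each call so a stuck query cannot block the caller forever.
probe(ServerId, TimeoutMs) ->
    %% The query fun simply returns the machine state unchanged.
    QueryFun = fun(State) -> State end,
    Consistent = ra:consistent_query(ServerId, QueryFun, TimeoutMs),
    Leader = ra:leader_query(ServerId, QueryFun, TimeoutMs),
    #{consistent => classify(Consistent),
      leader => classify(Leader)}.

%% Reduce a query result to a short tag that is easy to log.
classify({ok, _Reply, _Leader}) -> ok;
classify({timeout, _ServerId}) -> timeout;
classify({error, Reason}) -> {error, Reason};
classify(Other) -> {unexpected, Other}.
```

Calling query_probe:probe(ServerId, 5000) in a loop during the test cycle would show the situation described above as a map where consistent is timeout while leader is ok.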
Example session
So what's going on here? Why is ra:consistent_query blocking?
Thanks
Örjan