`ra:consistent_query` hangs inexplicably #337

orre · 2022-12-06T10:56:56Z

orre
Dec 6, 2022

[I'm using RA v2.4.0]

We have a lot of "loss of majority" situations in our RA clusters due to "natural causes" that we cannot affect.
I've observed that ra:consistent_query seems to time out or hang in situations where, according to my (possibly limited) knowledge of Raft/Ra, it should not hang at all.

It goes so far that it hangs forever (or at least for a very long time) even when the cluster has a perfectly good majority and an elected leader.
When this occurs, ra:leader_query always works perfectly fine AND it is possible to commit to the log. It is only ra:consistent_query that mysteriously hangs.
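
A minimal sketch of how the two query types could be compared side by side when the problem shows up. The module name `ra_query_probe`, its functions, and the 5000 ms default timeout are hypothetical and not part of Ra or of the reproducer repo; the sketch only assumes that `ra:leader_query/3` and `ra:consistent_query/3` accept a timeout as their third argument.

```erlang
-module(ra_query_probe).
-export([probe/1, probe/2, classify/1]).

%% Hypothetical diagnostic helper; not part of Ra or of the reproducer repo.
%% Servers is a Ra server ID or a list of them, e.g. ra_kv_store:servers().
probe(Servers) ->
    probe(Servers, 5000).

%% Runs the same no-op query as a leader query and as a consistent query,
%% each bounded by Timeout (milliseconds), and summarises both outcomes.
probe(Servers, Timeout) ->
    QueryFun = fun(_State) -> undefined end,
    Leader = safe(fun() -> ra:leader_query(Servers, QueryFun, Timeout) end),
    Consistent = safe(fun() -> ra:consistent_query(Servers, QueryFun, Timeout) end),
    #{leader_query => classify(Leader),
      consistent_query => classify(Consistent)}.

%% Reduces a query result to {ok, Leader}, an error, or a timeout.
classify({ok, _Result, Leader}) -> {ok, Leader};
classify({error, Reason}) -> {error, Reason};
classify({timeout, ServerId}) -> {timeout, ServerId};
classify(Other) -> {unexpected, Other}.

%% Turns an exception raised by the call into a plain term.
safe(Fun) ->
    try Fun()
    catch Class:Reason -> {exception, Class, Reason}
    end.
```

Calling `ra_query_probe:probe(ra_kv_store:servers()).` while the cluster is in the state described above would be expected to show `leader_query` succeeding while `consistent_query` reports an error or a timeout.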

I have created a "reproducer" repo here
NB: hostname (and possibly also your domain) must be adapted in src/ra_kv_store.erl (function: servers()) before running.

Steps to reproduce:

1. Start 3 Erlang nodes:

   rebar3 shell --name [email protected]

   rebar3 shell --name [email protected]

   rebar3 shell --name [email protected]

2. Start the cluster from ra2:

   ra_kv_store:start_cluster().

3. Run the test cycle:

   ra_kv_store:until_block().
   ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).

Repeating the test cycle will eventually block for a long time.

Example session

([email protected])1> ===> Booted ra_test

([email protected])1>
([email protected])1>
([email protected])1> ra_kv_store:start_cluster().
Attempting to communicate with node [email protected], response: pong
Attempting to communicate with node [email protected], response: pong
Attempting to communicate with node [email protected], response: pong
{ok,[{ra_kv,'[email protected]'},
     {ra_kv,'[email protected]'},
     {ra_kv,'[email protected]'}],
    []}
([email protected])2> ra_kv_store:until_block().
Going for a round...
Restarting server  {ra_kv,'[email protected]'}
Going for a round...
Restarting server  {ra_kv,'[email protected]'}
Going for a round...
Restarting server  {ra_kv,'[email protected]'}
Going for a round...
** exception error: no match of right hand side value {error,
                                                       {no_more_servers_to_try,
                                                        [{timeout,{ra_kv,'[email protected]'}},
                                                         {timeout,{ra_kv,'[email protected]'}},
                                                         {error,noproc}]}}
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)
([email protected])3> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{ok,undefined,{ra_kv,'[email protected]'}}
([email protected])4> ra_kv_store:until_block().
Going for a round...
Restarting server  {ra_kv,'[email protected]'}
Going for a round...
Restarting server  {ra_kv,'[email protected]'}
Going for a round...
** exception error: no match of right hand side value {error,
                                                       {no_more_servers_to_try,
                                                        [{timeout,{ra_kv,'[email protected]'}},
                                                         {timeout,{ra_kv,'[email protected]'}},
                                                         {error,noproc}]}}
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)
([email protected])5> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])6> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])7> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{ok,undefined,{ra_kv,'[email protected]'}}
([email protected])8> ra_kv_store:until_block().                                          
Going for a round...
Restarting server  {ra_kv,'[email protected]'}
Going for a round...
** exception error: no match of right hand side value 
                    {error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                                    {timeout,{ra_kv,'[email protected]'}},
                                                    {error,noproc}]}}
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)
([email protected])9> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])10> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])11> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])12> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{ok,undefined,{ra_kv,'[email protected]'}}
([email protected])13> ra_kv_store:until_block().                                          
Going for a round...
Restarting server  {ra_kv,'[email protected]'}
Going for a round...
** exception error: no match of right hand side value 
                    {error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                                    {timeout,{ra_kv,'[email protected]'}},
                                                    {error,noproc}]}}
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)
([email protected])14> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])15> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])16> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])17> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'[email protected]'}},
                                {timeout,{ra_kv,'[email protected]'}},
                                {error,noproc}]}}
([email protected])18> ra:leader_query(ra_kv_store:servers(), fun(_) -> undefined end).    
{ok,{{9,8},undefined},{ra_kv,'[email protected]'}}

So what's going on here? Why is ra:consistent_query blocking?

Thanks
Örjan

michaelklishin · 2022-12-06T12:22:48Z

michaelklishin
Dec 6, 2022
Maintainer

ra:consistent_query/2 waits for a cluster majority to become available and for a leader election to finish.

If you don't have a cluster majority, you cannot use consistent queries. Arguably if this happens
often, you should not be using Raft and Ra.

1 reply

orre Dec 7, 2022
Author

Well, maybe I was too negative when describing our situation: it happens that servers crash or are rebooted, and then the cluster may lose majority for a brief period of time. But the problem I'm describing is not when there is no majority. On the contrary, majority is established, ra:process_command() works fine but consistent_query() does not.

lukebakken · 2022-12-07T15:46:01Z

lukebakken
Dec 7, 2022
Maintainer

I looked at the code and it seems like it should work fine.

However, this is what I see when running the reproduction test:

(ra1@shostakovich)1> ra_kv_store:start_cluster().
Attempting to communicate with node ra1@shostakovich, response: pong
Attempting to communicate with node ra2@shostakovich, response: pong
Attempting to communicate with node ra3@shostakovich, response: pong
{ok,[{ra_kv,ra1@shostakovich},
     {ra_kv,ra2@shostakovich},
     {ra_kv,ra3@shostakovich}],
    []}
(ra1@shostakovich)2> ra_kv_store:until_block().
Going for a round...
Restarting server  {ra_kv,ra1@shostakovich}
=SUPERVISOR REPORT==== 7-Dec-2022::07:44:08.925701 ===
    supervisor: {<0.283.0>,ra_server_sup}
    errorContext: start_error
    reason: {already_started,<0.276.0>}

I'll investigate.

2 replies

lukebakken Dec 8, 2022
Maintainer

@orre I consistently get the above error when running your code. I have taken your code and modified it here:

https://github.com/lukebakken/rabbitmq-ra-337-ra_test/tree/lukebakken/ra-337

Changes:

- Uses shortnames and figures out the hostname on its own
- Uses logger

To start (each in own terminal):

rebar3 shell --sname ra1
rebar3 shell --sname ra2
rebar3 shell --sname ra3

I don't know why I see a different result.

@kjnilsson ?

orre Dec 8, 2022
Author

There seems to be some random behavior tied to the ID of the servers.
Naming like this: [{ra_kv, '[email protected]'}, {ra_kv, '[email protected]'},{ra_kv, '[email protected]'}]
may give one type of problem (like already_started), as opposed to naming like this: [{ra_kv1, '[email protected]'}, {ra_kv2, '[email protected]'},{ra_kv3, '[email protected]'}], which should always lead to the originally reported problem (consistent_query errors with no_more_servers_to_try).

The (internal) docs say:

A Ra server is a Ra cluster member. Server ID is defined as a pair of {atom(), node()}. Server ID combines a locally registered name and the Erlang node it resides on.

From this I think it should be OK to use the same atom() for all nodes. But reality seems to suggest otherwise...
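
To make the two naming schemes concrete, here is a small sketch that builds both kinds of server ID list from a list of node names. The module `ra_server_ids`, its function names, and the node names in the usage note are hypothetical; the sketch only illustrates the `{atom(), node()}` shape quoted above and does not claim that either scheme is the correct one.

```erlang
-module(ra_server_ids).
-export([same_name/2, distinct_names/2]).

%% Hypothetical helpers for illustration only.

%% Same locally registered name on every node,
%% e.g. [{ra_kv, Node1}, {ra_kv, Node2}, {ra_kv, Node3}].
same_name(Name, Nodes) ->
    [{Name, Node} || Node <- Nodes].

%% A distinct registered name per node, built by appending an index,
%% e.g. [{ra_kv1, Node1}, {ra_kv2, Node2}, {ra_kv3, Node3}].
distinct_names(Prefix, Nodes) ->
    Indexed = lists:zip(lists:seq(1, length(Nodes)), Nodes),
    [{list_to_atom(atom_to_list(Prefix) ++ integer_to_list(I)), Node}
     || {I, Node} <- Indexed].
```

For example, with the made-up node names `'ra1@host'` and `'ra2@host'`, `ra_server_ids:same_name(ra_kv, ['ra1@host', 'ra2@host'])` yields the first scheme and `ra_server_ids:distinct_names(ra_kv, ['ra1@host', 'ra2@host'])` yields the second.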

shino · 2022-12-08T06:30:33Z

shino
Dec 8, 2022

Out of interest, I tried the reproduction steps and got a result similar to @orre's.

I used Luke's branch
OTP 25.1.1

The output of ra1 shell is: https://gist.github.com/shino/d8091d0a3ece8156974c99efd021c599 .
ra2 and ra3 shells had only "PROGRESS REPORT"s.

0 replies

kjnilsson · 2022-12-08T12:34:38Z

kjnilsson
Dec 8, 2022
Maintainer

@lukebakken There is a bug when re-declaring a new cluster with the same name while there is persisted data from a previous session, and that is what you're hitting. This PR should address that bug: #339 (basically clobbering the old cluster when a new one is declared).

I can also see a bug with consistent_query where the query indexes get out of sync after certain election events. Not sure how to fix it yet but am working on it.

1 reply

lukebakken Dec 8, 2022
Maintainer

Ah OK, thank you for the explanation. Stale data strikes again! 🤦‍♂️

kjnilsson · 2022-12-08T13:51:16Z

kjnilsson
Dec 8, 2022
Maintainer

OK, with the latest commits to #339 I left the test program running while I went to grab some lunch, and it was still running when I came back. @orre give it a try.

I will need to spend some more time reviewing the consistent query code to make sure it works as expected before we can merge this change, but I think I know what the problem was and should have at least solved the liveness issue.

1 reply

lukebakken Dec 8, 2022
Maintainer

You da man, #339 appears to fix the issue 👍

orre · 2022-12-09T08:05:31Z

orre
Dec 9, 2022
Author

Ok @kjnilsson - I've been trying to reproduce the problem all morning, but it seems to be gone with your PR in place!

Thanks to all that helped out.
And Good Job with fixing this

1 reply

kjnilsson Dec 9, 2022
Maintainer

Good to hear. Thanks for the repro app; it helped a lot in finding this.

kjnilsson · 2022-12-09T11:33:21Z

kjnilsson
Dec 9, 2022
Maintainer

Ok the fix is now in #340

0 replies
