
deadlock issue of v1.15.6 on eth_getLogs for filter & thus shutdown corruption #31700


Closed
jun0tpyrc opened this issue Apr 24, 2025 · 17 comments · Fixed by #31708

Comments

@jun0tpyrc

jun0tpyrc commented Apr 24, 2025

System information


{"jsonrpc":"2.0","id":1,"result":"Geth/v1.15.9-stable-74165a8f/linux-amd64/go1.24.2"}

Expected behaviour

The call should be answered in a reasonable time.

Actual behaviour

Slow: the call times out after 1 minute. The same call finishes immediately on geth 1.15.3 (via some API providers) and on Erigon 3.0.2.

Steps to reproduce the behaviour

 curl 0:8545 -X POST   -H "Content-Type: application/json"   --data '{"method":"eth_getLogs","params":[{"fromBlock":"0x154ce96","toBlock":"0x154cefa","address":"0x918778e825747a892b17C66fe7D24C618262867d"}],"id":1,"jsonrpc":"2.0"}'
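
For reference, the same filter can also be expressed with go-ethereum's ethclient package. This is a minimal sketch, assuming a local node at http://127.0.0.1:8545, with a 60-second deadline so a hanging node surfaces as a context error instead of blocking indefinitely:

package main

import (
	"context"
	"fmt"
	"log"
	"math/big"
	"time"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	client, err := ethclient.Dial("http://127.0.0.1:8545")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Same filter as the curl command above: blocks 0x154ce96..0x154cefa
	// (22335126..22335226), limited to a single contract address.
	query := ethereum.FilterQuery{
		FromBlock: big.NewInt(22335126),
		ToBlock:   big.NewInt(22335226),
		Addresses: []common.Address{
			common.HexToAddress("0x918778e825747a892b17C66fe7D24C618262867d"),
		},
	}

	// Bound the call so a hanging node surfaces as "context deadline exceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	logs, err := client.FilterLogs(ctx, query)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d logs\n", len(logs))
}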

Backtrace

[backtrace]


@jun0tpyrc
Author

jun0tpyrc commented Apr 24, 2025

Other calls are fine, with no slowdown.

@rjl493456442
Member

The block range is:

>>> 0x154ce96
22335126
>>> 0x154cefa
22335226

@jun0tpyrc jun0tpyrc changed the title from "peformance regression of 1.15.9 onf eth_getLogs for some range ?!" to "peformance regression of 1.15.9 on eth_getLogs for some range ?!" Apr 24, 2025
@wdsse

wdsse commented Apr 24, 2025

+1

@22november1963

+1

@jun0tpyrc
Author

I think this one might be important.
I have also seen this glitch cause different L2 setups running against the L1 geth to get stuck.
Reproduced on 4 geth nodes of this version.

@murat-deniz1

+1

Is this issue related to #31589?

@fjl
Contributor

fjl commented Apr 24, 2025

Is the timeout a repeatable issue, i.e. does it always occur with the same query?

@jun0tpyrc
Author

Is the timeout a repeatable issue, i.e. does it always occur with the same query?

Definitely reproducible. I can see it on 4 different nodes with different filters (even changing to a single-block range, or a lower-traffic address); it gets stuck easily.

It feels like some kind of deadlock.

It may not have first appeared in 1.15.9 (our upgrade path was the alltools-v1.14.13 -> alltools-v1.15.9 docker images).

Sounds similar to the #31589 mentioned above.

@rjl493456442
Member

rjl493456442 commented Apr 24, 2025

@jun0tpyrc
Can you please share the query you used?

 curl 0:8545 -X POST   -H "Content-Type: application/json"   --data '{"method":"eth_getLogs","params":[{"fromBlock":"0x154ce96","toBlock":"0x154cefa","address":"0x918778e825747a892b17C66fe7D24C618262867d"}],"id":1,"jsonrpc":"2.0"}'

We can get the result instantly for this query:

> eth.getLogs({"fromBlock":"0x154ce96","toBlock":"0x154cefa","address":"0x918778e825747a892b17C66fe7D24C618262867d"})
[]

@jun0tpyrc
Author

jun0tpyrc commented Apr 24, 2025

@jun0tpyrc Can you please share the query you used?

We can get the result instantly for this query.

It is just the same as your query above. On a different geth API provider (checked via RPC) or on Erigon, we also get the [] (empty result) instantly, but not on the geth nodes we are running now.

Furthermore, I just found that shutting down gracefully is also a problem on this version: it took about 20 minutes and got stuck without the final peaceful "Blockchain stopped" message.

INFO [04-24|12:36:38.215] Looking for peers                        peercount=1   tried=42  static=1
WARN [04-24|12:36:38.389] Already shutting down, interrupt more to panic. times=9
..
WARN [04-24|12:52:00.769] Beacon client online, but no consensus updates received in a while. Please fix your beacon client to follow the chain!
INFO [04-24|12:52:07.761] Looking for peers                        peercount=0   tried=64  static=1
INFO [04-24|12:52:17.835] Looking for peers                        peercount=0   tried=58  static=1
INFO [04-24|12:52:27.943] Looking for peers                        peercount=0   tried=72  static=1
INFO [04-24|12:52:37.966] Looking for peers                        peercount=1   tried=56  static=1
INFO [04-24|12:52:48.012] Looking for peers                        peercount=0   tried=45  static=1

Main run command:

geth --networkid=1 --config=/var/lib/geth/geth.toml --http.port=8545 --cache=102400 --http --http.addr=0.0.0.0 --http.corsdomain=* --http.vhosts=* --gcmode="full" --snapshot=false --txlookuplimit=0 --cache.preimages --authrpc.jwtsecret="/opt/jwtsecret" --authrpc.vhosts="*" --authrpc.port="8551" --authrpc.addr="localhost"

@jun0tpyrc
Author

jun0tpyrc commented Apr 24, 2025

1.15.9: cannot shut down normally (forcefully stopped after 25 min)
1.15.9 -> 1.14.13: cannot start; won't work due to the db incompatibility, as mentioned
-> 1.15.6: can start, does not solve the call timeout issue (although my db state started to corrupt and could not move forward to new blocks, I can still examine old blocks)
-> 1.15.3: can start, solves the call timeout issue; the call returns immediately
-> 1.15.4: can start, solves the call timeout issue; the call returns immediately
-> 1.15.5: can start, solves the call timeout issue; the call returns immediately
-> 1.15.6: can start, but the call timeout issue is back

This sounds like a 1.15.5 -> 1.15.6 problem.

@fjl
Contributor

fjl commented Apr 24, 2025

@jun0tpyrc Could you do something for us, please? While your query is hanging, please attach to Geth via console (geth attach) and run debug.stacks(), then post the output here. It will be a lot of output, so you may have to post it via gist.github.com. Thank you.
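
(For anyone capturing the dump non-interactively: the console's debug.stacks() is backed by the debug_stacks RPC method, which is available over the IPC endpoint. A minimal sketch using go-ethereum's rpc package follows; the IPC path here is an assumption and should be adjusted to <datadir>/geth.ipc for your node.)

package main

import (
	"log"
	"os"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	// Connect to the node's IPC socket (assumed path; adjust to your datadir).
	client, err := rpc.Dial("/var/lib/geth/geth.ipc")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// debug_stacks returns every goroutine stack as one large string.
	var dump string
	if err := client.Call(&dump, "debug_stacks"); err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("gethtraces.txt", []byte(dump), 0o644); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %d bytes to gethtraces.txt", len(dump))
}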

@jun0tpyrc
Author

jun0tpyrc commented Apr 24, 2025

It is an almost 20 MB one.

This was captured while running:

curl 0:8545 -X POST   -H "Content-Type: application/json"   --data '{"method":"eth_getLogs","params":[{"fromBlock":"0x154ce96","toBlock":"0x154cefa","address":"0x918778e825747a892b17C66fe7D24C618262867d"}],"id":1,"jsonrpc":"2.0"}'

gethtraces.txt

@fjl
Contributor

fjl commented Apr 24, 2025

Thank you!

@jun0tpyrc jun0tpyrc changed the title from "peformance regression of 1.15.9 on eth_getLogs for some range ?!" to "peformance regression of 1.15.6 on eth_getLogs for even single block/smalle range filter" Apr 24, 2025
@jun0tpyrc jun0tpyrc changed the title from "peformance regression of 1.15.6 on eth_getLogs for even single block/smalle range filter" to "peformance regression of 1.15.6 on eth_getLogs for even single block/small range filter" Apr 24, 2025
@zsfelfoldi
Contributor


Thanks for the detailed report and the stack dump, I found the issue. It is indeed a deadlock, happening when a log search runs while an old part of the chain is being unindexed. It is also responsible for the shutdown issue.
One question though, just to confirm my hypothesis: are you running your node in hashdb mode (archive node)? Or are you using the old leveldb database instead of pebbledb?

@jun0tpyrc
Author

are you running your node in hashdb mode (archive node)? Or are you using the old leveldb database instead of pebbledb?

These are non-archive nodes with an old leveldb datadir.

@jun0tpyrc jun0tpyrc changed the title from "peformance regression of 1.15.6 on eth_getLogs for even single block/small range filter" to "deadlock issue of v1.15.6 on eth_getLogs for filter & thus shutdown corruption" Apr 24, 2025
@zsfelfoldi
Contributor

These are non-archive nodes with an old leveldb datadir.

Thanks. leveldb does cause the history unindexing to run in a fallback mode that wasn't tested properly and interferes with log searches. This PR should fix the issue: #31704

zsfelfoldi pushed a commit that referenced this issue Apr 25, 2025
This PR fixes a deadlock situation in deleteTailEpoch that might arise when the range delete is running in iterator-based fallback mode (either using the leveldb database or the hashdb state storage scheme).

In this case a stopCb callback is called periodically that checks events, including matcher sync requests, in which case it tries to acquire indexLock for read access, while deleteTailEpoch already holds it for write access.

This pull request removes the indexLock acquisition in FilterMapsMatcherBackend.synced, as this function is only called in the indexLoop.

Fixes #31700
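
For readers following along, the deadlock described above is a self-deadlock on a non-reentrant sync.RWMutex. A simplified standalone sketch of the pattern (the function and lock names mirror the description above; this is not the actual filtermaps code):

package main

import "sync"

var indexLock sync.RWMutex

// Stand-in for FilterMapsMatcherBackend.synced: it wants read access.
func synced() {
	indexLock.RLock() // blocks: the write lock is already held below
	defer indexLock.RUnlock()
	// ... report matcher sync status ...
}

// Stand-in for the stopCb callback invoked during iterator-based deletion:
// it checks pending events, including matcher sync requests.
func stopCb() bool {
	synced()
	return false
}

// Stand-in for deleteTailEpoch in the leveldb/hashdb fallback mode.
func deleteTailEpoch() {
	indexLock.Lock() // write access held for the whole range delete
	defer indexLock.Unlock()
	for {
		// ... delete a batch of tail index entries ...
		if stopCb() { // never returns: sync.RWMutex is not reentrant
			return
		}
	}
}

func main() {
	// Running this aborts with "fatal error: all goroutines are asleep - deadlock!"
	deleteTailEpoch()
}

With the read-lock in synced removed, as the fix above describes, the callback no longer waits on a lock its own goroutine already holds.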