Memory leak on high JSON RPC load #22529
Comments
Can you share a sample set of JSON RPC requests with us, so we can replicate it better?
Sure, I have recorded the requests with timestamps around the point of the memory leak. Here you go; it's basically eth_getLogs and eth_call. Later we noticed that the mass of calls was due to a bug in the GUI, so it was a kind of unintentional load test on geth. In the attached log you'll find the detailed log of the calls during the memory leak, including the timestamps. Moreover, there is a second file with some 50 calls; when I push them to the geth node a few times, I can reproduce the memory leak.
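For illustration, a programmatic equivalent of the kind of recorded request mentioned here, an eth_getLogs filter that starts at block 0, might look like the sketch below. The endpoint and the filter address are placeholders, not values taken from the attached log.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Placeholder endpoint; the real node exposes its own RPC/WS ports.
	client, err := ethclient.Dial("ws://localhost:8546")
	if err != nil {
		log.Fatal(err)
	}

	// FromBlock 0 with no ToBlock makes the node walk the whole chain,
	// touching every header, on every single call.
	query := ethereum.FilterQuery{
		FromBlock: big.NewInt(0),
		Addresses: []common.Address{
			common.HexToAddress("0x0000000000000000000000000000000000000000"), // placeholder contract
		},
	}

	logs, err := client.FilterLogs(context.Background(), query)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("matched logs:", len(logs))
}
```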
Hmm, so I can't really repro the memory leak, but I can see that there is a lot of memory being allocated during the filtering. I noticed you filter the whole chain on every request, which means the node needs to retrieve every header of every block for every call. This quickly adds up: in my case, after sending ~300,000 requests, this allocated ~280GB of memory. The memory is immediately freed once the request is done, though.
@maran thanks for looking into this. Yes, we're using a lot of logs and collecting them from block 0 on. With 10M blocks, that should be around 1G of memory for the headers alone. But once in memory, is this header data reused for other requests, or does each request fetch the headers on its own? I'll repeat the load test on my side and send you the mem profile.
There is a cache for the headers afaict, which grows with time (~1.9GB on my machine), but it seems most requests will still retrieve a new header. Also, the blocknumber -> blockhash mapping is not cached and has to be retrieved from the DB on every request.
Here's the memprofile. I pushed the attached ~50 requests ~20 times via two wscat instances. The memory consumption jumped from 770M to 3.2G. If I keep going, I can bring it to 32G+, which leads to a crash of geth. EDIT: Memory only goes up when the messages are pushed to geth fast enough that they execute in parallel. If I send the requests slowly, everything is fine, so it doesn't actually seem to be a memory leak, as you already said.
Yeah, so what I think happens is that whenever a new request comes in, the handler asks the database for the compressed BloomBits of a section and decompresses them into 512 bytes. This decompressed bloom bit-set is stored in the task and kept until the task is finished. If many tasks are started concurrently, that blows up, as this memory is guaranteed to stay allocated for the duration of the call. So does memory come back down if you stop sending requests?
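For intuition, here is a rough, purely illustrative sketch of the allocation pattern being described: each in-flight task holds its decompressed bit vectors until the request finishes, so retained memory scales with the number of concurrent requests. The names, section count, and sleep are made up; this is not geth's actual core/bloombits code.

```go
package main

import (
	"sync"
	"time"
)

const sectionBytes = 512 // decompressed bloom bit-vector per section, as described above

type task struct {
	decompressed []byte // retained for the lifetime of the task
}

func handleRequest(sections int) {
	tasks := make([]*task, 0, sections)
	for i := 0; i < sections; i++ {
		// In geth the compressed bits come from the database; here we just allocate.
		tasks = append(tasks, &task{decompressed: make([]byte, sectionBytes)})
	}
	// Stand-in for the matching work; all buffers stay reachable until we return.
	time.Sleep(100 * time.Millisecond)
	_ = tasks
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ { // many getLogs-style requests in flight at once
		wg.Add(1)
		go func() {
			defer wg.Done()
			handleRequest(2700) // roughly 11M blocks split into 4096-block sections
		}()
	}
	wg.Wait()
}
```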
@MariusVanDerWijden At the OS level, I can only observe that once memory has been allocated for the process, it stays allocated. When I send the requests in slowly it doesn't grow; when I send them in fast, in parallel, it grows quickly and the requests take longer.
Hmm, yeah, so we talked a bit about this issue internally. The problem is that the OS cannot accurately see how much memory geth consumes. We do have built-in metrics that can be used to measure the RAM usage accurately.
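As an aside on why the OS view is misleading: the Go runtime distinguishes heap that is in use from heap that has been freed but not yet returned to the OS, so RSS can stay high even after the calls finish. A minimal sketch of reading those runtime counters (this is generic Go, not geth's own metrics system):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc:    %d MiB\n", m.HeapAlloc/1024/1024)    // live objects
	fmt.Printf("HeapInuse:    %d MiB\n", m.HeapInuse/1024/1024)    // spans with live objects
	fmt.Printf("HeapIdle:     %d MiB\n", m.HeapIdle/1024/1024)     // freed, still held by the runtime
	fmt.Printf("HeapReleased: %d MiB\n", m.HeapReleased/1024/1024) // returned to the OS
	fmt.Printf("Sys:          %d MiB\n", m.Sys/1024/1024)          // roughly what the OS sees
}
```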
@MariusVanDerWijden here you go. I did a mem dump every ~1min. After bringing it to 5G, I continued dumping the mem profile for ~5min.
Yeah, just as I thought: during extensive load the memory blows up in the bloom handler, since every request allocates 512 bytes that are retained for the duration of the call. After the calls are finished, the memory is deallocated, so there's no memory leak there. One thing we might look into is whether we should cap the memory available to allocate for these calls. memprofile-10:
memprofile-13:
Thank you very much for the data @ivica7!
@MariusVanDerWijden thank you for analyzing the problem so fast! Just wanted to ask: would it make sense to add a limit on how many requests can be processed in parallel, adding some kind of worker pool, possibly controlled via a CLI param? And one more bonus question, if you have time, something I've always wanted to know: when filtering events, bloom filters are used. These filters are checked block by block, right? There is no hierarchy of bloom filters that would let you exclude multiple blocks at once? For our case with 11M blocks, that would mean 11M bloom filter checks when getting logs.
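A minimal sketch of the kind of limiter being asked about here, using a buffered channel as a counting semaphore so that at most a fixed number of getLogs-style requests run at once. The cap and the handler are hypothetical, not geth's implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const maxParallel = 8 // hypothetical cap, e.g. set via a CLI flag

var sem = make(chan struct{}, maxParallel)

func handleGetLogs(id int) {
	sem <- struct{}{}        // acquire a slot; blocks while maxParallel requests are running
	defer func() { <-sem }() // release the slot when the request is done

	time.Sleep(50 * time.Millisecond) // stand-in for the actual filtering work
	fmt.Println("finished request", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			handleGetLogs(id)
		}(i)
	}
	wg.Wait()
}
```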
Yeah, we've been discussing something like that already; I've implemented something similar for tracing in #21755, but that isn't finalized yet. Yes, afaict the bloom filters are checked block by block, see go-ethereum/eth/filters/filter.go, line 218 at e862cbf.
I think that's the big problem in your use case: the getLogs calls take long to execute and retain a lot of memory per request, so the next requests take even longer and retain their memory for longer, and so on. I would suggest (if that fits your use case) limiting the fromBlock field in getLogs to something more reasonable.
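For illustration, a bounded variant of the query sketched earlier, restricting fromBlock to a recent window instead of block 0. The endpoint and the 50,000-block window are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	client, err := ethclient.Dial("ws://localhost:8546") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	head, err := client.BlockNumber(ctx)
	if err != nil {
		log.Fatal(err)
	}
	window := uint64(50000) // placeholder range; pick what the GUI actually needs
	from := uint64(0)
	if head > window {
		from = head - window
	}

	query := ethereum.FilterQuery{
		FromBlock: new(big.Int).SetUint64(from),
		ToBlock:   new(big.Int).SetUint64(head),
	}
	logs, err := client.FilterLogs(ctx, query)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d logs in blocks %d..%d\n", len(logs), from, head)
}
```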
I'll close the issue for now, if that's okay.
v1.10.4 is leaking with minimal RPC load now.
Version: 1.9.25-stable
Git Commit: e787272
Git Commit Date: 20201211
Architecture: amd64
Protocol Versions: [65 64 63]
Go Version: go1.15.6
Operating System: linux
RAM: 128GB
We have a private Clique network with some 11M blocks. The raw geth database size is around 8GB and --cache is set to 8192. The RPC and WS ports are open. The node is not mining; it is used exclusively by the GUI to fetch data from the blockchain.
The problem we have: as soon as the GUI sends a heavier load of eth_getLogs and eth_call JSON requests (~8000 calls within 5min), memory usage increases dramatically. Within 5min it jumps from the initial 2GB to 32GB, and finally geth crashes at some point. Moreover, performance continuously degrades, so every further call slows the system down even more, which leads to an even heavier RAM load. When we stop sending JSON RPC requests, the memory is not released.
I have captured all JSON requests, and if I execute them one by one there are no problems at all: memory usage does not increase and response times are good. But as soon as I start pushing them to the node in parallel, the drastic memory increase is guaranteed.
My assumption is that there is a memory leak occurring when JSON RPC calls run concurrently.