Memory leak on high JSON RPC load #22529

Closed

ivica7 opened this issue Mar 19, 2021 · 14 comments

@ivica7

ivica7 commented Mar 19, 2021

Version: 1.9.25-stable
Git Commit: e787272
Git Commit Date: 20201211
Architecture: amd64
Protocol Versions: [65 64 63]
Go Version: go1.15.6
Operating System: linux
RAM: 128 GB

We have a private network using clique with some 11M blocks. The raw geth database is around 8 GB and --cache is set to 8192. The RPC and WS ports are open. The node is not mining; it is used exclusively by our GUI to fetch data from the blockchain.

The problem is that as soon as the GUI sends a heavier load of eth_getLogs and eth_call JSON-RPC requests (~8,000 calls within 5 minutes), memory usage increases dramatically: within 5 minutes it jumps from an initial 2 GB to 32 GB, and at some point the node crashes. Moreover, performance continuously degrades, so every further call slows the system down even more, which in turn leads to even heavier RAM usage. When we stop sending JSON-RPC requests, the memory is not released.

I have captured all the JSON requests, and if I execute them one by one there are no problems at all: memory usage does not increase and response times are good. But as soon as I start pushing them to the node in parallel, the drastic memory increase is guaranteed.

My assumption is that there is a memory leak occurring when JSON-RPC calls run concurrently.
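
For illustration, a minimal sketch of the kind of replay harness that triggers this (not the reporter's actual tooling; the file name, endpoint and the choice to fire every request at once are assumptions):

// replay.go -- replays recorded JSON-RPC requests against a node in parallel.
// Illustrative only: assumes requests.jsonl contains one JSON-RPC request body
// per line and that the node's HTTP endpoint listens on localhost:8545.
package main

import (
    "bufio"
    "bytes"
    "log"
    "net/http"
    "os"
    "sync"
)

func main() {
    f, err := os.Open("requests.jsonl")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    var wg sync.WaitGroup
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        body := append([]byte(nil), scanner.Bytes()...) // copy, the scanner reuses its buffer
        wg.Add(1)
        go func(req []byte) { // fire all requests concurrently instead of one by one
            defer wg.Done()
            resp, err := http.Post("http://localhost:8545", "application/json", bytes.NewReader(req))
            if err != nil {
                log.Println("request failed:", err)
                return
            }
            resp.Body.Close() // response content is irrelevant for the load test
        }(body)
    }
    wg.Wait()
}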

@MariusVanDerWijden
Member

Can you share a sample set of JSON-RPC requests with us, so we can reproduce this more easily?

@ivica7
Author

ivica7 commented Mar 22, 2021

Sure, I have recorded the requests, with timestamps, around the time of the memory leak. Here you go. It's basically eth_getLogs and eth_call. We later noticed that the mass of calls was due to a bug in the GUI, so it was a kind of unintentional load test on geth.

The attached archive contains the detailed log of the calls during the memory leak, including timestamps. There is also a second file with some 50 calls; when I push those to the geth node a few times, I can reproduce the leak.

2103_1_geth_memproblems.zip

@MariusVanDerWijden
Member

Hmm, so I can't really reproduce the memory leak, but I can see that a lot of memory is being allocated during the filtering. I noticed you filter the whole chain on every request. This means the node needs to retrieve every header of every block for every call, which quickly adds up. In my case, after sending ~300,000 requests, this allocated ~280 GB of memory. The memory is immediately freed once the request is done, though.
Could you maybe dump the memory on your node using debug.writeMemProfile("file") in the console? That way I could investigate it better.
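
For reference, one way to take such a dump from a running node is via an attached console (the IPC path and output file name here are placeholders):

$ geth attach /path/to/datadir/geth.ipc
> debug.writeMemProfile("/tmp/memprofile-1")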

@ivica7
Author

ivica7 commented Mar 22, 2021

@maran thanks for looking into this. Yes, we're using a lot of logs and collecting them from block 0 onward. With 10M blocks, that should be around 1 GB of memory for that alone. But once in memory, is this header data reused for other requests, or does each request fetch the headers on its own?

I'll repeat the load test on my side and send you the mem profile.

@MariusVanDerWijden
Member

There is a cache for the headers afaict, which grows over time (~1.9 GB on my machine), but it seems most requests will still retrieve a fresh header. Also, the mapping block number -> block hash is not cached and has to be retrieved from the DB on every request.

         .          .    517:// GetHeaderByNumber retrieves a block header from the database by number,
         .          .    518:// caching it (associated with its hash) if found.
         .          .    519:func (hc *HeaderChain) GetHeaderByNumber(number uint64) *types.Header {
         .   128.97GB    520:	hash := rawdb.ReadCanonicalHash(hc.chainDb, number)
         .          .    521:	if hash == (common.Hash{}) {
         .          .    522:		return nil
         .          .    523:	}
         .   141.70GB    524:	return hc.GetHeader(hash, number)
         .          .    525:}
         .          .    483:func (hc *HeaderChain) GetHeader(hash common.Hash, number uint64) *types.Header {
         .          .    484:	// Short circuit if the header's already in the cache, retrieve otherwise
         .          .    485:	if header, ok := hc.headerCache.Get(hash); ok {
         .          .    486:		return header.(*types.Header)
         .          .    487:	}
         .   137.66GB    488:	header := rawdb.ReadHeader(hc.chainDb, hash, number)
         .          .    489:	if header == nil {
         .          .    490:		return nil
         .          .    491:	}
         .          .    492:	// Cache the found header for next time and return
    1.87GB     4.30GB    493:	hc.headerCache.Add(hash, header)
         .          .    494:	return header

@ivica7
Author

ivica7 commented Mar 22, 2021

Here's the memory profile. I pushed the attached ~50 requests ~20 times via two wscat instances. Memory consumption jumped from 770 MB to 3.2 GB. If I keep going, I can bring it to 32 GB+, which eventually crashes geth.

memprofile.zip

EDIT: Memory only goes up when the messages are pushed to geth fast enough that they execute in parallel. If I send the requests slowly, everything is fine. So it doesn't actually seem to be a memory leak, as you already said.

@MariusVanDerWijden
Member

Yeah, so what I think happens is that whenever a new request comes in, the handler asks the database for the compressed BloomBits for a section. It then decompresses the section into 512 bytes. This decompressed bloom bitset is stored in the task and kept until the task is finished. If many tasks are started concurrently, that blows up, as this memory is guaranteed to stay allocated for the duration of the call.
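
For illustration, a minimal, self-contained sketch of that decompression step: the 512 bytes correspond to one bit per block in a 4096-block section, and the expanded slice is what each in-flight task holds on to. This is not the actual handler code, and the stand-in data is made up:

package main

import (
    "fmt"

    "github.com/ethereum/go-ethereum/common/bitutil"
)

func main() {
    // Stand-in for a compressed bloom bit-vector as stored in the database
    // (real data would come from the BloomBits index, not be built here).
    raw := make([]byte, 512) // 4096 blocks per section, one bit per block
    raw[0] = 0x80            // pretend block 0 of the section is a potential match
    compressed := bitutil.CompressBytes(raw)

    // This is the expansion each concurrent task keeps alive until it finishes.
    decompressed, err := bitutil.DecompressBytes(compressed, 512)
    if err != nil {
        panic(err)
    }
    fmt.Printf("compressed: %d bytes, decompressed: %d bytes\n",
        len(compressed), len(decompressed))
}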

So memory comes back down if you stop sending requests?

@ivica7
Author

ivica7 commented Mar 22, 2021

@MariusVanDerWijden At the OS level, I can only observe that once memory gets allocated to the process, it stays allocated. When I send the requests in slowly, it doesn't grow; when I send them in fast, in parallel, it grows quickly and the requests take longer.

@MariusVanDerWijden
Member

Hmm, yeah, we talked a bit about this issue internally. The problem is that the OS cannot accurately see how much memory geth consumes. We do have built-in metrics that can be used to measure the RAM usage accurately.
What would really help us triage this issue is if you could run your node for some time (2 minutes), then execute your RPC calls (until memory goes up to 5 GB or so), and then let the node run for another 5 minutes to see if the used memory gets garbage collected.
In between, dump the memory with debug.writeMemProfile("file[1/2/3]") so that we can see whether the memory gets collected.
Alternatively, if you already collect metrics with --metrics, that would be good too.
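
For reference, the resulting dumps can be inspected locally with Go's pprof tool, which is what produces the top30 listings shown further down (the file name is a placeholder):

$ go tool pprof memprofile-1
(pprof) top30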

@ivica7
Author

ivica7 commented Mar 22, 2021

@MariusVanDerWijden here you go. I did a memory dump every ~1 minute. After bringing it up to 5 GB, I kept dumping the memory profile for another ~5 minutes.
memprofile.zip

@MariusVanDerWijden
Member

Yeah, just as I thought: during extensive load the memory blows up in the bloom handler, since every request allocates 512 bytes that are retained for the duration of the call. After the calls are finished, the memory is deallocated, so there's no memory leak there. One thing we might look into is whether we should cap the memory available for these calls.

memprofile-10:

(pprof) top30
Showing nodes accounting for 1554.85MB, 96.66% of 1608.54MB total
Dropped 107 nodes (cum <= 8.04MB)
Showing top 30 nodes out of 57
      flat  flat%   sum%        cum   cum%
         0     0%     0%  1053.76MB 65.51%  github.com/ethereum/go-ethereum/eth.(*Ethereum).startBloomHandlers.func1
         0     0%     0%  1044.51MB 64.94%  github.com/ethereum/go-ethereum/common/bitutil.DecompressBytes
         0     0%     0%  1044.51MB 64.94%  github.com/ethereum/go-ethereum/common/bitutil.bitsetDecodeBytes
 1044.51MB 64.94% 64.94%  1044.51MB 64.94%  github.com/ethereum/go-ethereum/common/bitutil.bitsetDecodePartialBytes
  353.83MB 22.00% 86.93%   353.83MB 22.00%  github.com/ethereum/go-ethereum/core/bloombits.(*scheduler).scheduleRequests

memprofile-13:

Showing nodes accounting for 156.21MB, 94.17% of 165.89MB total
Dropped 42 nodes (cum <= 0.83MB)
Showing top 30 nodes out of 81
      flat  flat%   sum%        cum   cum%
         0     0%     0%   139.37MB 84.01%  gopkg.in/urfave/cli%2ev1.(*App).Run
         0     0%     0%   139.37MB 84.01%  gopkg.in/urfave/cli%2ev1.HandleAction
         0     0%     0%   139.37MB 84.01%  main.geth
         0     0%     0%   137.86MB 83.10%  main.main
         0     0%     0%   137.86MB 83.10%  runtime.main
         0     0%     0%   135.37MB 81.60%  github.com/ethereum/go-ethereum/cmd/utils.RegisterEthService
         0     0%     0%   135.37MB 81.60%  github.com/ethereum/go-ethereum/eth.New

Thank you very much for the data @ivica7!

@ivica7
Author

ivica7 commented Mar 23, 2021

@MariusVanDerWijden thank you for analyzing the problem so quickly! I just wanted to ask: would it make sense to add a limit on how many requests can be processed in parallel, i.e. some kind of worker pool, possibly controlled via CLI params like --rpcworkers / --wsworkers? Also, is it possible to reuse the data between requests?
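
For illustration, the kind of cap being asked about here could look roughly like a buffered-channel semaphore around the request handlers. This is only a sketch: none of these names exist in geth, and the limit of 4 is arbitrary.

package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// limiter caps how many handlers may run concurrently, using a buffered
// channel as a counting semaphore.
type limiter struct {
    slots chan struct{}
}

func newLimiter(n int) *limiter {
    return &limiter{slots: make(chan struct{}, n)}
}

// run waits for a free slot (or context cancellation), then executes fn.
func (l *limiter) run(ctx context.Context, fn func()) error {
    select {
    case l.slots <- struct{}{}:
    case <-ctx.Done():
        return ctx.Err()
    }
    defer func() { <-l.slots }()
    fn()
    return nil
}

func main() {
    lim := newLimiter(4) // e.g. at most 4 heavy RPC handlers at once
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            lim.run(context.Background(), func() {
                time.Sleep(100 * time.Millisecond) // stand-in for a heavy eth_getLogs
                fmt.Println("handled request", id)
            })
        }(i)
    }
    wg.Wait()
}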

And one more bonus question, if you have time, something I've always wanted to know: when filtering events, bloom filters are used, and these are checked block by block, right? There is no hierarchy of bloom filters that would let you exclude multiple blocks at once? So for our case with 11M blocks, that would mean 11M bloom filter checks when getting logs with from: 0?

@MariusVanDerWijden
Member

Yeah, we've been discussing something like that already. I've implemented something similar for tracing in #21755, but that isn't finalized yet.

Yes, afaict the bloom filters are checked block by block, see

for ; f.begin <= int64(end); f.begin++ {

I think that's the big problem in your use case: the getLogs calls take long to execute and retain a lot of memory per request, so the next requests take even longer and retain their memory even longer, and so on. I would suggest (if it fits your use case) limiting the from field in getLogs to something more reasonable.
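
For illustration, what bounding the range can look like from the client side with ethclient; the endpoint, contract address and window size are placeholders:

package main

import (
    "context"
    "fmt"
    "log"
    "math/big"

    "github.com/ethereum/go-ethereum"
    "github.com/ethereum/go-ethereum/common"
    "github.com/ethereum/go-ethereum/ethclient"
)

func main() {
    client, err := ethclient.Dial("ws://localhost:8546") // placeholder endpoint
    if err != nil {
        log.Fatal(err)
    }
    head, err := client.BlockNumber(context.Background())
    if err != nil {
        log.Fatal(err)
    }
    from := int64(head) - 50000 // query a recent window instead of from: 0
    if from < 0 {
        from = 0
    }
    query := ethereum.FilterQuery{
        FromBlock: big.NewInt(from),
        ToBlock:   big.NewInt(int64(head)),
        // Placeholder contract address; filtering by address/topics also narrows the scan.
        Addresses: []common.Address{common.HexToAddress("0x0000000000000000000000000000000000000000")},
    }
    logs, err := client.FilterLogs(context.Background(), query)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("logs in window:", len(logs))
}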

I'll close the issue for now, if that's okay

@nyetwurk
Contributor

v1.10.4 is leaking with minimal RPC load now.
