
File transfer sometimes doesn't work #1572


Open
DasBabyPixel opened this issue Jan 19, 2025 · 2 comments · May be fixed by #1578

Labels
s: needs triage (Issue waiting for triage) · t: bug (Something isn't working as intended)

Comments

DasBabyPixel (Contributor) commented Jan 19, 2025

Stacktrace

Actions to reproduce

I have a setup where a minigame server downloads the map to play on when the game starts.
This is done by calling TemplateStorage#openZipInputStreamAsync from the server main thread.
Sometimes the call never returns, causing the minigame server to be stuck in a waiting state.
As soon as that happens, all other requests also never get answered.
Because of that, as soon as it breaks once (usually once or twice a day), I have to restart the entire node to get the servers working again.
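A defensive pattern for the wrapper side would be to bound the blocking wait instead of waiting forever. The following is a minimal, self-contained sketch using plain java.util.concurrent rather than the CloudNet Task API; `stuckTransfer` and `awaitWithTimeout` are hypothetical names standing in for the real openZipInputStreamAsync call and the code that waits on it:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TransferTimeoutDemo {

    // Stand-in for a transfer future that never completes, like the
    // stuck zip request described above (hypothetical, for illustration).
    static CompletableFuture<String> stuckTransfer() {
        return new CompletableFuture<>(); // never completed
    }

    // Instead of blocking forever with join(), bound the wait so a lost
    // response surfaces as a TimeoutException instead of a hung main thread.
    static String awaitWithTimeout(CompletableFuture<String> transfer) {
        try {
            return transfer.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return "timed out";
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitWithTimeout(stuckTransfer())); // prints "timed out"
    }
}
```

This would not fix the underlying node bug, but it would turn a silent hang into a diagnosable failure on the server side.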

This, to me, seems like an issue in the node.
I believe the node somehow tries to send the packet to the wrapper, but the packet never makes it all the way. The curious thing is that any zip request packet from another server also never makes it back to the wrapper. There must be some sort of state in the node that breaks.
Or some sort of deadlock, though I don't believe that. Maybe a race condition corrupting a state? The logging messages suggest the packets do get sent to the wrapper, so could this be a Netty bug?
But that also seems unlikely, because other packets like ChannelMessages work fine.
I have no clue what exactly is happening here; those are just the things I have thought of so far.

I have done some debugging on my own to try to get more information about this issue. The issue is somewhat recent (I haven't always updated to the latest snapshots), but I don't think it is much older than 6 months, because everything worked fine until I updated the node.
My logging customizations are here.

The exact build I am running is here.
My exact build has some more customizations, but mostly to module loading; nothing that should impact packet handling/file transfer except disabling zip compression.

CloudNet version

[19.01 15:11:08.538] INFO :
[19.01 15:11:08.538] INFO : CloudNet Blizzard 4.0.0-RC12-SNAPSHOT f18671a
[19.01 15:11:08.538] INFO : Discord: https://discord.cloudnetservice.eu/
[19.01 15:11:08.538] INFO :
[19.01 15:11:08.538] INFO : ClusterId: ae0bbf39--431d--857e2580ae82
[19.01 15:11:08.538] INFO : NodeId: Node-1
[19.01 15:11:08.538] INFO : Head-NodeId: Node-1
[19.01 15:11:08.538] INFO : CPU usage: (P/S) .18/11.45/100%
[19.01 15:11:08.538] INFO : Node services memory allocation (U/R/M): 5532/5532/16384 MB
[19.01 15:11:08.538] INFO : Threads: 55
[19.01 15:11:08.538] INFO : Heap usage: 198/256MB
[19.01 15:11:08.538] INFO : JVM: Eclipse Adoptium 23 (OpenJDK 64-Bit Server VM 23.0.1+11)
[19.01 15:11:08.538] INFO : Update Repo: CloudNetService/launchermeta, Update Branch: beta (development mode)
[19.01 15:11:08.538] INFO :

Other

This is how the logging with my build should look:

minigame server

[19.01 01:59:56.923] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Custom session registered: eu.cloudnetservice.driver.network.chunk.defaults.DefaultFileChunkedPacketHandler@1130df1c
[19.01 01:59:56.923] DEBUG: SendPacketSync BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@81a9884, prioritized=false, creationStamp=2025-01-19T00:59:56.923751720Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:56.958] DEBUG: Received packet multithreadEventLoopGroup-1-1: BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@ecfba07, prioritized=false, creationStamp=2025-01-19T00:59:56.958636121Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:57.087] DEBUG: Received packet multithreadEventLoopGroup-1-1: BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@202ef912, prioritized=false, creationStamp=2025-01-19T00:59:57.087004118Z, uniqueId=null)
[19.01 01:59:57.087] DEBUG: HandlePacket Packet-Dispatcher-1 NetworkClientChannelHandler: BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@202ef912, prioritized=false, creationStamp=2025-01-19T00:59:57.087004118Z, uniqueId=null)
[19.01 01:59:57.087] INFO : Receive on packet-com channel
[19.01 01:59:57.109] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Chunk received: ChunkSessionInformation[chunkSize=52428800, sessionUniqueId=9530b30c-e852-40c0-8935-1b59732308e9, transferChannel=query:dummy, transferInformation=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@7e21850] - Complete: true - Handler: eu.cloudnetservice.driver.network.chunk.defaults.DefaultFileChunkedPacketHandler@1130df1c
[19.01 01:59:57.111] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Session complete
[19.01 01:59:57.112] INFO : Receive on packet-com channel

node

[19.01 01:59:56.923] DEBUG: Received packet multithreadEventLoopGroup-4-4: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@3efdf11c, prioritized=false, creationStamp=2025-01-19T00:59:56.923950911Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:56.924] DEBUG: HandlePacket Packet-Dispatcher-3 DefaultNetworkServerChannelHandler: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@3efdf11c, prioritized=false, creationStamp=2025-01-19T00:59:56.923950911Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:56.924] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] File request received: remote_templates_zip_template
[19.01 01:59:56.924] DEBUG: Calling event eu.cloudnetservice.driver.network.chunk.event.FileQueryRequestEvent on listener eu.cloudnetservice.node.network.chunk.FileDeployCallbackListener
[19.01 01:59:56.958] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Responding to file request 10
[19.01 01:59:56.958] DEBUG: SendPacket BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@203cea6a, prioritized=false, creationStamp=2025-01-19T00:59:56.958483776Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:57.078] DEBUG: SendPacketSync BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@3ecf4574, prioritized=false, creationStamp=2025-01-19T00:59:57.078346121Z, uniqueId=null)
[19.01 01:59:57.080] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Sending last packet 0

How the log looks when it breaks

minigame server

...
[19.01 14:53:05.534] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] Custom session registered: eu.cloudnetservice.driver.network.chunk.defaults.DefaultFileChunkedPacketHandler@430ab449
[19.01 14:53:05.534] DEBUG: SendPacketSync BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@612537eb, prioritized=false, creationStamp=2025-01-19T13:53:05.534742362Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.555] DEBUG: Received packet multithreadEventLoopGroup-1-1: BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@7e141535, prioritized=false, creationStamp=2025-01-19T13:53:05.555921815Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:06.286] DEBUG: execute query: SELECT `id`, `msg` FROM `luckperms_messenger` WHERE `id` > ? AND (NOW() - `time` < 30)
...

node

[19.01 14:53:05.534] DEBUG: Received packet multithreadEventLoopGroup-4-2: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@17930c33, prioritized=false, creationStamp=2025-01-19T13:53:05.534921786Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.534] DEBUG: HandlePacket Packet-Dispatcher-2 DefaultNetworkServerChannelHandler: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@17930c33, prioritized=false, creationStamp=2025-01-19T13:53:05.534921786Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.535] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] File request received: remote_templates_zip_template
[19.01 14:53:05.535] DEBUG: Calling event eu.cloudnetservice.driver.network.chunk.event.FileQueryRequestEvent on listener eu.cloudnetservice.node.network.chunk.FileDeployCallbackListener
[19.01 14:53:05.553] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] Responding to file request 8
[19.01 14:53:05.555] DEBUG: SendPacket BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@58a0c028, prioritized=false, creationStamp=2025-01-19T13:53:05.555636121Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.666] DEBUG: SendPacketSync BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@7d488351, prioritized=false, creationStamp=2025-01-19T13:53:05.666178521Z, uniqueId=null)
[19.01 14:53:05.939] DEBUG: Calling event eu.cloudnetservice.node.event.instance.CloudNetTickServiceStartEvent on listener eu.darkcube.minigame.woolbattle.module.ServiceListener
[19.01 14:53:06.225] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] Sending last packet 0

I filtered the node logs and removed the listener debug messages; otherwise the logs would be very cluttered.
The minigame server logs are from the .wrapper/logs directory.

I have also created heap dumps of the problematic servers and saved the logs, should any questions arise here.

Issue uniqueness

  • Yes, this issue is unique. There are no similar issues.

EDIT1

Trying to create thousands of zip requests hasn't broken anything for me yet, so this is difficult to reproduce.
The 1-2 breakages per day happen from only ~15 requests spread over the day.

EDIT2

I uploaded the relevant log files
working minigame
working node
broken minigame
broken node

@DasBabyPixel DasBabyPixel added s: needs triage Issue waiting for triage t: bug Something isn't working as intended labels Jan 19, 2025
@DasBabyPixel DasBabyPixel changed the title File transfer sometimes doesn't work (race condition/deadlock?) File transfer sometimes doesn't work Jan 21, 2025
DasBabyPixel (Contributor, Author) commented

I finally figured this out. Repeatedly sending medium-sized templates over the network with a node memory of 256 MB can cause problems.
Netty allocates buffers of 3-5 MB in my case, which is below jdk.nio.maxCachedBufferSize.


These buffers are cached, and the cache is rather large.


Because of that, the off-heap memory never gets cleared. After a while, the off-heap memory fills up and you get an OutOfMemoryError. But that error is silently swallowed, which is a problem, because

    1. it is impossible to debug, and
    2. it can start memory leaks on the wrapper if a transfer never completes.

I inserted some code at the right place to get this error.

[21.01 03:21:03.429] ERROR: Failed to send packet
io.netty5.handler.codec.EncoderException: java.lang.OutOfMemoryError: Cannot reserve 3956330 bytes of direct buffer memory (allocated: 267155146, limit: 268435456)
        at io.netty5.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:95)
        at io.netty5.channel.DefaultChannelHandlerContext.invokeWrite(DefaultChannelHandlerContext.java:853)
        at io.netty5.channel.DefaultChannelHandlerContext$AbstractWriteTask.write(DefaultChannelHandlerContext.java:1216)
        at io.netty5.channel.DefaultChannelHandlerContext$WriteAndFlushTask.write(DefaultChannelHandlerContext.java:1267)
        at io.netty5.channel.DefaultChannelHandlerContext$AbstractWriteTask.run(DefaultChannelHandlerContext.java:1186)
        at io.netty5.util.concurrent.SingleThreadEventExecutor.runTask(SingleThreadEventExecutor.java:338)
        at io.netty5.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:361)
        at io.netty5.channel.SingleThreadEventLoop.run(SingleThreadEventLoop.java:180)
        at io.netty5.util.concurrent.SingleThreadEventExecutor.lambda$doStartThread$4(SingleThreadEventExecutor.java:774)
        at io.netty5.util.internal.ThreadExecutorMap.lambda$apply$1(ThreadExecutorMap.java:68)
        at io.netty5.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.OutOfMemoryError: Cannot reserve 3956330 bytes of direct buffer memory (allocated: 267155146, limit: 268435456)
        at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:111)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:363)
        at io.netty5.buffer.bytebuffer.ByteBufferMemoryManager.allocateShared(ByteBufferMemoryManager.java:51)
        at eu.cloudnetservice.driver.impl.network.netty.buffer.NettyNioBufferReleasingAllocator.allocate(NettyNioBufferReleasingAllocator.java:83)
        at eu.cloudnetservice.driver.impl.network.netty.codec.NettyPacketEncoder.allocateBuffer(NettyPacketEncoder.java:61)
        at eu.cloudnetservice.driver.impl.network.netty.codec.NettyPacketEncoder.allocateBuffer(NettyPacketEncoder.java:41)
        at io.netty5.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:79)
        ... 11 common frames omitted
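The kind of failure logging described above, surfacing an asynchronous send failure instead of letting it vanish, can be illustrated with a plain CompletableFuture. This sketch does not use the actual Netty channel API; `failingWrite` and `describeResult` are hypothetical stand-ins for a channel write and its completion callback:

```java
import java.util.concurrent.CompletableFuture;

public class WriteFailureLogging {

    // Stand-in for a channel write that fails asynchronously, the way the
    // encoder above throws after the send call has already returned.
    static CompletableFuture<Void> failingWrite() {
        return CompletableFuture.failedFuture(
            new OutOfMemoryError("Cannot reserve direct buffer memory"));
    }

    // Attach a completion callback so asynchronous failures produce a log
    // line instead of disappearing; returns the message that would be logged.
    static String describeResult(CompletableFuture<Void> writeFuture) {
        return writeFuture
            .handle((unused, cause) -> cause == null
                ? "packet sent"
                : "Failed to send packet: " + cause.getMessage())
            .join();
    }

    public static void main(String[] args) {
        // prints "Failed to send packet: Cannot reserve direct buffer memory"
        System.out.println(describeResult(failingWrite()));
    }
}
```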

I will go to sleep for now, it's 4am and I'm done. I will open a PR to fix this, probably tomorrow, after a good night's sleep and a clear head.

DasBabyPixel (Contributor, Author) commented Jan 21, 2025

I have lied: Netty does not cache buffers.
Instead, when handling files, the JVM caches buffers, probably for performance. But when handling large files (like zipped templates), with the requests distributed to several worker threads, each thread with a cache size of 1024 (on my machine), and each cached buffer holding up to 5 MB of data, this adds up:
6 (worker threads) * 1024 (cache size) * 5 MB (small zip template) ≈ 30 GB...
With a max off-heap memory size of 256 MB, which defaults to the -Xmx value, the off-heap memory can fill up fast.
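The back-of-the-envelope calculation above, as a small self-contained check (the class name and the figures come from this comment, not from any CloudNet source):

```java
public class BufferCacheMath {

    // Worst-case footprint of the per-thread NIO buffer caches with an
    // unbounded jdk.nio.maxCachedBufferSize, using the figures above.
    static long worstCaseGiB() {
        long workerThreads = 6;              // netty worker threads on this node
        long cacheSlotsPerThread = 1024;     // per-thread buffer cache capacity
        long bufferBytes = 5L * 1024 * 1024; // ~5 MB per cached buffer
        long totalBytes = workerThreads * cacheSlotsPerThread * bufferBytes;
        return totalBytes / (1024L * 1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println(worstCaseGiB() + " GiB"); // prints "30 GiB"
    }
}
```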


It is possible to reduce the maximum size of these cached buffers, or even disable the cache entirely, with the jdk.nio.maxCachedBufferSize system property.

For the node this can be done with

# Greatly reduces the maximum size (default: unlimited) of buffers cached by the JVM. These cached buffers can cause
# OutOfMemoryErrors: with many netty threads, each thread having its own cache capable of storing hundreds
# to thousands of buffers, this adds up.
#var jdk.nio.maxCachedBufferSize 512

For the wrapper this can be done by specifying the jvmArgs in the task.
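As a sketch of the wrapper side, the task definition could carry the flag roughly like this. Note that the field name jvmArguments and the surrounding structure are assumptions about the task file format, not something confirmed in this issue; the value mirrors the 512-byte limit used for the node above:

```json
{
  "name": "minigame",
  "jvmArguments": [
    "-Djdk.nio.maxCachedBufferSize=512"
  ]
}
```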
