
File transfer sometimes doesn't work #1572


Open
DasBabyPixel opened this issue Jan 19, 2025 · 2 comments · May be fixed by #1578

Labels
s: needs triage (Issue waiting for triage) · t: bug (Something isn't working as intended)

Comments

DasBabyPixel (Contributor) commented Jan 19, 2025

Stacktrace

Actions to reproduce

I have a setup where a minigame server downloads the map to play on when the game starts.
This is done by calling TemplateStorage#openZipInputStreamAsync from the server main thread.
Sometimes the call never returns, causing the minigame server to be stuck in a waiting state.
As soon as that happens, all other requests also never get answered.
Because of that, as soon as it breaks once (usually once or twice a day), I have to restart the entire node to get the servers working again.
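A defensive pattern for the wrapper side would be to bound the blocking wait instead of waiting forever. The following is a minimal, self-contained sketch using plain java.util.concurrent rather than the CloudNet Task API; `stuckTransfer` and `awaitWithTimeout` are hypothetical names standing in for the real openZipInputStreamAsync call and the code that waits on it:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TransferTimeoutDemo {

    // Stand-in for a transfer future that never completes, like the
    // stuck zip request described above (hypothetical, for illustration).
    static CompletableFuture<String> stuckTransfer() {
        return new CompletableFuture<>(); // never completed
    }

    // Instead of blocking forever with join(), bound the wait so a lost
    // response surfaces as a TimeoutException instead of a hung main thread.
    static String awaitWithTimeout(CompletableFuture<String> transfer) {
        try {
            return transfer.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return "timed out";
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitWithTimeout(stuckTransfer())); // prints "timed out"
    }
}
```

This would not fix the underlying node bug, but it would turn a silent hang into a diagnosable failure on the server side.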

This, to me, seems like an issue in the node.
I believe the node somehow tries to send the packet to the wrapper, but the packet never makes it all the way. The curious thing is that any zip request packet from another server also never makes it back to the wrapper. There must be some sort of state in the node that breaks.
Or some sort of deadlock, though I don't believe that. Maybe a race condition corrupting a state? The logging messages suggest the packets do get sent to the wrapper, so could this be a Netty bug?
But that also seems unlikely, because other packets like ChannelMessages work fine.
I have no clue what exactly is happening here; those are just the things I have thought of so far.

I have done some debugging on my own to try to get more information about this issue. The issue is somewhat recent (I haven't always updated to the latest snapshots), but I don't think it is much older than 6 months, because everything worked fine until I updated the node.
My logging customizations are here.

The exact build I am running is here.
My exact build has some more customizations, but mostly to module loading; nothing that should impact packet handling/file transfer except disabling zip compression.

CloudNet version

[19.01 15:11:08.538] INFO :
[19.01 15:11:08.538] INFO : CloudNet Blizzard 4.0.0-RC12-SNAPSHOT f18671a
[19.01 15:11:08.538] INFO : Discord: https://discord.cloudnetservice.eu/
[19.01 15:11:08.538] INFO :
[19.01 15:11:08.538] INFO : ClusterId: ae0bbf39--431d--857e2580ae82
[19.01 15:11:08.538] INFO : NodeId: Node-1
[19.01 15:11:08.538] INFO : Head-NodeId: Node-1
[19.01 15:11:08.538] INFO : CPU usage: (P/S) .18/11.45/100%
[19.01 15:11:08.538] INFO : Node services memory allocation (U/R/M): 5532/5532/16384 MB
[19.01 15:11:08.538] INFO : Threads: 55
[19.01 15:11:08.538] INFO : Heap usage: 198/256MB
[19.01 15:11:08.538] INFO : JVM: Eclipse Adoptium 23 (OpenJDK 64-Bit Server VM 23.0.1+11)
[19.01 15:11:08.538] INFO : Update Repo: CloudNetService/launchermeta, Update Branch: beta (development mode)
[19.01 15:11:08.538] INFO :

Other

This is how the logging with my build should look:

minigame server

[19.01 01:59:56.923] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Custom session registered: eu.cloudnetservice.driver.network.chunk.defaults.DefaultFileChunkedPacketHandler@1130df1c
[19.01 01:59:56.923] DEBUG: SendPacketSync BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@81a9884, prioritized=false, creationStamp=2025-01-19T00:59:56.923751720Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:56.958] DEBUG: Received packet multithreadEventLoopGroup-1-1: BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@ecfba07, prioritized=false, creationStamp=2025-01-19T00:59:56.958636121Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:57.087] DEBUG: Received packet multithreadEventLoopGroup-1-1: BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@202ef912, prioritized=false, creationStamp=2025-01-19T00:59:57.087004118Z, uniqueId=null)
[19.01 01:59:57.087] DEBUG: HandlePacket Packet-Dispatcher-1 NetworkClientChannelHandler: BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@202ef912, prioritized=false, creationStamp=2025-01-19T00:59:57.087004118Z, uniqueId=null)
[19.01 01:59:57.087] INFO : Receive on packet-com channel
[19.01 01:59:57.109] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Chunk received: ChunkSessionInformation[chunkSize=52428800, sessionUniqueId=9530b30c-e852-40c0-8935-1b59732308e9, transferChannel=query:dummy, transferInformation=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@7e21850] - Complete: true - Handler: eu.cloudnetservice.driver.network.chunk.defaults.DefaultFileChunkedPacketHandler@1130df1c
[19.01 01:59:57.111] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Session complete
[19.01 01:59:57.112] INFO : Receive on packet-com channel

node

[19.01 01:59:56.923] DEBUG: Received packet multithreadEventLoopGroup-4-4: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@3efdf11c, prioritized=false, creationStamp=2025-01-19T00:59:56.923950911Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:56.924] DEBUG: HandlePacket Packet-Dispatcher-3 DefaultNetworkServerChannelHandler: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@3efdf11c, prioritized=false, creationStamp=2025-01-19T00:59:56.923950911Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:56.924] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] File request received: remote_templates_zip_template
[19.01 01:59:56.924] DEBUG: Calling event eu.cloudnetservice.driver.network.chunk.event.FileQueryRequestEvent on listener eu.cloudnetservice.node.network.chunk.FileDeployCallbackListener
[19.01 01:59:56.958] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Responding to file request 10
[19.01 01:59:56.958] DEBUG: SendPacket BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@203cea6a, prioritized=false, creationStamp=2025-01-19T00:59:56.958483776Z, uniqueId=99ea61d9-ce73-4320-a94d-6b09e7c5085b)
[19.01 01:59:57.078] DEBUG: SendPacketSync BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@3ecf4574, prioritized=false, creationStamp=2025-01-19T00:59:57.078346121Z, uniqueId=null)
[19.01 01:59:57.080] INFO : [9530b30c-e852-40c0-8935-1b59732308e9] Sending last packet 0

How the log looks when it breaks

minigame server

...
[19.01 14:53:05.534] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] Custom session registered: eu.cloudnetservice.driver.network.chunk.defaults.DefaultFileChunkedPacketHandler@430ab449
[19.01 14:53:05.534] DEBUG: SendPacketSync BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@612537eb, prioritized=false, creationStamp=2025-01-19T13:53:05.534742362Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.555] DEBUG: Received packet multithreadEventLoopGroup-1-1: BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@7e141535, prioritized=false, creationStamp=2025-01-19T13:53:05.555921815Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:06.286] DEBUG: execute query: SELECT `id`, `msg` FROM `luckperms_messenger` WHERE `id` > ? AND (NOW() - `time` < 30)
...

node

[19.01 14:53:05.534] DEBUG: Received packet multithreadEventLoopGroup-4-2: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@17930c33, prioritized=false, creationStamp=2025-01-19T13:53:05.534921786Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.534] DEBUG: HandlePacket Packet-Dispatcher-2 DefaultNetworkServerChannelHandler: BasePacket(channel=1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyImmutableDataBuf@17930c33, prioritized=false, creationStamp=2025-01-19T13:53:05.534921786Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.535] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] File request received: remote_templates_zip_template
[19.01 14:53:05.535] DEBUG: Calling event eu.cloudnetservice.driver.network.chunk.event.FileQueryRequestEvent on listener eu.cloudnetservice.node.network.chunk.FileDeployCallbackListener
[19.01 14:53:05.553] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] Responding to file request 8
[19.01 14:53:05.555] DEBUG: SendPacket BasePacket(channel=-1, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@58a0c028, prioritized=false, creationStamp=2025-01-19T13:53:05.555636121Z, uniqueId=3b88d869-9375-4def-b4ea-ec068b16ac53)
[19.01 14:53:05.666] DEBUG: SendPacketSync BasePacket(channel=2, dataBuf=eu.cloudnetservice.driver.network.netty.buffer.NettyMutableDataBuf@7d488351, prioritized=false, creationStamp=2025-01-19T13:53:05.666178521Z, uniqueId=null)
[19.01 14:53:05.939] DEBUG: Calling event eu.cloudnetservice.node.event.instance.CloudNetTickServiceStartEvent on listener eu.darkcube.minigame.woolbattle.module.ServiceListener
[19.01 14:53:06.225] INFO : [c312112f-c124-43b4-82d1-a96a1c167ce6] Sending last packet 0

I filtered the node logs and removed the listener debug messages; otherwise the logs would be very cluttered.
The minigame server logs are from the .wrapper/logs directory.

I have also created heap dumps of the problematic servers and saved the logs, should any questions arise here.

Issue uniqueness

  • Yes, this issue is unique. There are no similar issues.

EDIT1

Trying to create thousands of zip requests hasn't broken anything for me yet, so this is difficult to reproduce.
The 1-2 breakages per day happen from only ~15 requests spread over the day.

EDIT2

I uploaded the relevant log files
working minigame
working node
broken minigame
broken node

@DasBabyPixel DasBabyPixel added s: needs triage Issue waiting for triage t: bug Something isn't working as intended labels Jan 19, 2025
@DasBabyPixel DasBabyPixel changed the title File transfer sometimes doesn't work (race condition/deadlock?) File transfer sometimes doesn't work Jan 21, 2025
DasBabyPixel (Contributor, Author) commented

I finally figured this out. Repeatedly sending medium-sized templates over the network with a node memory of 256 MB can cause problems.
Netty allocates buffers of 3-5 MB in my case, which is below jdk.nio.maxCachedBufferSize.


These buffers are cached, and the cache is rather large.


Because of that, the off-heap memory never gets cleared. After a while, the off-heap memory fills up and you get an OutOfMemoryError. But that error is silently swallowed, which is a problem, because

    1. it is impossible to debug, and
    2. it can start memory leaks on the wrapper if a transfer never completes.

I inserted some code at the right place to get this error.

[21.01 03:21:03.429] ERROR: Failed to send packet
io.netty5.handler.codec.EncoderException: java.lang.OutOfMemoryError: Cannot reserve 3956330 bytes of direct buffer memory (allocated: 267155146, limit: 268435456)
        at io.netty5.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:95)
        at io.netty5.channel.DefaultChannelHandlerContext.invokeWrite(DefaultChannelHandlerContext.java:853)
        at io.netty5.channel.DefaultChannelHandlerContext$AbstractWriteTask.write(DefaultChannelHandlerContext.java:1216)
        at io.netty5.channel.DefaultChannelHandlerContext$WriteAndFlushTask.write(DefaultChannelHandlerContext.java:1267)
        at io.netty5.channel.DefaultChannelHandlerContext$AbstractWriteTask.run(DefaultChannelHandlerContext.java:1186)
        at io.netty5.util.concurrent.SingleThreadEventExecutor.runTask(SingleThreadEventExecutor.java:338)
        at io.netty5.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:361)
        at io.netty5.channel.SingleThreadEventLoop.run(SingleThreadEventLoop.java:180)
        at io.netty5.util.concurrent.SingleThreadEventExecutor.lambda$doStartThread$4(SingleThreadEventExecutor.java:774)
        at io.netty5.util.internal.ThreadExecutorMap.lambda$apply$1(ThreadExecutorMap.java:68)
        at io.netty5.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.OutOfMemoryError: Cannot reserve 3956330 bytes of direct buffer memory (allocated: 267155146, limit: 268435456)
        at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:111)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:363)
        at io.netty5.buffer.bytebuffer.ByteBufferMemoryManager.allocateShared(ByteBufferMemoryManager.java:51)
        at eu.cloudnetservice.driver.impl.network.netty.buffer.NettyNioBufferReleasingAllocator.allocate(NettyNioBufferReleasingAllocator.java:83)
        at eu.cloudnetservice.driver.impl.network.netty.codec.NettyPacketEncoder.allocateBuffer(NettyPacketEncoder.java:61)
        at eu.cloudnetservice.driver.impl.network.netty.codec.NettyPacketEncoder.allocateBuffer(NettyPacketEncoder.java:41)
        at io.netty5.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:79)
        ... 11 common frames omitted
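The kind of failure logging described above, surfacing an asynchronous send failure instead of letting it vanish, can be illustrated with a plain CompletableFuture. This sketch does not use the actual Netty channel API; `failingWrite` and `describeResult` are hypothetical stand-ins for a channel write and its completion callback:

```java
import java.util.concurrent.CompletableFuture;

public class WriteFailureLogging {

    // Stand-in for a channel write that fails asynchronously, the way the
    // encoder above throws after the send call has already returned.
    static CompletableFuture<Void> failingWrite() {
        return CompletableFuture.failedFuture(
            new OutOfMemoryError("Cannot reserve direct buffer memory"));
    }

    // Attach a completion callback so asynchronous failures produce a log
    // line instead of disappearing; returns the message that would be logged.
    static String describeResult(CompletableFuture<Void> writeFuture) {
        return writeFuture
            .handle((unused, cause) -> cause == null
                ? "packet sent"
                : "Failed to send packet: " + cause.getMessage())
            .join();
    }

    public static void main(String[] args) {
        // prints "Failed to send packet: Cannot reserve direct buffer memory"
        System.out.println(describeResult(failingWrite()));
    }
}
```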

I will go to sleep for now, it's 4am and I'm done. I will open a PR to fix this, probably tomorrow, after a good night's sleep and a clear head.

DasBabyPixel (Contributor, Author) commented Jan 21, 2025

I have lied: Netty does not cache buffers.
Instead, when handling files, the JVM caches buffers, probably for performance. But when handling large files (like zipped templates), with the requests distributed to several worker threads, each thread with a cache size of 1024 (on my machine), and each cached buffer holding up to 5 MB of data, this adds up:
6 (worker threads) * 1024 (cache size) * 5 MB (small zip template) ≈ 30 GB...
With a max off-heap memory size of 256 MB, which defaults to the -Xmx value, the off-heap memory can fill up fast.
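The back-of-the-envelope calculation above, as a small self-contained check (the class name and the figures come from this comment, not from any CloudNet source):

```java
public class BufferCacheMath {

    // Worst-case footprint of the per-thread NIO buffer caches with an
    // unbounded jdk.nio.maxCachedBufferSize, using the figures above.
    static long worstCaseGiB() {
        long workerThreads = 6;              // netty worker threads on this node
        long cacheSlotsPerThread = 1024;     // per-thread buffer cache capacity
        long bufferBytes = 5L * 1024 * 1024; // ~5 MB per cached buffer
        long totalBytes = workerThreads * cacheSlotsPerThread * bufferBytes;
        return totalBytes / (1024L * 1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println(worstCaseGiB() + " GiB"); // prints "30 GiB"
    }
}
```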


It is possible to reduce the maximum size of these cached buffers, or even disable the cache entirely, with the jdk.nio.maxCachedBufferSize system property.

For the node this can be done with

# Greatly reduces the maximum size (default: unlimited) of buffers cached by the JVM. These cached buffers can cause
# OutOfMemoryErrors: with many netty threads, each thread having its own cache capable of storing hundreds
# to thousands of buffers, this adds up.
#var jdk.nio.maxCachedBufferSize 512

For the wrapper this can be done by specifying the jvmArgs in the task.
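As a sketch of the wrapper side, the task definition could carry the flag roughly like this. Note that the field name jvmArguments and the surrounding structure are assumptions about the task file format, not something confirmed in this issue; the value mirrors the 512-byte limit used for the node above:

```json
{
  "name": "minigame",
  "jvmArguments": [
    "-Djdk.nio.maxCachedBufferSize=512"
  ]
}
```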
