-
Notifications
You must be signed in to change notification settings - Fork 7.5k
TCP: fix deadlock on slow peer #14944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Codecov Report
@@ Coverage Diff @@
## master #14944 +/- ##
==========================================
+ Coverage 52.87% 52.88% +0.01%
==========================================
Files 309 309
Lines 45267 45288 +21
Branches 10451 10453 +2
==========================================
+ Hits 23934 23951 +17
- Misses 16558 16560 +2
- Partials 4775 4777 +2
Continue to review full report at Codecov.
|
subsys/net/ip/tcp.c
Outdated
@@ -379,6 +379,8 @@ static int prepare_segment(struct net_tcp *tcp, | |||
struct net_context *context = tcp->context; | |||
struct net_buf *tail = NULL; | |||
struct net_tcp_hdr *tcp_hdr; | |||
struct net_if *interface; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why these are introduced here, if they are used only in the else {}
block (and thus can be very well declared there?).
@@ -2099,7 +2109,9 @@ NET_CONN_CB(tcp_synack_received) | |||
struct net_tcp_hdr *tcp_hdr = proto_hdr->tcp; | |||
int ret; | |||
|
|||
NET_ASSERT(context && context->tcp); | |||
NET_ASSERT(context); | |||
k_mutex_lock(&context->lock, K_FOREVER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is lock stuffed between 2 asserts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because context->tcp
can in theory change if someone else is holding the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because context->tcp can in theory change if someone else is holding the lock.
Ok, ack. I bet I'd personally add a comment on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updated patch and glad to see that @tbursztyka converged on it too. But I think there's a room to polish this patch, please see previous inline comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nitpick about commit subject. Needs some more testing, which is most probably the trickiest part here.
I did some testing with
I used native_posix, so in Linux side I ran After zephyr is started (the command I used is The zephyr console prints this info:
The double free information is worrying even though it does not cause any harm here as the code bails out if it notices this issue. Without this PR, I am not seeing this error. |
That is indeed strange. If we pass through line 2451 in _tcp_syn_rcvd(), all functions (processing_data()->process_data()->net_ipv4_input()->net_conn_input()->tcp_syn_rcvd()) should return NET_OK until we are back in processing_data(). There the returned NET_OK should cause the net_pkt_unref() call to be skipped. |
@dgloeck, There're CI failures which look legitimate (i.e. don't look random):
Same + crash. Can you look into those? |
Confirmed that it's reproducible locally: tests/net/tcp:
I guess crashdump may just accompany the ASSERT failure. |
I know, will fix that this weekend. The tests call internal functions without locking the mutex that is now expected to be locked. |
Report was wrong, see next comment.
... |
The "hang" issues is probably fixed by #15045 (regression) |
The tests/net/tcp asserts should be resolved now. @jukkar's double unref with echo-server still needs to be debugged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spotted the "already freed" issue (see comment), after fixing that I did not see any errors.
Unlocks the mutex needed by the code for processing ACKs while waiting for TX packets to become available. Before the mutex can be unlocked, it must be locked by tcp_syn_rcvd(), tcp_synack_received(), and backlog_ack_timeout(), which in turn requires us to unlock it in net_tcp_connect() while waiting for tcp_synack_received() to be called. Fixes zephyrproject-rtos#14571 Signed-off-by: Daniel Glöckner <[email protected]>
Just tested this with native_posix board and valgrind:
For the testing I used Edit: the memory leak that I saw in dumb_http_server, is not present if running same command in master, so it is related to changes by this PR. |
Ping @dgloeck on the status/progress of this. Thanks. |
Closing since there hasn't been any response from the author |
Unlocks the mutex needed by the code for processing ACKs while waiting for TX packets to become available.
Before the mutex can be unlocked, it must be locked by tcp_syn_rcvd(), tcp_synack_received(), and backlog_ack_timeout(), which in turn requires us to unlock it in net_tcp_connect() while waiting for tcp_synack_received() to be called.
Fixes #14571