samples: net: lwm2m: Fix DTLS related issues #13477

rlubos · 2019-02-18T14:23:12Z

Fix LWM2M and DTLS issues observed in the LWM2M sample.

Signed-off-by: Robert Lubos [email protected]

codecov-io · 2019-02-18T15:01:15Z

Codecov Report

Merging #13477 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #13477   +/-   ##
=======================================
  Coverage   52.46%   52.46%           
=======================================
  Files         322      322           
  Lines       46597    46597           
  Branches    10769    10769           
=======================================
  Hits        24446    24446           
  Misses      17246    17246           
  Partials     4905     4905

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 882702e...93d0a76. Read the comment docs.

jukkar

LGTM

mike-scott

Tested good for me. Thanks for the fix @rlubos

pfalcon · 2019-02-19T15:17:19Z

subsys/net/lib/lwm2m/lwm2m_engine.c

@@ -3929,11 +3929,6 @@ static void socket_receive_loop(void)
 			len = recvfrom(sock_ctx[i]->sock_fd, in_buf,
 				       sizeof(in_buf) - 1, 0,
 				       &from_addr, &from_addr_len);
-			if (errno) {


I personally would explicitly mention in the commit message that there's already check in the code for len < 0 case.

pfalcon

So, layering violation strikes again. I would suggest to separate poll hacks from the rest of the patches. The latter can go in immediately, while poll hacks need to scrutinized.

pfalcon · 2019-02-19T15:23:23Z

subsys/net/lib/sockets/sockets_tls.c

+			(*pev)->mode = K_POLL_MODE_NOTIFY_ONLY;
+			(*pev)->state = K_POLL_STATE_NOT_READY;
+
+			goto again;


Please explain what happens here.

Handshake is complete, still we don't have any application data on a socket. Therefore, poll event is reconfigured to monitor fifo now.

The reason for the extra monitor for the handshake is, that unless we provide some (presumably overcomplicated) mechanism for mulit-context handshake, we may end up with a deadlock within poll.

Ok, so if it happens once, when TLS socket state changes from "handshake non-complete" to "handshake complete", sounds good.

pfalcon · 2019-02-19T15:23:30Z

subsys/net/lib/sockets/sockets_tls.c

@@ -1707,19 +1719,34 @@ static int ztls_poll_update_ctx(struct net_context *ctx,
 	}

 	if (pfd->events & ZSOCK_POLLIN) {


And what if we polled for POLLOUT?

Then we assume that socket is always writable. POLLOUT is not an issue in this case.

Then we assume that socket is always writable.

But a socket is not always writable. For normal socket, this is well known issue (well, bug) present so far because resolving it would require adding more overhead to socket structures, and which is properly ticketed as #5842.

Based on the response above, I don't see that this issue was even considered/analyzed for TLS sockets, far less in queue for fixing.

pfalcon · 2019-02-19T15:27:01Z

So, layering violation strikes again. I would suggest to separate poll hacks from the rest of the patches.

Then please explain how you think poll() should work on a DTLS socket, only once that is ok, we can look at the code to see whether it does the described or not.

rlubos · 2019-02-20T10:10:56Z

Then please explain how you think poll() should work on a DTLS socket

Nothing excepctional here - it should block until there's application data ready to read, or an error or timeout occurs. And as current implementation had an issue, hence bugfix.

I don't see a reason to split this PR. Commits are well separated, and they all serve a single purpose - make LWM2M sample behave correctly. And submiting partial solutions just for sake of (I'm not even sure for sake of what) is a needless overhead for me.

Errno value is only significant when `recvfrom` function indicated an error (by returning -1). We should not depend on it's value if no error is notified. As the return value of `recvfrom` is already checked, misused errno verification can simply be removed. Signed-off-by: Robert Lubos <[email protected]>

System workqueue stack was not large enough to handle DTLS handshake, which lead to system crash. Signed-off-by: Robert Lubos <[email protected]>

Instead of simple bool value, use a semaphore to notify that TLS handshake is complete. This way, we can monitor this value with k_poll. Signed-off-by: Robert Lubos <[email protected]>

When DTLS client was added to `poll` before/during the handshake, it could throw errors and in some circumstances (when polling thread was cooperative and had higher or equal priority to the handshake thread) could lead to a deadlock in the application. Prevent that, by blocking on handshake semaphore instead of fifo. Poll will start using fifo for data poll only after handshake is complete. Signed-off-by: Robert Lubos <[email protected]>

pfalcon · 2019-02-20T10:19:00Z

and they all serve a single purpose - make LWM2M sample behave correctly. And submiting partial solutions just for sake of (I'm not even sure for sake of what)

So, to clarify, the single purpose should be "implement poll() correctly", not just hack it in a way to make one sample work, but possibly leaving even weirder corner cases for other samples and apps.

The purpose of split is usually to merge obvious parts soon, and focus on non-obvious parts without extra noise. If you don't want to split, sounds good, let's make each commit here wait until the last issue here is clarified.

rlubos · 2019-02-20T10:29:15Z

possibly leaving even weirder corner cases for other samples and apps

Could you please specify what other cases you mean? We've observed poll misbehaving, a bug in the implementation, and here is a fix for that specific misbehavement. Not a HACK (I really don't like this attitude) but a piece of code I find a proper fix for this problem. The fact it was observed in LWM2M sample does not automatically make it specific to that sample.

pfalcon · 2019-02-20T10:48:43Z

Could you please specify what other cases you mean?

I will, please give me some time.

We've observed poll misbehaving, a bug in the implementation, and here is a fix for that specific misbehavement.

+1 for that, but I'm not sure that poll() still works correctly for TLS things. If we don't take cases like this to try to improve it overall, then when we'll get to it. (It's not that poll() just landed, so "it's good that it works at all". We're having Nth iteration on it, and would be nice to plan for them to converge.)

(And yes, non-TLS poll() has its issues too, it's just non-TLS and TLS cases are two separate ones, as TLS one would normally be implemented on application/library, not stack, side.)

mike-scott · 2019-02-21T20:17:25Z

@pfalcon I have to agree with @rlubos on this. Unless you are planning a much larger change to the way poll() works for TLS -- and I'm not sure that's going to happen this late in the 1.14 LTS cycle -- then this patch seems to be a reasonable fix for a bug in the way the TLS handshake was affecting a pre-existing poll().

Of course, if there is an actual problem with the bugfix then we should re-examine.

Are you nervous that this implementation handles a bit of UDP vs. TCP magic? Even after reading your discussion, I'm not sure what your main concern is with this patch. It's just a bugfix.

pfalcon · 2019-02-21T20:43:47Z

subsys/net/lib/sockets/sockets_tls.c

@@ -1376,7 +1384,7 @@ static ssize_t sendto_dtls_client(struct net_context *ctx, const void *buf,
 		}
 	}

-	if (!ctx->tls->tls_established) {
+	if (!is_handshake_complete(ctx)) {
 		/* TODO For simplicity, TLS handshake blocks the socket even for


Ok, so it's actually documented that non-blocking TLS sockets aren't really non-blocking.

pfalcon · 2019-02-21T20:47:31Z

subsys/net/lib/sockets/sockets_tls.c

+
+again:
+	(*pev)++;
+	errno = EAGAIN;


As was mentioned previously, it's quite confusing to see errno assignment here. errno should be reserved for communication between system and userspace application. Part of system would rather communicate with -EAGAIN.

pfalcon · 2019-02-21T20:52:47Z

Of course, if there is an actual problem with the bugfix then we should re-examine.

So, the problem is that nobody knows how many more problems are there. It would seem that discovery of one issue should prompt to pay more attention to those matters, but apparently nobody is interested in that.

Ok, I'm withdrawing my -1.

Dismissing to not block quick fixes.

pfalcon

Approving from my side too, so this isn't left for next week.

mike-scott · 2019-02-22T20:19:05Z

@pfalcon I appreciate you removing your -1 to allow this to move forward. Can we open an issue to keep the spirit of your thoughts on poll() for TLS use case? Something to document your comments above so they are not lost?

pfalcon · 2019-02-22T20:29:09Z

Can we open an issue to keep the spirit of your thoughts on poll() for TLS use case?

That's generally what I tried to drive too. I'd leave that to @rlubos though. My original concern was "hmm, I don't see all the advanced code required to handle poll-like functionality on non-blocking TLS socket". But a code comment reminded me that (according to it) there isn't really truly on-blocking TLS support. But again, I'd leave that to @rlubos at his convenience (I'd think that preparing for a release would be a good time to revisit, or at least ticket such cases, but clearly everyone has a lot on each own plate already).

rlubos added bug The issue is a bug, or the PR is fixing a bug area: Networking labels Feb 18, 2019

rlubos requested a review from mike-scott February 18, 2019 14:23

rlubos requested review from jukkar, pfalcon and tbursztyka as code owners February 18, 2019 14:23

jukkar approved these changes Feb 19, 2019

View reviewed changes

mike-scott approved these changes Feb 19, 2019

View reviewed changes

pfalcon reviewed Feb 19, 2019

View reviewed changes

pfalcon previously requested changes Feb 19, 2019

View reviewed changes

rlubos added 4 commits February 20, 2019 11:11

samples: net: lwm2m: Fix system workqueue stack size

1bd2de5

System workqueue stack was not large enough to handle DTLS handshake, which lead to system crash. Signed-off-by: Robert Lubos <[email protected]>

net: sockets: tls: Use semaphore to notify that handshake is complete

7b39d79

Instead of simple bool value, use a semaphore to notify that TLS handshake is complete. This way, we can monitor this value with k_poll. Signed-off-by: Robert Lubos <[email protected]>

rlubos force-pushed the fix-dtls-lwm2m-issue branch from 50ab5ca to 93d0a76 Compare February 20, 2019 10:14

mike-scott assigned jukkar Feb 21, 2019

pfalcon reviewed Feb 21, 2019

View reviewed changes

pfalcon approved these changes Feb 22, 2019

View reviewed changes

nashif added this to the v1.14.0 milestone Feb 23, 2019

nashif merged commit f1920ff into zephyrproject-rtos:master Feb 23, 2019

rlubos deleted the fix-dtls-lwm2m-issue branch October 8, 2019 08:48

		@@ -1707,19 +1719,34 @@ static int ztls_poll_update_ctx(struct net_context *ctx,
		}

		if (pfd->events & ZSOCK_POLLIN) {

samples: net: lwm2m: Fix DTLS related issues #13477

samples: net: lwm2m: Fix DTLS related issues #13477

Uh oh!

Conversation

rlubos commented Feb 18, 2019

Uh oh!

codecov-io commented Feb 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jukkar left a comment

Choose a reason for hiding this comment

Uh oh!

mike-scott left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pfalcon left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pfalcon commented Feb 19, 2019

Uh oh!

rlubos commented Feb 20, 2019

Uh oh!

pfalcon commented Feb 20, 2019

Uh oh!

rlubos commented Feb 20, 2019

Uh oh!

pfalcon commented Feb 20, 2019

Uh oh!

mike-scott commented Feb 21, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pfalcon commented Feb 21, 2019

Uh oh!

pfalcon left a comment

Choose a reason for hiding this comment

Uh oh!

mike-scott commented Feb 22, 2019

Uh oh!

pfalcon commented Feb 22, 2019

Uh oh!

Uh oh!

codecov-io commented Feb 18, 2019 •

edited

Loading