Skip to content

samples: net: lwm2m: Fix DTLS related issues #13477

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 23, 2019

Conversation

rlubos
Copy link
Collaborator

@rlubos rlubos commented Feb 18, 2019

Fix LWM2M and DTLS issues observed in the LWM2M sample.

Fixes #13380

Signed-off-by: Robert Lubos [email protected]

@rlubos rlubos added bug The issue is a bug, or the PR is fixing a bug area: Networking labels Feb 18, 2019
@rlubos rlubos requested a review from mike-scott February 18, 2019 14:23
@codecov-io
Copy link

codecov-io commented Feb 18, 2019

Codecov Report

Merging #13477 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #13477   +/-   ##
=======================================
  Coverage   52.46%   52.46%           
=======================================
  Files         322      322           
  Lines       46597    46597           
  Branches    10769    10769           
=======================================
  Hits        24446    24446           
  Misses      17246    17246           
  Partials     4905     4905

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 882702e...93d0a76. Read the comment docs.

Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@mike-scott mike-scott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested good for me. Thanks for the fix @rlubos

@@ -3929,11 +3929,6 @@ static void socket_receive_loop(void)
len = recvfrom(sock_ctx[i]->sock_fd, in_buf,
sizeof(in_buf) - 1, 0,
&from_addr, &from_addr_len);
if (errno) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would explicitly mention in the commit message that there's already check in the code for len < 0 case.

Copy link
Collaborator

@pfalcon pfalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, layering violation strikes again. I would suggest to separate poll hacks from the rest of the patches. The latter can go in immediately, while poll hacks need to scrutinized.

(*pev)->mode = K_POLL_MODE_NOTIFY_ONLY;
(*pev)->state = K_POLL_STATE_NOT_READY;

goto again;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain what happens here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handshake is complete, still we don't have any application data on a socket. Therefore, poll event is reconfigured to monitor fifo now.

The reason for the extra monitor for the handshake is, that unless we provide some (presumably overcomplicated) mechanism for mulit-context handshake, we may end up with a deadlock within poll.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so if it happens once, when TLS socket state changes from "handshake non-complete" to "handshake complete", sounds good.

@@ -1707,19 +1719,34 @@ static int ztls_poll_update_ctx(struct net_context *ctx,
}

if (pfd->events & ZSOCK_POLLIN) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if we polled for POLLOUT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we assume that socket is always writable. POLLOUT is not an issue in this case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we assume that socket is always writable.

But a socket is not always writable. For normal socket, this is well known issue (well, bug) present so far because resolving it would require adding more overhead to socket structures, and which is properly ticketed as #5842.

Based on the response above, I don't see that this issue was even considered/analyzed for TLS sockets, far less in queue for fixing.

@pfalcon
Copy link
Collaborator

pfalcon commented Feb 19, 2019

So, layering violation strikes again. I would suggest to separate poll hacks from the rest of the patches.

Then please explain how you think poll() should work on a DTLS socket, only once that is ok, we can look at the code to see whether it does the described or not.

@rlubos
Copy link
Collaborator Author

rlubos commented Feb 20, 2019

Then please explain how you think poll() should work on a DTLS socket

Nothing excepctional here - it should block until there's application data ready to read, or an error or timeout occurs. And as current implementation had an issue, hence bugfix.

I don't see a reason to split this PR. Commits are well separated, and they all serve a single purpose - make LWM2M sample behave correctly. And submiting partial solutions just for sake of (I'm not even sure for sake of what) is a needless overhead for me.

Errno value is only significant when `recvfrom` function indicated an
error (by returning -1). We should not depend on it's value if no error
is notified.

As the return value of `recvfrom` is already checked, misused errno
verification can simply be removed.

Signed-off-by: Robert Lubos <[email protected]>
System workqueue stack was not large enough to handle DTLS handshake,
which lead to system crash.

Signed-off-by: Robert Lubos <[email protected]>
Instead of simple bool value, use a semaphore to notify that TLS
handshake is complete. This way, we can monitor this value with k_poll.

Signed-off-by: Robert Lubos <[email protected]>
When DTLS client was added to `poll` before/during the handshake, it
could throw errors and in some circumstances (when polling thread was
cooperative and had higher or equal priority to the handshake thread)
could lead to a deadlock in the application.

Prevent that, by blocking on handshake semaphore instead of fifo. Poll
will start using fifo for data poll only after handshake is complete.

Signed-off-by: Robert Lubos <[email protected]>
@rlubos rlubos force-pushed the fix-dtls-lwm2m-issue branch from 50ab5ca to 93d0a76 Compare February 20, 2019 10:14
@pfalcon
Copy link
Collaborator

pfalcon commented Feb 20, 2019

and they all serve a single purpose - make LWM2M sample behave correctly. And submiting partial solutions just for sake of (I'm not even sure for sake of what)

So, to clarify, the single purpose should be "implement poll() correctly", not just hack it in a way to make one sample work, but possibly leaving even weirder corner cases for other samples and apps.

The purpose of split is usually to merge obvious parts soon, and focus on non-obvious parts without extra noise. If you don't want to split, sounds good, let's make each commit here wait until the last issue here is clarified.

@rlubos
Copy link
Collaborator Author

rlubos commented Feb 20, 2019

possibly leaving even weirder corner cases for other samples and apps

Could you please specify what other cases you mean? We've observed poll misbehaving, a bug in the implementation, and here is a fix for that specific misbehavement. Not a HACK (I really don't like this attitude) but a piece of code I find a proper fix for this problem. The fact it was observed in LWM2M sample does not automatically make it specific to that sample.

@pfalcon
Copy link
Collaborator

pfalcon commented Feb 20, 2019

Could you please specify what other cases you mean?

I will, please give me some time.

We've observed poll misbehaving, a bug in the implementation, and here is a fix for that specific misbehavement.

+1 for that, but I'm not sure that poll() still works correctly for TLS things. If we don't take cases like this to try to improve it overall, then when we'll get to it. (It's not that poll() just landed, so "it's good that it works at all". We're having Nth iteration on it, and would be nice to plan for them to converge.)

(And yes, non-TLS poll() has its issues too, it's just non-TLS and TLS cases are two separate ones, as TLS one would normally be implemented on application/library, not stack, side.)

@mike-scott
Copy link
Contributor

@pfalcon I have to agree with @rlubos on this. Unless you are planning a much larger change to the way poll() works for TLS -- and I'm not sure that's going to happen this late in the 1.14 LTS cycle -- then this patch seems to be a reasonable fix for a bug in the way the TLS handshake was affecting a pre-existing poll().

Of course, if there is an actual problem with the bugfix then we should re-examine.

Are you nervous that this implementation handles a bit of UDP vs. TCP magic? Even after reading your discussion, I'm not sure what your main concern is with this patch. It's just a bugfix.

@@ -1376,7 +1384,7 @@ static ssize_t sendto_dtls_client(struct net_context *ctx, const void *buf,
}
}

if (!ctx->tls->tls_established) {
if (!is_handshake_complete(ctx)) {
/* TODO For simplicity, TLS handshake blocks the socket even for
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so it's actually documented that non-blocking TLS sockets aren't really non-blocking.


again:
(*pev)++;
errno = EAGAIN;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As was mentioned previously, it's quite confusing to see errno assignment here. errno should be reserved for communication between system and userspace application. Part of system would rather communicate with -EAGAIN.

@pfalcon
Copy link
Collaborator

pfalcon commented Feb 21, 2019

Of course, if there is an actual problem with the bugfix then we should re-examine.

So, the problem is that nobody knows how many more problems are there. It would seem that discovery of one issue should prompt to pay more attention to those matters, but apparently nobody is interested in that.

Ok, I'm withdrawing my -1.

@pfalcon pfalcon dismissed their stale review February 21, 2019 20:53

Dismissing to not block quick fixes.

Copy link
Collaborator

@pfalcon pfalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving from my side too, so this isn't left for next week.

@mike-scott
Copy link
Contributor

@pfalcon I appreciate you removing your -1 to allow this to move forward. Can we open an issue to keep the spirit of your thoughts on poll() for TLS use case? Something to document your comments above so they are not lost?

@pfalcon
Copy link
Collaborator

pfalcon commented Feb 22, 2019

Can we open an issue to keep the spirit of your thoughts on poll() for TLS use case?

That's generally what I tried to drive too. I'd leave that to @rlubos though. My original concern was "hmm, I don't see all the advanced code required to handle poll-like functionality on non-blocking TLS socket". But a code comment reminded me that (according to it) there isn't really truly on-blocking TLS support. But again, I'd leave that to @rlubos at his convenience (I'd think that preparing for a release would be a good time to revisit, or at least ticket such cases, but clearly everyone has a lot on each own plate already).

@nashif nashif added this to the v1.14.0 milestone Feb 23, 2019
@nashif nashif merged commit f1920ff into zephyrproject-rtos:master Feb 23, 2019
@rlubos rlubos deleted the fix-dtls-lwm2m-issue branch October 8, 2019 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants