
sockets: ordering of send() vs. poll() when using socket API + DTLS causes a crash #13380


Closed
mike-scott opened this issue Feb 13, 2019 · 3 comments
Assignees
Labels
area: Networking, area: Sockets (Networking sockets), bug (The issue is a bug, or the PR is fixing a bug), priority: medium (Medium impact/importance bug)
Milestone

Comments

@mike-scott (Contributor)

Describe the bug
Running the current in-tree LwM2M client sample (samples/net/lwm2m_client) with overlay-dtls.conf results in a crash, due to how the TLS handshake is implemented in the socket APIs.

For UDP, the TLS handshake is initiated during the first send(); for TCP it is started at connect(). In the LwM2M engine the socket is added to a poll() loop right after connect(), and this seems to be what causes the crash. If I re-order things so the socket is added to the poll() loop only after the first send(), the crash goes away (such as https://github.com/foundriesio/zephyr/commit/9004322201131cf873e00611a9ddd2e3f38088e9)

To Reproduce
Steps to reproduce the behavior:

  1. (in another window) setup net-tools for qemu
  2. (in another window) start local Leshan (LwM2M server)
  3. cd samples/net/lwm2m_client
  4. mkdir build; cd build
  5. cmake -DBOARD=qemu_x86 -DOVERLAY_CONFIG=overlay-dtls.conf ..
  6. make run

Expected behavior
It should be possible to add the socket to the poll() loop prior to send() without crashing the LwM2M engine.

Impact
This behavior seems incorrect at the moment. It adds needless complexity if developers have to know not to poll() on a UDP socket until after the first send().

Environment (please complete the following information):

  • OS: Arch Linux
  • Zephyr SDK
  • Master branch @ d2886ab
@mike-scott mike-scott added the bug The issue is a bug, or the PR is fixing a bug label Feb 13, 2019
@mike-scott mike-scott added this to the v1.14.0 milestone Feb 13, 2019
@rlubos (Collaborator)

rlubos commented Feb 14, 2019

Thanks for reporting, I'll have a look.

@rlubos (Collaborator)

rlubos commented Feb 18, 2019

@mike-scott Ok, so I've succeeded in establishing a local setup for LwM2M and have been able to reproduce the problem described. I've identified 3 issues that prevented the sample from behaving nicely:

  1. The system workqueue stack was too small for the DTLS handshake to complete (that's the actual cause of the application crash).
  2. lwm2m_engine had an incorrect errno check; because of this I observed some spurious retransmissions.
  3. TLS poll indeed did not handle DTLS client sockets very well; it produced false error reports when used during the handshake.

Fixing 1 and 2 stabilized the sample; the only undesired behavior remaining was the error from poll during the handshake. My first approach was to simply ignore ENOTCONN when a DTLS client socket was monitored, but that produced a deadlock in the application: poll was busy-looping its thread, and since that thread was cooperative and had the same priority as the thread running the handshake, the application stalled.

I'll post a PR soon with a different solution that uses a semaphore to block in poll when a DTLS client socket is added. poll will block on the semaphore until the handshake is done, and once it completes, it will switch to using the receive fifo instead.

@mike-scott (Contributor, Author)

@rlubos Thank you for the PR! I will test and review shortly. I have also noticed that the system work queue was too small for LwM2M in other situations as well.

@nashif nashif added the priority: medium Medium impact/importance bug label Feb 19, 2019
@pfalcon pfalcon added the area: Sockets Networking sockets label Feb 19, 2019

4 participants