Skip to content

net: shell: dns: Properly manage lifetime of callback data #13260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 12, 2019

Conversation

pfalcon
Copy link
Collaborator

@pfalcon pfalcon commented Feb 11, 2019

DNS callback needs "struct shell *shell" data structure to pass as
a parameter to shell print. How it was achieved previously is that
it was packaged together with cosmetic "bool first" param into
"struct net_shell_user_data" on the stack, and passed to the
callback. The problem was that the original command handler then
returned, so the "struct net_shell_user_data" on the stack was
overwritten, and the callback crashed on accessing it.

An obvious solution was to make that structure static, but that would
leave to issues still, as turns out we allow system shell to be run
as more than one concurrent instances.

Next solution was to keep this structure on the stack, but block the
command handler until callback is finished. However, that hit a
deadlock due to not well thought out use of a mutex in the shell
printing routines.

The solution presented here is due to @nordic-krch, who noticed that
"bool first" param is indeed cosmetic and not really required. Then
we have only "struct shell *shell" to pass to the callback, and can
do that in callback's pointer param directly, ditching
"struct net_shell_user_data" which needs to be stored on the stack.

Signed-off-by: Paul Sokolovsky [email protected]

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 11, 2019

@agross-linaro: Let's start simple. I would like to say right away that we may still arrive at your solution (using getaddrinfo), but the whole idea to trace the path thru the issues, so we (not even just me) understand the matter and how to avoid similar cases - not just fix one case. So, please bare with us as we trace thru this.

So, this is the minimal changes required to fix any "net dns query &lthost>" crashes for me. In particular, I don't see anything special in the behavior of net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A (as the original repro case) from e.g. net dns query google.com.

Elsewhere you wrote that with an alternative patch, similar in that nature to this one, you still experienced some issues, and that's what prompted you to go the getaddrinfo way. Can you elaborate on that?

Now I tested this only with qemu_x86. Proceeding to test with frdm_k64f, I'm blocked by #13251, #13250 . I may imagine that the real meat of the issue shows itself when the shell is accessed via telnet, and I yet will need to look into that. So, this is the very beginning of the investigation, and not the end, and again, I don't claim that 2-line patch resolves all the issues, just that it addresses some.

@pfalcon pfalcon added the area: Shell Shell subsystem label Feb 11, 2019
@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 11, 2019

Not that dead networking allows to keyboard-fuzz this better (working resolution happens too fast otherwise). For example, what happens if we issue new "net dns query" while previous is still active? Can't crash the stupid thing, it seems:

uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Query for 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' sent.

dns: Timeout while resolving name.
uart:~$ 
uart:~$ 
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Query for 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' sent.
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-11)
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-11)
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-11)

dns: Timeout while resolving name.
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Query for 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' sent.
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-11)
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-11)
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-11)

dns: Timeout while resolving name.
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-2)
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-2)
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-2)
uart:~$ net dns query AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A
IPv4 address type
Cannot resolve 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' (-2)

@codecov-io
Copy link

codecov-io commented Feb 11, 2019

Codecov Report

Merging #13260 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13260      +/-   ##
==========================================
+ Coverage   48.76%   48.76%   +<.01%     
==========================================
  Files         319      319              
  Lines       46864    46864              
  Branches    10832    10832              
==========================================
+ Hits        22853    22854       +1     
  Misses      19381    19381              
+ Partials     4630     4629       -1
Impacted Files Coverage Δ
arch/posix/core/posix_core.c 91.91% <0%> (+1.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b44e06e...c55722b. Read the comment docs.

@agross-oss
Copy link
Collaborator

I am concerned about the use of statics. Multiple shells could cause issue.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 11, 2019

I am concerned about the use of statics.

This is supposed to be change "in the right direction" (e.g., by assessment of effects, e.g., previously there was an obviously reproducible crash, while with this change, there's no such), but not necessarily final/ultimate fix.

Multiple shells could cause issue.

@jarz-nordic: Could you comment on this? I guess it would make sense to draw a line regarding what to call "multiple shells". I would draw it like that: is it possible for an end user, using officially available and supported Zephyr configuration, to enable 2 concurrent and independent instances of the system shell, e.g. on 2 different I/O devices?

@jakub-uC
Copy link
Collaborator

jakub-uC commented Feb 11, 2019

@pfalcon : In my opinion static has nothing to do in contexts of multiple shells. Either way you will have a problem if two different shell backends will use these structures at the same time. They will modify them separately in other words they will compete for it.

In order to avoid it you need to have dedicated data sets for each shell backend.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 11, 2019

@jarz-nordic: Ok, so now we have this "shell backend" complication. Can you please explain what "shell backend" is? My intuition would be that "shell backend" is an adapter which allows to multiplex the same shell instance over multiple I/O devices. E.g. there may be shell with backend on UART1 and UART2. Then a character typed on UART1 would appear on UART2, and vice-versa. It would not be the case that you could type on UART1 and UART2 independently, and run different concurrent commands on each of them.

Is it something different in reality?

@jakub-uC
Copy link
Collaborator

@pfalcon : Your understanding is not correct. Each shell backend is separate and independent object. All they have in common is a command set. They can call different commands at the same time. Problem is when two objects will call the same command at the same time and this command is using the same data.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 11, 2019

Each shell backend is separate and independent object.

Then they would rather be called "shell instances" and not "shell backends".

They can call different commands at the same time.

I have no idea how we've arrived at such design. Apparently, we have too much memory to spend. That's of course why we have previously well-working samples lock up on stack overflow, and when they don't lock up, they still don't work: #13251 (comment) .

Anyway, no-static fix is coming.

@pfalcon pfalcon added this to the v1.14.0 milestone Feb 11, 2019
@jakub-uC
Copy link
Collaborator

jakub-uC commented Feb 11, 2019

@pfalcon : In most cases it is enough to have only one backend so I do not consider it as a wasto of memory. If you are already using 2 backends usually they are for different things so there should be no conflict.

Regarding: #13251 (comment)

I found 2 examples not working on frdm_k64f today. For one I've started an issue - it is realted to DT: #13243. The other I've described in #13251.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 11, 2019

In most cases it is enough to have only one backend so I do not consider it as a wasto of memory.

Thanks, what's important in the end is that multiple backends do work concurrently, so @agross-linaro's concerns are very relevant. I'm happy that he knows it, and well, that's kind of knowledge which should be shared and well emphasized. I'd say we should review usage of callbacks, global/static data, and async primitives in shell overall, because I find it hard to believe that DNS is the only case of this issue.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 11, 2019

Anyway, no-static fix is coming.

But yeah, sure, with the world model as described by @jarz-nordic, there's no "simple, frugal" solution. We can't have static. Seemingly here it's utilized to just pass useless "first" param down to callback, but nope, the real cause is to pass a "shell" reference, so it can be printed to. Again, in a simple, frugal world that would be worked around using old good printk(). But here we're trying to do it "correctly" for all these layers we've built. Then duplicating what getaddrinfo() is apparently the only way to do it. Sorry, @agross-linaro, I do this not out of stubbornness, I truly believe that shoving some fat, and especially not forcing newlib in, is important for shell, which is pretty bloated itself (all the strings, whatnot). Well, at least, I have little choice - I can't really call out others bringing in all this "bloat" and then giving in and doing the same. (getaddrinfo() is again has very different usecase, and grows more bloated as we speak, myself adding more POSIX compatibility.)

@pfalcon pfalcon changed the title net: shell: dns: Variables shared with callback should be static. WIP: net: shell: dns: Callback context lifetime should obviously include callback's own lifetime, yet can't be static. Feb 12, 2019
@pfalcon pfalcon force-pushed the net-shell-dns-crash branch from 1495a25 to 43cd3cf Compare February 12, 2019 08:56
@zephyrbot
Copy link
Collaborator

zephyrbot commented Feb 12, 2019

All checks are passing now.

Review history of this comment for details about previous failed status.
Note that some checks might have not completed yet.

@@ -1301,22 +1301,20 @@ static void dns_result_cb(enum dns_resolve_status status,
{
struct net_shell_user_data *data = user_data;
const struct shell *shell = data->shell;
bool *first = data->user_data;
struct k_sem *sem = data->user_data;
printk("in cb, status=%d, sem=%p, isr=%d\n", status, sem, k_is_in_isr());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: missing tab

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: missing tab

Sorry if I didn't make it clear, but this PR is a throw-away example showing deadlock in the shell. The outcome of it can be:

  1. You show me what I'm doing wrong.
  2. You reinstate the functionality of shell_command_enter().
  3. We indeed throw this out and go back to shell: net: Switch to synchronous getaddrinfo call #13191

PR_WARNING("\nTimeout while resolving name.\n");
*first = false;
return;
PR_WARNING("Timeout while resolving name.\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be replaced with shell_warn
All macro definitions in this file #PR_... can be removed

@@ -1321,6 +1321,7 @@ void shell_fprintf(const struct shell *shell, enum shell_vt100_color color,
va_list args = { 0 };

if (k_current_get() != shell->ctx->tid) {
printk("shell: hehe, I'm deadlocking!\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is problem.
Current mutex implementation in shell assumes that all the access to transport medium is synchronized by shell.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current mutex implementation in shell assumes that all the access to transport medium is synchronized by shell.

If only that! It assumes that if shell does something from thread no.1, that request to print something from thread no.2 can be blocked forever. Which doesn't make sense, e.g. if thread no.1 itself blocked on thread no.2 completion.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 12, 2019

Ok, I've come full circle here with tracing @agross-linaro's steps. Nothing works. On each step the shell plays against programmer, and of course programmer can only lose. I'm reading to revive #13191 and +1 it, but then we need do ban any callback usage with the shell. All the processing and printing should be synchronous in command handlers only.

@jarz-nordic, this patch is update to show an issue if you'd like to have a look. Note that previously it was possible for that to work, with shell_command_enter(). It was suddenly removed 75ad61f, and I exactly mean suddenly, because the commit message there just ascertains the fact of removal and gives zero explanation for why it happens. The change comes from #12747 titled "Shell: printf available safely from different threads". So, case in this PR demonstrates that the changes there (besides all the good things they bring) the opposite effect: while it was possible before to synchronize different thread as needed, and use shell, now shell tries to complete in synchronization with threads, leading to deadlocks.

@jakub-uC
Copy link
Collaborator

jakub-uC commented Feb 12, 2019

@pfalcon : I had a discussion about shell_print and I was asked to use it without any "freeze" concept.

To avoid the deadlock you can eihter use printk driven by Logger or you can replace it with shell_print. What is problem here?

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 12, 2019

I head a discussion about shell_print and I was asked to use it without any "freeze" concept.

I'm not sure I understand what you mean. From my PoV based on this PR, shell_command_enter() is useful.

To avoid the deadlock you can eihter use printk driven by Logger or you can use replace it with shell_print. What is problem here?

Again, I'm not sure I understand exactly what you mean. But yes, I seem to forget to ask you to try this patch and show what I get with it:

So, the test example here is "samples/net/dns_resolve". As usual, run it for BOARD=qemu_x86. Then:

uart:~$ net dns query foo
Query for 'foo' sent.
in cb, status=-101, sem=0x004023e4, isr=0
shell: hehe, I'm deadlocking!

So, trying to use shell_fprintf() (via shell_error()) leads to deadlock (by apparent attempt to lock the mutex again.)

@jakub-uC
Copy link
Collaborator

@pfalcon : Yes I got it at some point of time it is not review request but you are just showing the problem.
I will take a look today into this.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 12, 2019

Thanks!

@jakub-uC
Copy link
Collaborator

@pfalcon : I am starting analyzing your use case. I am not familiar with this application architecture. Please just let me know if you are using shell print functions in some interrupt?

@jakub-uC
Copy link
Collaborator

@pfalcon : It was investigated and fixed by @nordic-krch . He will explain you details in several minutes.

@nordic-krch
Copy link
Collaborator

@pfalcon, i've looked at dns query command and provided example of how to fix it here: nordic-krch@b91318a

There was deadlock in that code but it was caused by shell misuse. Shell thread is taking mutex when processing data and releases it after command is processed. Same mutex is taken is printing to shell happens from different thread than shell thread.

Here, command was blocking on semaphore (blocking shell thread, keeping mutex taken) and then callback was attempting to take same mutex. Command was probably blocking to prevent others to print to shell but after #12747 it is not needed. After #12747 it is forbidden to block in shell command and attempt to print from another context (shell documentation should be updated to explicitly state that) as that will lead to deadlock.

In the example fix it is quite simplified as semaphore is not needed and shell can be passed directly as user_data.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 12, 2019

Well, that was smart, but it's not that easy. So, tracing thru all the questions posed.

Please just let me know if you are using shell print functions in some interrupt?

No, not from interrupt. From callback. And with Zephyr architecture, callbacks are (almost?) always called from some another thread. It's very hard to imagine situation when callback fires from the same thread.

There was deadlock in that code but it was caused by shell misuse.

No, no, not so simple ;-). There was deadlock in that code definitely, but it's caused by ...

Shell thread is taking mutex when processing data and releases it after command is processed. Same mutex is taken is printing

So, don't use think that there's mix up and misuse of the mutex in the shell, not user's misuse of the shell? Because you see, using the same mutex for both of:

  • Wrap around shell command handler
  • Merely printing to shell

Don't you think that using the same mutex for both is way to coarse-grained and that's what leads to problems?

Command was probably blocking to prevent others to print to shell

No, command was blocking to not let callback's context on the stack to go away.

After #12747 it is forbidden to block in shell command and attempt to print from another context

That doesn't make any sense! In this PR it's very valid usecase to do that: shell command handler effectively hands over (borrows) shell context to a callback (and all callbacks run in another thread(s)). It even makes sure that there's mutual exclusion between the command handler and a callback - when one prints, another doesn't and vice-versa.

shell documentation should be updated to explicitly state that

No amount of documentation will help, if the subsystem behavior is contra-logical. I am the command handler. When I'm called, I own the shell structure. If I want, I pass that shell structure to someone else for temporary ownership. When I do, I make sure that I don't use it while it's owned bu someone else. Please stay off my way while I'm doing that.

In the example fix it is quite simplified as [...] shell can be passed directly as user_data.

That was smart. That's the solution I was looking for yesterday night when I bravely wrote to @agross-linaro that "we can still do it different way". I just got fooled by that stupid "bool first". While I removed it eventually, I overlooked that we can ditch struct net_shell_user_data completely, thus not store anything on stack, thus return from handler safely without any static's, while callback is still running.

So, it's smart, but it's hack. There's nothing wrong to pass any additional data into callback. As there can be multiple shell instances (which for some reason called "backends"), we can't use static vars. Then the additional data needs to be stored on stack, and the command handler needs to block for lifetime of this additional data (can't just return, or it'll go off the stack). So, @jarz-nordic, @nordic-krch, guys, we didn't finish here yet. This issue will happen again. Unfortunately, it will happen rare enough, so it'll take long time to raise enough level of alert. And each time it will be painful debugging. Because it totally makes no sense to allow callbacks without params to work with shell, but deadlock on callbacks which take params.

@pfalcon pfalcon force-pushed the net-shell-dns-crash branch from 43cd3cf to 64e89ed Compare February 12, 2019 12:23
@pfalcon pfalcon changed the title WIP: net: shell: dns: Callback context lifetime should obviously include callback's own lifetime, yet can't be static. net: shell: dns: Properly manage lifetime of callback data Feb 12, 2019
@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 12, 2019

Ok, the latest patch is based on @nordic-krch's solution (who is credited in the description), plus some cosmetic changes, as the callback is back to run (and print) asynchronously.

As I said above, while this particular issue now (hopefully!) should be fixed, the whole matter of using shell with callbacks is very tangled, and it doesn't need to be.

@jakub-uC
Copy link
Collaborator

jakub-uC commented Feb 12, 2019

First:
@pfalcon , @agross-linaro: I will sprinkle my head with ashes on static. In git diff I missed that data is stored on the stack. In this case of course adding static is a not a good idea. My obvious fault.

Regading your concer:

  • Wrap around shell command handler
  • Merely printing to shell

Don't you think that using the same mutex for both is way to coarse-grained and that's what leads to problems?

With regard to this observation, I believe that this could be improved and we will find a compromise.
The solution may be to release the mutex not before leaving the shell thread but before calling the command handler. It will require a bit of update in the shell but it is probably a good way to solve the problem.

Frankly speaking, I did not consider such an application as you presented hence we are where we are...

user_data.user_data = &first;

ret = dns_get_addr_info(host, qtype, NULL, dns_result_cb, &user_data,
ret = dns_get_addr_info(host, qtype, NULL, dns_result_cb, shell,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've seen warning passing argument 5 of ‘dns_get_addr_info’ discards ‘const’ qualifier from pointer and had to cast shell to (void *)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to cast shell to (void *)

Ack, thanks, added it.

DNS callback needs "struct shell *shell" data structure to pass as
a parameter to shell print. How it was achieved previously is that
it was packaged together with cosmetic "bool first" param into
"struct net_shell_user_data" on the stack, and passed to the
callback. The problem was that the original command handler then
returned, so the "struct net_shell_user_data" on the stack was
overwritten, and the callback crashed on accessing it.

An obvious solution was to make that structure static, but that would
leave to issues still, as turns out we allow system shell to be run
as more than one concurrent instances.

Next solution was to keep this structure on the stack, but block the
command handler until callback is finished. However, that hit a
deadlock due to not well thought out use of a mutex in the shell
printing routines.

The solution presented here is due to @nordic-krch, who noticed that
"bool first" param is indeed cosmetic and not really required. Then
we have only "struct shell *shell" to pass to the callback, and can
do that in callback's pointer param directly, ditching
"struct net_shell_user_data" which needs to be stored on the stack.

Signed-off-by: Paul Sokolovsky <[email protected]>
@pfalcon pfalcon force-pushed the net-shell-dns-crash branch from 64e89ed to c55722b Compare February 12, 2019 13:31
@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 12, 2019

@jarz-nordic

I will sprinkle my head with ashes on static. In git diff I missed that data is stored on the stack. In this case of course adding static is a not a good idea. My obvious fault.

But I don't think it was your fault, because it was me who added "static" in the very first version of this patch, then promptly corrected by you in #13260 (comment) .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants