-
Notifications
You must be signed in to change notification settings - Fork 7.5k
net: shell: dns: Properly manage lifetime of callback data #13260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@agross-linaro: Let's start simple. I would like to say right away that we may still arrive at your solution (using getaddrinfo), but the whole idea to trace the path thru the issues, so we (not even just me) understand the matter and how to avoid similar cases - not just fix one case. So, please bare with us as we trace thru this. So, this is the minimal changes required to fix any "net dns query <host>" crashes for me. In particular, I don't see anything special in the behavior of Elsewhere you wrote that with an alternative patch, similar in that nature to this one, you still experienced some issues, and that's what prompted you to go the getaddrinfo way. Can you elaborate on that? Now I tested this only with qemu_x86. Proceeding to test with frdm_k64f, I'm blocked by #13251, #13250 . I may imagine that the real meat of the issue shows itself when the shell is accessed via telnet, and I yet will need to look into that. So, this is the very beginning of the investigation, and not the end, and again, I don't claim that 2-line patch resolves all the issues, just that it addresses some. |
Not that dead networking allows to keyboard-fuzz this better (working resolution happens too fast otherwise). For example, what happens if we issue new "net dns query" while previous is still active? Can't crash the stupid thing, it seems:
|
Codecov Report
@@ Coverage Diff @@
## master #13260 +/- ##
==========================================
+ Coverage 48.76% 48.76% +<.01%
==========================================
Files 319 319
Lines 46864 46864
Branches 10832 10832
==========================================
+ Hits 22853 22854 +1
Misses 19381 19381
+ Partials 4630 4629 -1
Continue to review full report at Codecov.
|
I am concerned about the use of statics. Multiple shells could cause issue. |
This is supposed to be change "in the right direction" (e.g., by assessment of effects, e.g., previously there was an obviously reproducible crash, while with this change, there's no such), but not necessarily final/ultimate fix.
@jarz-nordic: Could you comment on this? I guess it would make sense to draw a line regarding what to call "multiple shells". I would draw it like that: is it possible for an end user, using officially available and supported Zephyr configuration, to enable 2 concurrent and independent instances of the system shell, e.g. on 2 different I/O devices? |
@pfalcon : In my opinion In order to avoid it you need to have dedicated data sets for each shell backend. |
@jarz-nordic: Ok, so now we have this "shell backend" complication. Can you please explain what "shell backend" is? My intuition would be that "shell backend" is an adapter which allows to multiplex the same shell instance over multiple I/O devices. E.g. there may be shell with backend on UART1 and UART2. Then a character typed on UART1 would appear on UART2, and vice-versa. It would not be the case that you could type on UART1 and UART2 independently, and run different concurrent commands on each of them. Is it something different in reality? |
@pfalcon : Your understanding is not correct. Each shell backend is separate and independent object. All they have in common is a command set. They can call different commands at the same time. Problem is when two objects will call the same command at the same time and this command is using the same data. |
Then they would rather be called "shell instances" and not "shell backends".
I have no idea how we've arrived at such design. Apparently, we have too much memory to spend. That's of course why we have previously well-working samples lock up on stack overflow, and when they don't lock up, they still don't work: #13251 (comment) . Anyway, no-static fix is coming. |
@pfalcon : In most cases it is enough to have only one backend so I do not consider it as a wasto of memory. If you are already using 2 backends usually they are for different things so there should be no conflict. Regarding: #13251 (comment) I found 2 examples not working on frdm_k64f today. For one I've started an issue - it is realted to DT: #13243. The other I've described in #13251. |
Thanks, what's important in the end is that multiple backends do work concurrently, so @agross-linaro's concerns are very relevant. I'm happy that he knows it, and well, that's kind of knowledge which should be shared and well emphasized. I'd say we should review usage of callbacks, global/static data, and async primitives in shell overall, because I find it hard to believe that DNS is the only case of this issue. |
But yeah, sure, with the world model as described by @jarz-nordic, there's no "simple, frugal" solution. We can't have static. Seemingly here it's utilized to just pass useless "first" param down to callback, but nope, the real cause is to pass a "shell" reference, so it can be printed to. Again, in a simple, frugal world that would be worked around using old good printk(). But here we're trying to do it "correctly" for all these layers we've built. Then duplicating what getaddrinfo() is apparently the only way to do it. Sorry, @agross-linaro, I do this not out of stubbornness, I truly believe that shoving some fat, and especially not forcing newlib in, is important for shell, which is pretty bloated itself (all the strings, whatnot). Well, at least, I have little choice - I can't really call out others bringing in all this "bloat" and then giving in and doing the same. (getaddrinfo() is again has very different usecase, and grows more bloated as we speak, myself adding more POSIX compatibility.) |
1495a25
to
43cd3cf
Compare
All checks are passing now. Review history of this comment for details about previous failed status. |
subsys/net/ip/net_shell.c
Outdated
@@ -1301,22 +1301,20 @@ static void dns_result_cb(enum dns_resolve_status status, | |||
{ | |||
struct net_shell_user_data *data = user_data; | |||
const struct shell *shell = data->shell; | |||
bool *first = data->user_data; | |||
struct k_sem *sem = data->user_data; | |||
printk("in cb, status=%d, sem=%p, isr=%d\n", status, sem, k_is_in_isr()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: missing tab
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: missing tab
Sorry if I didn't make it clear, but this PR is a throw-away example showing deadlock in the shell. The outcome of it can be:
- You show me what I'm doing wrong.
- You reinstate the functionality of shell_command_enter().
- We indeed throw this out and go back to shell: net: Switch to synchronous getaddrinfo call #13191
subsys/net/ip/net_shell.c
Outdated
PR_WARNING("\nTimeout while resolving name.\n"); | ||
*first = false; | ||
return; | ||
PR_WARNING("Timeout while resolving name.\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be replaced with shell_warn
All macro definitions in this file #PR_... can be removed
subsys/shell/shell.c
Outdated
@@ -1321,6 +1321,7 @@ void shell_fprintf(const struct shell *shell, enum shell_vt100_color color, | |||
va_list args = { 0 }; | |||
|
|||
if (k_current_get() != shell->ctx->tid) { | |||
printk("shell: hehe, I'm deadlocking!\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is problem.
Current mutex implementation in shell assumes that all the access to transport medium is synchronized by shell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current mutex implementation in shell assumes that all the access to transport medium is synchronized by shell.
If only that! It assumes that if shell does something from thread no.1, that request to print something from thread no.2 can be blocked forever. Which doesn't make sense, e.g. if thread no.1 itself blocked on thread no.2 completion.
Ok, I've come full circle here with tracing @agross-linaro's steps. Nothing works. On each step the shell plays against programmer, and of course programmer can only lose. I'm reading to revive #13191 and +1 it, but then we need do ban any callback usage with the shell. All the processing and printing should be synchronous in command handlers only. @jarz-nordic, this patch is update to show an issue if you'd like to have a look. Note that previously it was possible for that to work, with shell_command_enter(). It was suddenly removed 75ad61f, and I exactly mean suddenly, because the commit message there just ascertains the fact of removal and gives zero explanation for why it happens. The change comes from #12747 titled "Shell: printf available safely from different threads". So, case in this PR demonstrates that the changes there (besides all the good things they bring) the opposite effect: while it was possible before to synchronize different thread as needed, and use shell, now shell tries to complete in synchronization with threads, leading to deadlocks. |
@pfalcon : I had a discussion about To avoid the deadlock you can eihter use printk driven by Logger or you can replace it with |
I'm not sure I understand what you mean. From my PoV based on this PR, shell_command_enter() is useful.
Again, I'm not sure I understand exactly what you mean. But yes, I seem to forget to ask you to try this patch and show what I get with it: So, the test example here is "samples/net/dns_resolve". As usual, run it for BOARD=qemu_x86. Then:
So, trying to use shell_fprintf() (via shell_error()) leads to deadlock (by apparent attempt to lock the mutex again.) |
@pfalcon : Yes I got it at some point of time it is not review request but you are just showing the problem. |
Thanks! |
@pfalcon : I am starting analyzing your use case. I am not familiar with this application architecture. Please just let me know if you are using shell print functions in some interrupt? |
@pfalcon : It was investigated and fixed by @nordic-krch . He will explain you details in several minutes. |
@pfalcon, i've looked at dns query command and provided example of how to fix it here: nordic-krch@b91318a There was deadlock in that code but it was caused by shell misuse. Shell thread is taking mutex when processing data and releases it after command is processed. Same mutex is taken is printing to shell happens from different thread than shell thread. Here, command was blocking on semaphore (blocking shell thread, keeping mutex taken) and then callback was attempting to take same mutex. Command was probably blocking to prevent others to print to shell but after #12747 it is not needed. After #12747 it is forbidden to block in shell command and attempt to print from another context (shell documentation should be updated to explicitly state that) as that will lead to deadlock. In the example fix it is quite simplified as semaphore is not needed and shell can be passed directly as user_data. |
Well, that was smart, but it's not that easy. So, tracing thru all the questions posed.
No, not from interrupt. From callback. And with Zephyr architecture, callbacks are (almost?) always called from some another thread. It's very hard to imagine situation when callback fires from the same thread.
No, no, not so simple ;-). There was deadlock in that code definitely, but it's caused by ...
So, don't use think that there's mix up and misuse of the mutex in the shell, not user's misuse of the shell? Because you see, using the same mutex for both of:
Don't you think that using the same mutex for both is way to coarse-grained and that's what leads to problems?
No, command was blocking to not let callback's context on the stack to go away.
That doesn't make any sense! In this PR it's very valid usecase to do that: shell command handler effectively hands over (borrows) shell context to a callback (and all callbacks run in another thread(s)). It even makes sure that there's mutual exclusion between the command handler and a callback - when one prints, another doesn't and vice-versa.
No amount of documentation will help, if the subsystem behavior is contra-logical. I am the command handler. When I'm called, I own the shell structure. If I want, I pass that shell structure to someone else for temporary ownership. When I do, I make sure that I don't use it while it's owned bu someone else. Please stay off my way while I'm doing that.
That was smart. That's the solution I was looking for yesterday night when I bravely wrote to @agross-linaro that "we can still do it different way". I just got fooled by that stupid "bool first". While I removed it eventually, I overlooked that we can ditch So, it's smart, but it's hack. There's nothing wrong to pass any additional data into callback. As there can be multiple shell instances (which for some reason called "backends"), we can't use static vars. Then the additional data needs to be stored on stack, and the command handler needs to block for lifetime of this additional data (can't just return, or it'll go off the stack). So, @jarz-nordic, @nordic-krch, guys, we didn't finish here yet. This issue will happen again. Unfortunately, it will happen rare enough, so it'll take long time to raise enough level of alert. And each time it will be painful debugging. Because it totally makes no sense to allow callbacks without params to work with shell, but deadlock on callbacks which take params. |
43cd3cf
to
64e89ed
Compare
Ok, the latest patch is based on @nordic-krch's solution (who is credited in the description), plus some cosmetic changes, as the callback is back to run (and print) asynchronously. As I said above, while this particular issue now (hopefully!) should be fixed, the whole matter of using shell with callbacks is very tangled, and it doesn't need to be. |
First: Regading your concer:
With regard to this observation, I believe that this could be improved and we will find a compromise. Frankly speaking, I did not consider such an application as you presented hence we are where we are... |
subsys/net/ip/net_shell.c
Outdated
user_data.user_data = &first; | ||
|
||
ret = dns_get_addr_info(host, qtype, NULL, dns_result_cb, &user_data, | ||
ret = dns_get_addr_info(host, qtype, NULL, dns_result_cb, shell, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've seen warning passing argument 5 of ‘dns_get_addr_info’ discards ‘const’ qualifier from pointer
and had to cast shell to (void *)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had to cast shell to
(void *)
Ack, thanks, added it.
DNS callback needs "struct shell *shell" data structure to pass as a parameter to shell print. How it was achieved previously is that it was packaged together with cosmetic "bool first" param into "struct net_shell_user_data" on the stack, and passed to the callback. The problem was that the original command handler then returned, so the "struct net_shell_user_data" on the stack was overwritten, and the callback crashed on accessing it. An obvious solution was to make that structure static, but that would leave to issues still, as turns out we allow system shell to be run as more than one concurrent instances. Next solution was to keep this structure on the stack, but block the command handler until callback is finished. However, that hit a deadlock due to not well thought out use of a mutex in the shell printing routines. The solution presented here is due to @nordic-krch, who noticed that "bool first" param is indeed cosmetic and not really required. Then we have only "struct shell *shell" to pass to the callback, and can do that in callback's pointer param directly, ditching "struct net_shell_user_data" which needs to be stored on the stack. Signed-off-by: Paul Sokolovsky <[email protected]>
64e89ed
to
c55722b
Compare
@jarz-nordic
But I don't think it was your fault, because it was me who added "static" in the very first version of this patch, then promptly corrected by you in #13260 (comment) . |
DNS callback needs "struct shell *shell" data structure to pass as
a parameter to shell print. How it was achieved previously is that
it was packaged together with cosmetic "bool first" param into
"struct net_shell_user_data" on the stack, and passed to the
callback. The problem was that the original command handler then
returned, so the "struct net_shell_user_data" on the stack was
overwritten, and the callback crashed on accessing it.
An obvious solution was to make that structure static, but that would
leave to issues still, as turns out we allow system shell to be run
as more than one concurrent instances.
Next solution was to keep this structure on the stack, but block the
command handler until callback is finished. However, that hit a
deadlock due to not well thought out use of a mutex in the shell
printing routines.
The solution presented here is due to @nordic-krch, who noticed that
"bool first" param is indeed cosmetic and not really required. Then
we have only "struct shell *shell" to pass to the callback, and can
do that in callback's pointer param directly, ditching
"struct net_shell_user_data" which needs to be stored on the stack.
Signed-off-by: Paul Sokolovsky [email protected]