Skip to content

Alleged multiple design and implementation issues with logging #11655

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pfalcon opened this issue Nov 26, 2018 · 21 comments
Closed

Alleged multiple design and implementation issues with logging #11655

pfalcon opened this issue Nov 26, 2018 · 21 comments
Assignees
Labels
area: Logging Enhancement Changes/Updates/Additions to existing features

Comments

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 26, 2018

It's the same story again (#11008, #11417) - logging doesn't log what's expected. Consider following snippet from ethernet driver ISR handler (i.e. it executes on packet reception) run in the context of samples/net/echo_server:

        printk("eth_smsc9220_isr: %x %x\n", SMSC9220->INT_STS, SMSC9220->INT_EN);

        struct eth_context *context = dev->driver_data;

        u32_t int_status = SMSC9220->INT_STS;
        /* Ack pending interrupts */
        SMSC9220->INT_STS = int_status;

        if (int_status & BIT(enum_smsc9220_interrupt_rxstatus_fifo_level)) {
                LOG_DBG("in RX FIFO: pkts: %u, bytes: %u",
                        SMSC9220_BFIELD(RX_FIFO_INF, RXSUSED),
                        SMSC9220_BFIELD(RX_FIFO_INF, RXDUSED));
                printk("this should be shown immediately after 'in RX FIFO'\n");
                u32_t pkt_size = smsc9220_peek_next_packet_size();
                LOG_DBG("peek sz: %u", pkt_size);
...
                LOG_DBG("out RX FIFO: pkts: %u, bytes: %u",
                        SMSC9220_BFIELD(RX_FIFO_INF, RXSUSED),
                        SMSC9220_BFIELD(RX_FIFO_INF, RXDUSED));

The output after startup followed the initial ping packet is (sorry for a lot of extra debug output):

uart:~$ init ok
eth_smsc9220_isr: 0 0
[00:00:03.381,509] <err> eth_smsc9220.eth_init: MAC 52:54:00:12:34:56
uart:~$ eth_initialize
get_cap
***** Booting Zephyr OS zephyr-v1.13.0-2175-g8759b9bec1 *****
eth_smsc9220_isr: 8 8
this should be shown immediately after 'in RX FIFO'
proto: 1
get_cap
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
eth_tx
eth_smsc9220_isr: 8 8
this should be shown immediately after 'in RX FIFO'
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
eth_tx
[00:00:17.896,378] <dbg> eth_smsc9220.eth_smsc9220_isr: out RX FIFO: pkts: 0, bytes: 0
[00:00:17.896,866] <dbg> net_arp.arp_update: (0x20000a38): src 192.0.2.2
[00:00:17.896,929] <dbg> net_arp.arp_entry_get_pending: (0x20000a38): dst 192.0.2.2
[00:00:17.896,992] <dbg> net_arp.arp_entry_find: (0x20000a38): iface 0x2000b0e0 dst 192.0.2.2
[00:00:17.897,187] <dbg> net_arp.arp_update: (0x20000a38): dst 192.0.2.2 pending 0x200053f8 frag 0x200058a8
[00:00:17.897,377] <dbg> net_arp.arp_entry_find_move_first: (0x20000a38): dst log_strdup pool empty!
[00:00:17.897,580] <dbg> net_arp.arp_entry_find: (0x20000a38): iface 0x2000b0e0 dst log_strdup pool empty!
[00:00:17.898,126] <dbg> net_arp.net_arp_prepare: (0x20000a38): ARP using ll log_strdup pool empty! for IP log_strdup pool empty!

What we should pay attention to is:

  1. On the first entry to ISR (first output eth_smsc9220_isr: 8 8) no debug logging occurs.
  2. On the 2nd entry to ISR, first debug output we get is [00:00:17.896,378] <dbg> eth_smsc9220.eth_smsc9220_isr: out RX FIFO: .... But per the code above, out RX FIFO: logging should occur after in RX FIFO:, and the latter didn't occur.

It should be noted that on the next ping packet, the problems subsides, and there's in RX FIFO: followed by out RX FIFO:.

Based on such a black-box analysis of logging behavior, we can make following conjectures (as conjectures, they may be false):

  1. (Knowing that logging is deferred,) logging has limited buffer to store output messages.
  2. This buffer is circular.
  3. When it overflows, older messages are overwritten SILENTLY.

But p.3 is absolutely IMPOSSIBLE way to deal with the situation. At worst, it should store a message notifying user that some output was lost (this is worst solution, as such a message can be lost itself). Normally, it should switch to non-deferred mode. Under condition logging should be dropped silently as a default option.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 26, 2018

Ping to all active TSC members: @nashif, @galak, @carlescufi, @MaureenHelm, etc. I'm developing a new ethernet driver and last 2 weeks any more or less work session ended up hitting the wall with logging (see ticket refs above).

I tried to explain why I submitted #10769 and #10770: they were early warning bells that something's not right with how logging and shell subsystems were reimplemented in Zephyr. When rewriting such baseline subsystems as logging and shell, following should be a baseline requirement and acceptance criteria:

New implementation of such subsystems should support all the features, and by default work the same, as the original subsystem (unless they were totally broken, and of course logging/shell in Zephyr wasn't).

Orthogonal to that, here's a baseline requirement for logging:

Logging should just work by default

And well, old logging just worked. And new one just DOES NOT, hence #11007, #11008, #11633, #11417, #10715, etc.

This is not to undermine the great work done by Nordic developers and @nordic-krch in particular. This is to raise before the TSC the question of change control and acceptance criteria handling in the project. Because these great changes should not have been accepted in the shape they went it. If new logging was supposed to replace old logging, it should have been developed and configured to mirror the functionality and behavior of the old logging. Any changes to that behavior should have gone as separate changes, under well rationed change control schedule (e.g. in different Zephyr release to the initial rewrite).

We consequently have 2 choices:

  • Choice A: Scale back. The biggest problem of the new logging is its deferred nature. That's absolutely great feature, I estimate that I personally would need it in 10% of cases. I.e. 1 debug session in 10, I need it. So far, I had 3 debug sessions with the driver I write, and each time I was blocked by logging's deferred nature.

  • Choice B: Push thru. If @nordic-krch confirms the matter in this ticket is due to small buffer space, let's configure some no-nonsense value for it, say 16KB. Then let's wait 6-12 months, while users adopt new Zephyr version (Zephyr community is rather sparse as we all know), understand that non-trivial amount of their scarce RAM is wasted for nothing, and wail back. The we can go to choice A.

Actually, we won't have 6-12 month to enjoy the great new features of the new logging - because as soon as we enable 16K of logging buffer bu default, half of boards will fall down in the CI - starting with nRF51. So hopefully, we can go to choice A real soon now.

I would like to raise these 2 topics:

  • change control/acceptance criteria for the baseline subsystems rewrite;
  • way forward with the logging

- to the next TSC meeting. Thanks!

@andrewboie
Copy link
Contributor

andrewboie commented Nov 27, 2018

Under the assumption that there is a clean layer of abstraction between the logging API and the buffering/backend, why not have a Kconfig which can give the user a choice whether logging should be buffered/deferred, or done synchronously. We won't have to impose a buffer size on small devices in that case.

I think this is what CONFIG_LOG_INPLACE_PROCESS does?
Edit: I see this is discussed in #11633

Also as a side note, I'd like to learn more why CONFIG_LOG_PROCESS_THREAD_PRIO is default -2
Wouldn't we want it to default to a very low preemptible priority?

@nordic-krch
Copy link
Collaborator

yes, the issue is probably the buffer size but actual size of the buffer is hard to determine since it is dynamic - you enable debug level in one module and suddenly you need twice bigger buffer. We can improve reporting of such scenarios.

I estimate that I personally would need it in 10% of cases.

Well, that's you. When you have embedded application that has time critical parts that is a must and we seen applications being developed that would suffer by lack of deferred logging. You should also see further since deferred logging allows not only logging to uart but also to flash or over network.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 27, 2018

but actual size of the buffer is hard to determine since it is dynamic - you enable debug level in one module and suddenly you need twice bigger buffer.

Exactly, so only size we can use is conservative, and that's a lot (maybe of course not 16K, but say 5K isn't too little too).

We can improve reporting of such scenarios.

Thanks. That would be indeed the baseline measure as argued above. But I wonder is there's enough experience collected already to take measures beyond that, so I'm really keen to raise this question for this week's TSC. Hope I'll do it right, and hope you can be there.

Well, that's you.

Yep, it's me. But what I do is just what any programmer can/would/will do - enable logging for networking. I didn't even enable global networking logging, just a handful of modules with:

CONFIG_ETHERNET_LOG_LEVEL_DBG=y
CONFIG_NET_ARP_LOG_LEVEL_DBG=y
CONFIG_NET_IPV4_LOG_LEVEL_DBG=y
CONFIG_NET_ICMPV4_LOG_LEVEL_DBG=y

When you have embedded application that has time critical parts that is a must and we seen applications being developed that would suffer by lack of deferred logging. You should also see further since deferred logging allows not only logging to uart but also to flash or over network.

But nobody says "Let's removed deferred logging"! It's absolutely great feature which should be at everyone fingertips for cases when it's really needed. But I see the situation exactly as you guys needed a solution to debug realtime stuff like Bluetooth, but overoptimistically went ahead to make it the default Zephyr logging. Unfortunately, practice shows that it scales poorly to the real-world Zephyr needs, doesn't integrate well (printk issues, etc.). I'm not sure about other subsystems, but networking is a complex, multilayer code. It does produce a lot of logging, and that logging required to debug issues due to the subsystem complexity. Then it's imperative that it works for programmer, not against one.

And even if logging for other individual subsystems is "simple", then Zephyr itself is vast and complex collection and interconnection of subsystems, and there's nothing wrong with enabling logging for multiple subsystems, and then we get to the same problems which are already exposed with networking.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 27, 2018

Related: #11625

@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 28, 2018

So, these issues were discussed at today's TSC meeting. The general response was that deferred logging is valuable and we want to "get it right", no give up on it. Fine, let's look into that. Current points then:

  1. Get non-deferred logging to actually work (or teach people how to get it working, though again, the point is that it "should just work"), blocked by logging: CONFIG_LOG_INPLACE_PROCESS is not synchronous #11633
  2. Look into buffer sizes used by logging currently, see how they need to be increased for various usecases, then see how we can increase defaults in the "best" way in some definition of the best.
  3. All other things (like never drop logging silently, etc.)

Ping to @pfl in particular based on the comments, I full agree that we need to share such information, and work on common ways to deal with it, rather everyone painfully find workarounds on their own later. That's whole point of me raising this issue. Likewise, I agree that solutions need to be worked out and elaborated first, that's why again this is a ticket, primarily waiting for feedback from the logging maintainer on how to proceed further.

@andrewboie
Copy link
Contributor

Fine, let's look into that. Current points then:

    Get non-deferred logging to actually work (or teach people how to get it working, though again, the point is that it "should just work"), blocked by #11633
    Look into buffer sizes used by logging currently, see how they need to be increased for various usecases, then see how we can increase defaults in the "best" way in some definition of the best.
    All other things (like never drop logging silently, etc.)

Please close this and file specific bugs for each of these issues if bugs don't exist already.

Having one bug open where you just have a litany of complaints (with a grandiose title like "LOGGING CAN'T BE TRUSTED") isn't terribly helpful in getting these issues prioritized and resolved.

Thanks

@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 28, 2018

@andrewboie, not reading description and previous comments and posting pundit-style comments isn't helpful either. Because there's a whole bunch of tickets already, and maybe time is indeed to take "grandiose" measures and do something with logging with bird's eye view of its issues in mind.

Otherwise, if you've got time to comment around, can you please spend it on replying to exactly specific tickets regarding stuff you did before, like #10499, #10911.

Thanks too.

@andrewboie
Copy link
Contributor

not reading description

I read the entire discussion.

Because there's a whole bunch of tickets already

Great!

maybe time is indeed to take "grandiose" measures and do something with logging with bird's eye view of its issues in mind.

No. These are bugs. They get solved. Take it to the mailing list or the TSC if you want to get on a soapbox.

@zephyrproject-rtos zephyrproject-rtos locked and limited conversation to collaborators Nov 28, 2018
@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 29, 2018

@andrewboie: Please don't close tickets open by other people unless there're resolved. I don't know where manners like that come from, but you already showed your negligence of project technical issues in #5355 and discouraging attitude towards people who report it. Please contain yourself and don't let that spread out across the entire project. If you ignore and don't care about various issues, let the other people who care work towards the solutions.

@andrewboie andrewboie locked and limited conversation to collaborators 9 hours ago

This is especially hilarious. Do you just want your own cozy entry in the latest Github hit, https://github.com/nikolas/github-drama , and that's why you behave like that?

@pfalcon pfalcon reopened this Nov 29, 2018
@zephyrproject-rtos zephyrproject-rtos unlocked this conversation Nov 29, 2018
@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 29, 2018

No. These are bugs. They get solved. Take it to the mailing list or the TSC if you want to get on a soapbox.

This was taken to the TSC, exactly because it's easy to get into "let's ignore and work around issues" behavior, and I'm quite along that way, as #11633 (comment) .

And the biggest bug so far is "how did we get to such problems in the first place", and this is to what this ticket is dedicated. And replies like #11633 (comment) "if shell is used then it is more complex", shows that it's exactly the most important issue to pay attention to.

We can resolve minor technical issues in a loop infinitely, never getting it right, because the problem is on the design and process level.

So, this ticket is to track the progress of disillusionment regarding usage of deferred logging as the default setting. Or alternatively, to track the progress of finding the sweet spot of compromises with it, and me eating my hat, which I'm fine with - there's always something to learn, and witnessing a magic unicorn solution is fully worth it, if only resolving each of them which Zephyr has would be as easy as me eating a hat.

@nashif
Copy link
Member

nashif commented Nov 29, 2018

@pfalcon your style of opening multiple issues and cross linking them over and over again and using the issue system for open discussions is confusing and irritating.

This is an issue tracking system, not a forum or a mailing list. We need real bug reports with details how they can be reproduced and other information related to a specific bug to help resolve the issues, not open-ended rants that can't be acted upon directly. I am already lost in the number of issues and cross-linking that I (and probably others) are not able to focus on the real problems. This entry is not even marked asa bug, so I am not sure what you are expecting here.

@pfalcon pfalcon added the RFC Request For Comments: want input from the community label Nov 29, 2018
@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 29, 2018

@nashif

@pfalcon your style of opening multiple issues and cross linking them over and over again and using the issue system for open discussions is confusing and irritating.

Really? ;-) I'm taking a bad example from guys like laperie who open multiple issue and cross link them, e.g. #7591 .

As for "using the issue system for open discussions is confusing and irritating" - that's a hit. What else do you think issue systems are? For discussion of issues and finding ways to solve them.

We need real bug reports with details how they can be reproduced and other information related to a specific bug to help resolve the issues,

In the description the code snippet is provided. If you need the exact setup to reproduce, I'm happy to provide one (it's based on #11680). Otherwise, I'm working with @nordic-krch and happy to provide any info and cooperate with him.

not open-ended rants that can't be acted upon directly.

Well, for me "action" is clear - we need to switch default logging to non-deferred, and this point is argued here. But you guys disagree, that apparently what turn it into a "rant". However this is a technical/project management ticket, and I'd appreciate it being treated as such.

I am already lost in the number of issues and cross-linking that I (and probably others) are not able to focus on the real problems.

Unfortunately, that's because there's an interconnected number of problems. And to keep track of it and untangle it, we exactly need a number of properly cross-references tickets.

This entry is not even marked asa bug, so I am not sure what you are expecting here.

Marked as RFC, thanks.

@andrewboie
Copy link
Contributor

@andrewboie: Please don't close tickets open by other people unless there're resolved. I don't know where manners like that come from, but you already showed your negligence of project technical issues in #5355 and discouraging attitude towards people who report it. Please contain yourself and don't let that spread out across the entire project. If you ignore and don't care about various issues, let the other people who care work towards the solutions.

@andrewboie andrewboie locked and limited conversation to collaborators 9 hours ago

This is especially hilarious. Do you just want your own cozy entry in the latest Github hit, nikolas/github-drama , and that's why you behave like that?

You lecturing other people about attitude is irony on a cosmic level.

The amount of time you spent writing these walls of text would be better spent looking at the log code helping to fix it.

Meanwhile, since you are clearly in a lot of agony about this whole situation, I leave you with the world's smallest violin, playing you a sad, sad, song: 🎻

@pfalcon
Copy link
Collaborator Author

pfalcon commented Nov 29, 2018

The amount of time you spent writing these walls of text would be better spent looking at the log code helping to fix it.

May end up even worse waste of time if turns out that I'd pull it in the opposite direction the maintainer and other people want it to be. Need to sync up on the action plan first, waiting for responses from @nordic-krch (he apparently multitasks a lot too, everyone works hard to get that proverbial LTS out and not be ashamed of).

I leave you with the world's smallest violin, playing you a sad, sad, song: violin

Thanks, appreciate it.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Dec 27, 2018

The history unfolds:

2018-12-12 #12040 "Log messages are dropped even with huge log buffer" occurred. Was fixed, but:
2012-12-19 "logging: previous changes to dropped message notification break wifi sample debug" occurred

Now I'm back to hacking on my #11680, and here's what I'm greeted with:

ASSERTION FAIL [!_IsInIsr() || timeout == 0] @ /home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/kernel/msg_q.c:113

Quick grep over subsys/logging/ doesn't show anything directly related to k_msgq, but it's definitely logging-related, because commenting a particular LOG_DBG makes that assert go away. And yeah, that LOG_DBG is in the func called from ISR, and yeah, I need it there ;-).

So, this again should show skeptics among us why this issue exists, precisely titled as it is. It's not about a particular technical problem, it's about paradigmatic approach to the logging design. Particular issues can be fixed. Just to spawn new issues. The process may asymptotically subside. But just imagine that we're trying to reach value of "pi" using an asymptotic formula for "e". We'll never get there using that formula, no matter how hard we try.

@nashif
Copy link
Member

nashif commented Jan 11, 2019

Please consider changing the title of this issue, the title does not look like an RFC, nor it does look like an enhancement. If this is a bug, it should be marked as such and prioritised to be fixed for the next release.

@pfalcon pfalcon changed the title logging cannot be trusted Alleged multiple design and implementation issues with logging Jan 11, 2019
@nordic-krch
Copy link
Collaborator

I think that this could be closed now as issues are solved:

  • LOG_IMMEDIATE provides blocking, synchronous logging which in simplest way (single backend) works as legacy logger
  • if deferred mode is used then information about log messages being dropped is visible on the output

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 12, 2019

* `LOG_IMMEDIATE` provides blocking, synchronous logging which in simplest way (single backend) works as legacy logger

Right. So, I don't think this ticket will be resolved (note: not using the term "closed") until:

  1. IMMEDIATE is made the default case.
  2. Cases like frdm_k64f: samples/net/sockets/echo_server doesn't work #13251 (comment) don't occur. To give a summary what happens there is:
  • (after long and painful debugging) user: "shell is the culprit!"
  • shell maintainer: "no, shell is not culprit, look at the logger"
  • shell and logger maintainer together: "but if make all parts of the system deadlock on each other, completely removing any possibility for deadlock debugging, we can show that there's at least in one case all this stuff doesn't deadlock!"

But the whole subject of this ticket that such cases will reoccur again and again. Unfortunately, they reoccur rarely enough, so it will take a long time in build up a consistent developer opinion regarding that way of working.

@nordic-krch
Copy link
Collaborator

nordic-krch commented Feb 12, 2019

  1. IMMEDIATE is made the default case.

I hope that will never happen. It seems then that it will stay open for a while.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Oct 1, 2019

   IMMEDIATE is made the default case.

I hope that will never happen.

I guess we have had a close to full circle on this with
#19317 and #19322. The new old printk-based logging isn't made the default, but at least it's available at everyone's fingertips. I guess it makes sense to close this issue.

@pfalcon pfalcon closed this as completed Oct 1, 2019
@nashif nashif removed the RFC Request For Comments: want input from the community label May 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Logging Enhancement Changes/Updates/Additions to existing features
Projects
None yet
Development

No branches or pull requests

5 participants