Alleged multiple design and implementation issues with logging #11655

pfalcon · 2018-11-26T15:36:07Z

It's the same story again (#11008, #11417) - logging doesn't log what's expected. Consider following snippet from ethernet driver ISR handler (i.e. it executes on packet reception) run in the context of samples/net/echo_server:

        printk("eth_smsc9220_isr: %x %x\n", SMSC9220->INT_STS, SMSC9220->INT_EN);

        struct eth_context *context = dev->driver_data;

        u32_t int_status = SMSC9220->INT_STS;
        /* Ack pending interrupts */
        SMSC9220->INT_STS = int_status;

        if (int_status & BIT(enum_smsc9220_interrupt_rxstatus_fifo_level)) {
                LOG_DBG("in RX FIFO: pkts: %u, bytes: %u",
                        SMSC9220_BFIELD(RX_FIFO_INF, RXSUSED),
                        SMSC9220_BFIELD(RX_FIFO_INF, RXDUSED));
                printk("this should be shown immediately after 'in RX FIFO'\n");
                u32_t pkt_size = smsc9220_peek_next_packet_size();
                LOG_DBG("peek sz: %u", pkt_size);
...
                LOG_DBG("out RX FIFO: pkts: %u, bytes: %u",
                        SMSC9220_BFIELD(RX_FIFO_INF, RXSUSED),
                        SMSC9220_BFIELD(RX_FIFO_INF, RXDUSED));

The output after startup followed the initial ping packet is (sorry for a lot of extra debug output):

uart:~$ init ok
eth_smsc9220_isr: 0 0
[00:00:03.381,509] <err> eth_smsc9220.eth_init: MAC 52:54:00:12:34:56
uart:~$ eth_initialize
get_cap
***** Booting Zephyr OS zephyr-v1.13.0-2175-g8759b9bec1 *****
eth_smsc9220_isr: 8 8
this should be shown immediately after 'in RX FIFO'
proto: 1
get_cap
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
eth_tx
eth_smsc9220_isr: 8 8
this should be shown immediately after 'in RX FIFO'
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
log_strdup: no mem
eth_tx
[00:00:17.896,378] <dbg> eth_smsc9220.eth_smsc9220_isr: out RX FIFO: pkts: 0, bytes: 0
[00:00:17.896,866] <dbg> net_arp.arp_update: (0x20000a38): src 192.0.2.2
[00:00:17.896,929] <dbg> net_arp.arp_entry_get_pending: (0x20000a38): dst 192.0.2.2
[00:00:17.896,992] <dbg> net_arp.arp_entry_find: (0x20000a38): iface 0x2000b0e0 dst 192.0.2.2
[00:00:17.897,187] <dbg> net_arp.arp_update: (0x20000a38): dst 192.0.2.2 pending 0x200053f8 frag 0x200058a8
[00:00:17.897,377] <dbg> net_arp.arp_entry_find_move_first: (0x20000a38): dst log_strdup pool empty!
[00:00:17.897,580] <dbg> net_arp.arp_entry_find: (0x20000a38): iface 0x2000b0e0 dst log_strdup pool empty!
[00:00:17.898,126] <dbg> net_arp.net_arp_prepare: (0x20000a38): ARP using ll log_strdup pool empty! for IP log_strdup pool empty!

What we should pay attention to is:

On the first entry to ISR (first output eth_smsc9220_isr: 8 8) no debug logging occurs.
On the 2nd entry to ISR, first debug output we get is [00:00:17.896,378] <dbg> eth_smsc9220.eth_smsc9220_isr: out RX FIFO: .... But per the code above, out RX FIFO: logging should occur after in RX FIFO:, and the latter didn't occur.

It should be noted that on the next ping packet, the problems subsides, and there's in RX FIFO: followed by out RX FIFO:.

Based on such a black-box analysis of logging behavior, we can make following conjectures (as conjectures, they may be false):

(Knowing that logging is deferred,) logging has limited buffer to store output messages.
This buffer is circular.
When it overflows, older messages are overwritten SILENTLY.

But p.3 is absolutely IMPOSSIBLE way to deal with the situation. At worst, it should store a message notifying user that some output was lost (this is worst solution, as such a message can be lost itself). Normally, it should switch to non-deferred mode. Under condition logging should be dropped silently as a default option.

The text was updated successfully, but these errors were encountered:

pfalcon · 2018-11-26T16:16:43Z

Ping to all active TSC members: @nashif, @galak, @carlescufi, @MaureenHelm, etc. I'm developing a new ethernet driver and last 2 weeks any more or less work session ended up hitting the wall with logging (see ticket refs above).

I tried to explain why I submitted #10769 and #10770: they were early warning bells that something's not right with how logging and shell subsystems were reimplemented in Zephyr. When rewriting such baseline subsystems as logging and shell, following should be a baseline requirement and acceptance criteria:

New implementation of such subsystems should support all the features, and by default work the same, as the original subsystem (unless they were totally broken, and of course logging/shell in Zephyr wasn't).

Orthogonal to that, here's a baseline requirement for logging:

Logging should just work by default

And well, old logging just worked. And new one just DOES NOT, hence #11007, #11008, #11633, #11417, #10715, etc.

This is not to undermine the great work done by Nordic developers and @nordic-krch in particular. This is to raise before the TSC the question of change control and acceptance criteria handling in the project. Because these great changes should not have been accepted in the shape they went it. If new logging was supposed to replace old logging, it should have been developed and configured to mirror the functionality and behavior of the old logging. Any changes to that behavior should have gone as separate changes, under well rationed change control schedule (e.g. in different Zephyr release to the initial rewrite).

We consequently have 2 choices:

Choice A: Scale back. The biggest problem of the new logging is its deferred nature. That's absolutely great feature, I estimate that I personally would need it in 10% of cases. I.e. 1 debug session in 10, I need it. So far, I had 3 debug sessions with the driver I write, and each time I was blocked by logging's deferred nature.
Choice B: Push thru. If @nordic-krch confirms the matter in this ticket is due to small buffer space, let's configure some no-nonsense value for it, say 16KB. Then let's wait 6-12 months, while users adopt new Zephyr version (Zephyr community is rather sparse as we all know), understand that non-trivial amount of their scarce RAM is wasted for nothing, and wail back. The we can go to choice A.

Actually, we won't have 6-12 month to enjoy the great new features of the new logging - because as soon as we enable 16K of logging buffer bu default, half of boards will fall down in the CI - starting with nRF51. So hopefully, we can go to choice A real soon now.

I would like to raise these 2 topics:

change control/acceptance criteria for the baseline subsystems rewrite;
way forward with the logging

- to the next TSC meeting. Thanks!

andrewboie · 2018-11-27T01:42:41Z

Under the assumption that there is a clean layer of abstraction between the logging API and the buffering/backend, why not have a Kconfig which can give the user a choice whether logging should be buffered/deferred, or done synchronously. We won't have to impose a buffer size on small devices in that case.

I think this is what CONFIG_LOG_INPLACE_PROCESS does?
Edit: I see this is discussed in #11633

Also as a side note, I'd like to learn more why CONFIG_LOG_PROCESS_THREAD_PRIO is default -2
Wouldn't we want it to default to a very low preemptible priority?

nordic-krch · 2018-11-27T06:39:02Z

yes, the issue is probably the buffer size but actual size of the buffer is hard to determine since it is dynamic - you enable debug level in one module and suddenly you need twice bigger buffer. We can improve reporting of such scenarios.

I estimate that I personally would need it in 10% of cases.

Well, that's you. When you have embedded application that has time critical parts that is a must and we seen applications being developed that would suffer by lack of deferred logging. You should also see further since deferred logging allows not only logging to uart but also to flash or over network.

pfalcon · 2018-11-27T08:06:31Z

but actual size of the buffer is hard to determine since it is dynamic - you enable debug level in one module and suddenly you need twice bigger buffer.

Exactly, so only size we can use is conservative, and that's a lot (maybe of course not 16K, but say 5K isn't too little too).

We can improve reporting of such scenarios.

Thanks. That would be indeed the baseline measure as argued above. But I wonder is there's enough experience collected already to take measures beyond that, so I'm really keen to raise this question for this week's TSC. Hope I'll do it right, and hope you can be there.

Well, that's you.

Yep, it's me. But what I do is just what any programmer can/would/will do - enable logging for networking. I didn't even enable global networking logging, just a handful of modules with:

CONFIG_ETHERNET_LOG_LEVEL_DBG=y
CONFIG_NET_ARP_LOG_LEVEL_DBG=y
CONFIG_NET_IPV4_LOG_LEVEL_DBG=y
CONFIG_NET_ICMPV4_LOG_LEVEL_DBG=y

When you have embedded application that has time critical parts that is a must and we seen applications being developed that would suffer by lack of deferred logging. You should also see further since deferred logging allows not only logging to uart but also to flash or over network.

But nobody says "Let's removed deferred logging"! It's absolutely great feature which should be at everyone fingertips for cases when it's really needed. But I see the situation exactly as you guys needed a solution to debug realtime stuff like Bluetooth, but overoptimistically went ahead to make it the default Zephyr logging. Unfortunately, practice shows that it scales poorly to the real-world Zephyr needs, doesn't integrate well (printk issues, etc.). I'm not sure about other subsystems, but networking is a complex, multilayer code. It does produce a lot of logging, and that logging required to debug issues due to the subsystem complexity. Then it's imperative that it works for programmer, not against one.

And even if logging for other individual subsystems is "simple", then Zephyr itself is vast and complex collection and interconnection of subsystems, and there's nothing wrong with enabling logging for multiple subsystems, and then we get to the same problems which are already exposed with networking.

pfalcon · 2018-11-27T14:26:13Z

Related: #11625

pfalcon · 2018-11-28T17:16:47Z

So, these issues were discussed at today's TSC meeting. The general response was that deferred logging is valuable and we want to "get it right", no give up on it. Fine, let's look into that. Current points then:

Get non-deferred logging to actually work (or teach people how to get it working, though again, the point is that it "should just work"), blocked by logging: CONFIG_LOG_INPLACE_PROCESS is not synchronous #11633
Look into buffer sizes used by logging currently, see how they need to be increased for various usecases, then see how we can increase defaults in the "best" way in some definition of the best.
All other things (like never drop logging silently, etc.)

Ping to @pfl in particular based on the comments, I full agree that we need to share such information, and work on common ways to deal with it, rather everyone painfully find workarounds on their own later. That's whole point of me raising this issue. Likewise, I agree that solutions need to be worked out and elaborated first, that's why again this is a ticket, primarily waiting for feedback from the logging maintainer on how to proceed further.

andrewboie · 2018-11-28T22:42:16Z

Fine, let's look into that. Current points then:

    Get non-deferred logging to actually work (or teach people how to get it working, though again, the point is that it "should just work"), blocked by #11633
    Look into buffer sizes used by logging currently, see how they need to be increased for various usecases, then see how we can increase defaults in the "best" way in some definition of the best.
    All other things (like never drop logging silently, etc.)

Please close this and file specific bugs for each of these issues if bugs don't exist already.

Having one bug open where you just have a litany of complaints (with a grandiose title like "LOGGING CAN'T BE TRUSTED") isn't terribly helpful in getting these issues prioritized and resolved.

Thanks

pfalcon · 2018-11-28T23:02:56Z

@andrewboie, not reading description and previous comments and posting pundit-style comments isn't helpful either. Because there's a whole bunch of tickets already, and maybe time is indeed to take "grandiose" measures and do something with logging with bird's eye view of its issues in mind.

Otherwise, if you've got time to comment around, can you please spend it on replying to exactly specific tickets regarding stuff you did before, like #10499, #10911.

Thanks too.

andrewboie · 2018-11-28T23:24:10Z

not reading description

I read the entire discussion.

Because there's a whole bunch of tickets already

Great!

maybe time is indeed to take "grandiose" measures and do something with logging with bird's eye view of its issues in mind.

No. These are bugs. They get solved. Take it to the mailing list or the TSC if you want to get on a soapbox.

pfalcon · 2018-11-29T08:22:48Z

@andrewboie: Please don't close tickets open by other people unless there're resolved. I don't know where manners like that come from, but you already showed your negligence of project technical issues in #5355 and discouraging attitude towards people who report it. Please contain yourself and don't let that spread out across the entire project. If you ignore and don't care about various issues, let the other people who care work towards the solutions.

@andrewboie andrewboie locked and limited conversation to collaborators 9 hours ago

This is especially hilarious. Do you just want your own cozy entry in the latest Github hit, https://github.com/nikolas/github-drama , and that's why you behave like that?

pfalcon · 2018-11-29T08:36:12Z

No. These are bugs. They get solved. Take it to the mailing list or the TSC if you want to get on a soapbox.

This was taken to the TSC, exactly because it's easy to get into "let's ignore and work around issues" behavior, and I'm quite along that way, as #11633 (comment) .

And the biggest bug so far is "how did we get to such problems in the first place", and this is to what this ticket is dedicated. And replies like #11633 (comment) "if shell is used then it is more complex", shows that it's exactly the most important issue to pay attention to.

We can resolve minor technical issues in a loop infinitely, never getting it right, because the problem is on the design and process level.

So, this ticket is to track the progress of disillusionment regarding usage of deferred logging as the default setting. Or alternatively, to track the progress of finding the sweet spot of compromises with it, and me eating my hat, which I'm fine with - there's always something to learn, and witnessing a magic unicorn solution is fully worth it, if only resolving each of them which Zephyr has would be as easy as me eating a hat.

nashif · 2018-11-29T16:00:50Z

@pfalcon your style of opening multiple issues and cross linking them over and over again and using the issue system for open discussions is confusing and irritating.

This is an issue tracking system, not a forum or a mailing list. We need real bug reports with details how they can be reproduced and other information related to a specific bug to help resolve the issues, not open-ended rants that can't be acted upon directly. I am already lost in the number of issues and cross-linking that I (and probably others) are not able to focus on the real problems. This entry is not even marked asa bug, so I am not sure what you are expecting here.

pfalcon · 2018-11-29T16:18:48Z

@nashif

@pfalcon your style of opening multiple issues and cross linking them over and over again and using the issue system for open discussions is confusing and irritating.

Really? ;-) I'm taking a bad example from guys like laperie who open multiple issue and cross link them, e.g. #7591 .

As for "using the issue system for open discussions is confusing and irritating" - that's a hit. What else do you think issue systems are? For discussion of issues and finding ways to solve them.

We need real bug reports with details how they can be reproduced and other information related to a specific bug to help resolve the issues,

In the description the code snippet is provided. If you need the exact setup to reproduce, I'm happy to provide one (it's based on #11680). Otherwise, I'm working with @nordic-krch and happy to provide any info and cooperate with him.

not open-ended rants that can't be acted upon directly.

Well, for me "action" is clear - we need to switch default logging to non-deferred, and this point is argued here. But you guys disagree, that apparently what turn it into a "rant". However this is a technical/project management ticket, and I'd appreciate it being treated as such.

I am already lost in the number of issues and cross-linking that I (and probably others) are not able to focus on the real problems.

Unfortunately, that's because there's an interconnected number of problems. And to keep track of it and untangle it, we exactly need a number of properly cross-references tickets.

This entry is not even marked asa bug, so I am not sure what you are expecting here.

Marked as RFC, thanks.

andrewboie · 2018-11-29T17:31:11Z

@andrewboie: Please don't close tickets open by other people unless there're resolved. I don't know where manners like that come from, but you already showed your negligence of project technical issues in #5355 and discouraging attitude towards people who report it. Please contain yourself and don't let that spread out across the entire project. If you ignore and don't care about various issues, let the other people who care work towards the solutions.

@andrewboie andrewboie locked and limited conversation to collaborators 9 hours ago

This is especially hilarious. Do you just want your own cozy entry in the latest Github hit, nikolas/github-drama , and that's why you behave like that?

You lecturing other people about attitude is irony on a cosmic level.

The amount of time you spent writing these walls of text would be better spent looking at the log code helping to fix it.

Meanwhile, since you are clearly in a lot of agony about this whole situation, I leave you with the world's smallest violin, playing you a sad, sad, song: 🎻

pfalcon · 2018-11-29T17:44:49Z

The amount of time you spent writing these walls of text would be better spent looking at the log code helping to fix it.

May end up even worse waste of time if turns out that I'd pull it in the opposite direction the maintainer and other people want it to be. Need to sync up on the action plan first, waiting for responses from @nordic-krch (he apparently multitasks a lot too, everyone works hard to get that proverbial LTS out and not be ashamed of).

I leave you with the world's smallest violin, playing you a sad, sad, song: violin

Thanks, appreciate it.

pfalcon · 2018-12-27T10:15:09Z

The history unfolds:

2018-12-12 #12040 "Log messages are dropped even with huge log buffer" occurred. Was fixed, but:
2012-12-19 "logging: previous changes to dropped message notification break wifi sample debug" occurred

Now I'm back to hacking on my #11680, and here's what I'm greeted with:

ASSERTION FAIL [!_IsInIsr() || timeout == 0] @ /home/pfalcon/projects-3rdparty/Embedded/Zephyr/zephyr/kernel/msg_q.c:113

Quick grep over subsys/logging/ doesn't show anything directly related to k_msgq, but it's definitely logging-related, because commenting a particular LOG_DBG makes that assert go away. And yeah, that LOG_DBG is in the func called from ISR, and yeah, I need it there ;-).

So, this again should show skeptics among us why this issue exists, precisely titled as it is. It's not about a particular technical problem, it's about paradigmatic approach to the logging design. Particular issues can be fixed. Just to spawn new issues. The process may asymptotically subside. But just imagine that we're trying to reach value of "pi" using an asymptotic formula for "e". We'll never get there using that formula, no matter how hard we try.

nashif · 2019-01-11T18:03:16Z

Please consider changing the title of this issue, the title does not look like an RFC, nor it does look like an enhancement. If this is a bug, it should be marked as such and prioritised to be fixed for the next release.

nordic-krch · 2019-02-12T08:00:24Z

I think that this could be closed now as issues are solved:

LOG_IMMEDIATE provides blocking, synchronous logging which in simplest way (single backend) works as legacy logger
if deferred mode is used then information about log messages being dropped is visible on the output

pfalcon · 2019-02-12T08:37:50Z

* `LOG_IMMEDIATE` provides blocking, synchronous logging which in simplest way (single backend) works as legacy logger

Right. So, I don't think this ticket will be resolved (note: not using the term "closed") until:

IMMEDIATE is made the default case.
Cases like frdm_k64f: samples/net/sockets/echo_server doesn't work #13251 (comment) don't occur. To give a summary what happens there is:

(after long and painful debugging) user: "shell is the culprit!"
shell maintainer: "no, shell is not culprit, look at the logger"
shell and logger maintainer together: "but if make all parts of the system deadlock on each other, completely removing any possibility for deadlock debugging, we can show that there's at least in one case all this stuff doesn't deadlock!"

But the whole subject of this ticket that such cases will reoccur again and again. Unfortunately, they reoccur rarely enough, so it will take a long time in build up a consistent developer opinion regarding that way of working.

nordic-krch · 2019-02-12T08:40:54Z

IMMEDIATE is made the default case.

I hope that will never happen. It seems then that it will stay open for a while.

pfalcon · 2019-10-01T13:35:31Z

   IMMEDIATE is made the default case.

I hope that will never happen.

I guess we have had a close to full circle on this with
#19317 and #19322. The new old printk-based logging isn't made the default, but at least it's available at everyone's fingertips. I guess it makes sense to close this issue.

pfalcon added priority: high High impact/importance bug area: Logging labels Nov 26, 2018

pfalcon assigned nordic-krch Nov 26, 2018

zephyrproject-rtos locked and limited conversation to collaborators Nov 28, 2018

andrewboie closed this as completed Nov 28, 2018

pfalcon reopened this Nov 29, 2018

zephyrproject-rtos unlocked this conversation Nov 29, 2018

pfalcon added the RFC Request For Comments: want input from the community label Nov 29, 2018

nordic-krch mentioned this issue Nov 30, 2018

logger drops log messages without any indication #11763

Closed

pfalcon mentioned this issue Nov 30, 2018

logging: log_core: Clarify message when log_strdup failed allocation #11735

Merged

galak added the Enhancement Changes/Updates/Additions to existing features label Dec 4, 2018

pfalcon mentioned this issue Dec 27, 2018

WIP misc: assert: Allow __ASSERT to call k_panic #12233

Closed

pfalcon changed the title ~~logging cannot be trusted~~ Alleged multiple design and implementation issues with logging Jan 11, 2019

pfalcon mentioned this issue Jan 15, 2019

Logging with CONFIG_LOG_IMMEDIATE=y from ISR leads to assert #12494

Closed

pfalcon mentioned this issue Feb 27, 2019

CONFIG_LOG_IMMEDIATE leads to unobvious faults in unrelated rotines due to stack overflow #13897

Closed

pfalcon removed the priority: high High impact/importance bug label Feb 27, 2019

pfalcon closed this as completed Oct 1, 2019

nashif removed the RFC Request For Comments: want input from the community label May 29, 2025

Alleged multiple design and implementation issues with logging #11655

Alleged multiple design and implementation issues with logging #11655

Comments

pfalcon commented Nov 26, 2018

pfalcon commented Nov 26, 2018

Uh oh!

andrewboie commented Nov 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nordic-krch commented Nov 27, 2018

Uh oh!

pfalcon commented Nov 27, 2018

Uh oh!

pfalcon commented Nov 27, 2018

Uh oh!

pfalcon commented Nov 28, 2018

Uh oh!

andrewboie commented Nov 28, 2018

Uh oh!

pfalcon commented Nov 28, 2018

Uh oh!

andrewboie commented Nov 28, 2018

Uh oh!

pfalcon commented Nov 29, 2018

Uh oh!

pfalcon commented Nov 29, 2018

Uh oh!

nashif commented Nov 29, 2018

Uh oh!

pfalcon commented Nov 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrewboie commented Nov 29, 2018

Uh oh!

pfalcon commented Nov 29, 2018

Uh oh!

pfalcon commented Dec 27, 2018

Uh oh!

nashif commented Jan 11, 2019

Uh oh!

nordic-krch commented Feb 12, 2019

Uh oh!

pfalcon commented Feb 12, 2019

Uh oh!

nordic-krch commented Feb 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pfalcon commented Oct 1, 2019

Uh oh!

andrewboie commented Nov 27, 2018 •

edited

Loading

pfalcon commented Nov 29, 2018 •

edited

Loading

nordic-krch commented Feb 12, 2019 •

edited

Loading