kernel: Non-deterministic and very high ISR latencies #13610

cvinayak · 2019-02-21T05:43:12Z

Describe the bug
Non-deterministic and very high ISR latencies. Bluetooth controller asserts when running samples/bluetooth/central_hr on nrf51_pca10028 board.

[2019-02-21 10:54:43.408] ***** Booting Zephyr OS v1.14.0-rc1-268-g8bd360fa14 *****
[2019-02-21 10:54:43.426] Bluetooth initialized
[2019-02-21 10:54:43.428] [00:00Scanning successfully started
[2019-02-21 10:54:43.434] :00.010,650] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[2019-02-21 10:54:43.443] [00:00:00.010,681] <inf> bt_hci_core: HW Variant: nRF51x (0x0001)
[2019-02-21 10:54:43.453] [00:00:00.010,742] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 0
[2019-02-21 10:54:43.467] [00:00:00.013,031] <inf> bt_h[DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 0, AD data len 14, RSSI -46
[2019-02-21 10:54:43.496] [AD]: 1 data_len 1
[2019-02-21 10:54:43.499] [AD]: 255 data_len 9
[2019-02-21 10:54:44.029] [DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 4, AD data len 0, RSSI -46
[2019-02-21 10:54:46.277] ci_core: Identity: d2:06:aa:d5:f3:9a (random)
[2019-02-21 10:54:46.281] [00:00:00.013,092] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[2019-02-21 10:54:46.292] [00:00:00.013,122] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
[2019-02-21 10:54:47.015] [00:00:03.627,593] <err> bt_ctlr_llsw_ctrl: assert: '!radio_is_ready()' failed
[2019-02-21 10:54:47.023] ***** HARD FAULT *****
[2019-02-21 10:54:47.025] ***** Hardware exception *****
[2019-02-21 10:54:47.028] Current thread ID = 0x200015dc
[2019-02-21 10:54:47.031] Faulting instruction address = 0x12cbe
[2019-02-21 10:54:47.034] Fatal fault in ISR! Spinning...

If a connection establishes with a peripheral in vicinity, the Max. ISR latencies measured is very high, over 100 us.

Legend for the log message:

<inf> bt_ctlr_hci: l: 9, 8, 111; t: 46, 46, 95.

l: Radio peripheral ISR latency Current value of 9us, Minimum of 8us, Maximum of 111us.
t: Radio peripheral ISR CPU usage Current value of 46, Minimum of 46us, Maximum of 95us.

[2019-02-21 10:47:17.917] ***** Booting Zephyr OS v1.14.0-rc1-268-g8bd360fa14 *****
[2019-02-21 10:47:17.933] Bluetooth �***** Booting Zephyr OS v1.14.0-rc1-268-g8bd360fa14 *****
[2019-02-21 10:47:17.983] Bluetooth initialized
[2019-02-21 10:47:17.986] [00:00:0Scanning successfully started
[2019-02-21 10:47:17.991] 0.010,650] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[2019-02-21 10:47:18.001] [00:00:00.010,681] <inf> bt_hci_[DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 0, AD data len 21, RSSI -49
[2019-02-21 10:47:18.014] [AD]: 1 data_len 1
[2019-02-21 10:47:18.016] [AD]: 3 data_len 4
[2019-02-21 10:47:18.023] core: HW Variant: nRF51x (0x0001)
[2019-02-21 10:47:18.222] [00:00:00.010,711] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0Connected: 61:bb:ca:fe:4b:d5 (random)
[2019-02-21 10:47:18.488] [ATTRIBUTE] handle 39
[2019-02-21 10:47:18.838] [ATTRIBUTE] handle 45
[2019-02-21 10:47:18.938] [ATTRIBUTE] handle 47
[2019-02-21 10:47:18.938] [SUBSCRIBED]
[2019-02-21 10:47:21.047] x00) Version 1.14 Build 0
[2019-02-21 10:47:21.051] --- 18 messages dropped ---
[2019-02-21 10:47:21.052] [00:00:00.996,032] <inf> bt_ctlr_hci: l: 9, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.061] [00:00:01.041,198] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 62, 46, 95.
[2019-02-21 10:47:21.068] [00:00:01.091,339] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 52, 46, 95.
[2019-02-21 10:47:21.076] [00:00:01.141,357] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.085] [00:00:01.191,009] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 49, 46, 95.
[2019-02-21 10:47:21.093] [00:00:01.241,333] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.101] [00:00:01.290,954] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 50, 46, 95.
[2019-02-21 10:47:21.109] [00:00:01.341,308] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.116] [00:00:01.441,375] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 62, 46, 95.
[2019-02-21 10:47:21.124] [00:00:01.442,291] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.133] [00:00:01.490,875] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 50, 46, 95.
[2019-02-21 10:47:21.141] [00:00:01.540,496] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 44, 44, 95.
[2019-02-21 10:47:21.149] [00:00:01.941,345] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 44, 95.
[2019-02-21 10:47:21.157] [00:00:01.990,966] <inf> bt_ctlr_hci: l: 9, 8, 111; t: 50, 44, 95.
[2019-02-21 10:47:21.165] [00:00:02.090,454] <inf> bt_ctlr_hci: l: 9, 8, 111; t: 44, 44, 95.

To Reproduce
Checkout 8bd360f
Steps to reproduce the behavior:

Update samples/bluetooth/central_hr/prj.conf to append the following:

CONFIG_BT_CTLR_PROFILE_ISR=y

mkdir build; cd build
cmake -GNinja -DBOARD=nrf51_pca10028 ../../samples/bluetooth/central_hr
ninja
ninja flash
Open terminal, minicom -D /dev/ttyACM0, 115k200

central_hr asserts when scanning without any peripheral in vicinity.
Use iPhone 6S, LightBlue application and enable a virtual peripheral with heart rate service, then

Expected behavior
Console output as in v1.13.0, with acceptable ISR latencies of < 30 us.

***** Booting Zephyr OS zephyr-v1.13.0 *****
[2019-02-21 10:29:53.649] [bt] [INF] hci_vs_init: HW Platform: Nordic Semiconductor (0x0002)
[2019-02-21 10:29:53.655] [bt] [INF] hci_vs_init: HW Variant: nRF51x (0x0001)
[2019-02-21 10:29:53.659] [bt] [INF] hci_vs_init: Firmware: Standard Bluetooth controller (0x00) Version 1.13 Build 0
[2019-02-21 10:29:53.670] [bt] [INF] bt_dev_show_info: Identity: d2:06:aa:d5:f3:9a (random)
[2019-02-21 10:29:53.675] [bt] [INF] bt_dev_show_info: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[2019-02-21 10:29:53.683] [bt] [INF] bt_dev_show_info: LMP: version 5.0 (0x09) subver 0xffff
[2019-02-21 10:29:53.690] Bluetooth initialized
[2019-02-21 10:29:53.696] Scanning successfully started
[2019-02-21 10:29:54.304] [DEVICE]: 66:76:38:cf:3b:a6 (random), AD evt type 3, AD data len 24, RSSI -47
[2019-02-21 10:30:54.235] [DEVICE]: 67:9f:3b:76:38:3e (random), AD evt type 3, AD data len 24, RSSI -46
[2019-02-21 10:31:56.522] [DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 0, AD data len 21, RSSI -64
[2019-02-21 10:31:56.529] [AD]: 1 data_len 1
[2019-02-21 10:31:56.530] [AD]: 3 data_len 4
[2019-02-21 10:31:56.555] Connected: 61:bb:ca:fe:4b:d5 (random)
[2019-02-21 10:31:56.605] [bt] [INF] encode_control: l: 6, 6, 6; t: 61, 61, 61.
[2019-02-21 10:31:56.655] [bt] [INF] encode_control: l: 6, 6, 6; t: 109, 61, 109.
[2019-02-21 10:31:56.705] [bt] [INF] encode_control: l: 6, 6, 6; t: 60, 60, 109.
[2019-02-21 10:31:56.756] [bt] [INF] encode_control: l: 6, 6, 6; t: 97, 60, 109.
[2019-02-21 10:31:56.761] [bt] [INF] encode_control: l: 6, 6, 6; t: 55, 55, 109.
[2019-02-21 10:31:56.805] [bt] [INF] encode_control: l: 6, 6, 6; t: 59, 55, 109.
[2019-02-21 10:31:56.810] [bt] [INF] encode_control: l: 7, 6, 7; t: 110, 55, 110.
[2019-02-21 10:31:56.855] [bt] [INF] encode_control: l: 6, 6, 7; t: 59, 55, 110.
[2019-02-21 10:31:57.056] [ATTRIBUTE] handle 39
[2019-02-21 10:31:57.058] [bt] [INF] encode_control: l: 6, 6, 7; t: 61, 55, 110.
[2019-02-21 10:31:57.306] [bt] [INF] encode_control: l: 6, 6, 7; t: 51, 51, 110.
[2019-02-21 10:31:57.356] [ATTRIBUTE] handle 45
[2019-02-21 10:31:57.358] [bt] [INF] encode_control: l: 6, 6, 7; t: 53, 51, 110.
[2019-02-21 10:31:57.506] [ATTRIBUTE] handle 47
[2019-02-21 10:31:57.508] [SUBSCRIBED]
[2019-02-21 10:31:57.607] [bt] [INF] encode_control: l: 6, 6, 7; t: 66, 51, 110.
[2019-02-21 10:31:57.612] [bt] [INF] encode_control: l: 16, 6, 16; t: 40, 40, 110.
[2019-02-21 10:31:57.706] [bt] [INF] encode_control: l: 6, 6, 16; t: 66, 40, 110.
[2019-02-21 10:31:57.712] [bt] [INF] encode_control: l: 11, 6, 16; t: 41, 40, 110.
[2019-02-21 10:31:57.755] [bt] [INF] encode_control: l: 6, 6, 16; t: 47, 40, 110.
[2019-02-21 10:31:57.805] [bt] [INF] encode_control: l: 6, 6, 16; t: 51, 40, 110.
[2019-02-21 10:31:57.856] [bt] [INF] encode_control: l: 6, 6, 16; t: 48, 40, 110.
[2019-02-21 10:31:57.905] [bt] [INF] encode_control: l: 6, 6, 16; t: 51, 40, 110.
[2019-02-21 10:31:58.006] [bt] [INF] encode_control: l: 6, 6, 16; t: 48, 40, 110.
[2019-02-21 10:31:58.056] [bt] [INF] encode_control: l: 6, 6, 16; t: 51, 40, 110.
[2019-02-21 10:31:58.155] [bt] [INF] encode_control: l: 6, 6, 16; t: 46, 40, 110.
[2019-02-21 10:31:58.205] [bt] [INF] encode_control: l: 6, 6, 16; t: 45, 40, 110

Impact
Showstopper, Blocker for supporting Bluetooth on nRF51 Series and critical for nRF52 series due to bad ISR latencies.

Screenshots or console output
Refer to the ones added in the description of the bug.

Environment (please complete the following information):

OS: Linux (Archlinux distribution) and MacOS with HomeBrew
Toolchain: Native Archlinux distributed arm-none-eabi- toolchain, gcc version 8.2.0 (Arch repository)
Commit SHA or Version used: 8bd360f

Additional context
None.

The text was updated successfully, but these errors were encountered:

cvinayak · 2019-02-21T05:46:07Z

Attention: @andyross and @pabigot towards kernel scheduling/timeout and nrf_rtc_timer.c changes respectively, since v1.13.0.

cvinayak · 2019-02-21T10:37:26Z

@pabigot reverting nrf_rtc_timer.c changes to 37fbff6 reduces the latencies by ~50 us

cvinayak · 2019-02-21T10:39:09Z

@andyross and @pabigot, @anangl the nrf_rtc_timer.c was lockless in v1.13.0. Do you think it can be made lockless for v1.14.0 too?

pabigot · 2019-02-21T12:37:17Z

@pabigot reverting nrf_rtc_timer.c changes to 37fbff6 reduces the latencies by ~50 us

@cvinayak To clarify: you're saying that reverting my changes made in #12605 reduces latencies?

Odd. My own experiments on the ISR timing itself showed significant decreases in lock duration with that rework, especially on nRF51. It's not immediately obvious how the timer ISR could be at fault.

There are a lot more irq_lock() calls (hidden as spinlock uses; see #12537) throughout Zephyr now so it may be that the direct cause is something else that's affected by the duration of the clock maintenance code.

I'm not working Zephyr at the moment so will have to leave this to @pizi-nordic.

nashif · 2019-02-21T12:57:10Z

@cvinayak any chance you can add a testcase for this to be able to catch such regressions in the future?

carlescufi · 2019-02-21T16:12:18Z

@cvinayak are you sure about your commit? The one you reference is from @anangl (37fbff6) and doesn't seem to be related.

andyross · 2019-02-21T17:08:47Z

@cvinayak A timer ISR was never really lockless -- it has to call into the timeout system and out into arbitrary callbacks, all of which need synchronization. Someone should definitely check the bisection though -- I agree that the referenced commit doesn't seem particularly relevant.

andyross · 2019-02-21T17:22:19Z

Oh, and to be clear:

There are a lot more irq_lock() calls (hidden as spinlock uses; see #12537) throughout Zephyr now

There really are not, and I'm not sure why you'd think so. Synchronization in the spinlockification patch was almost entirely 1:1, and where it differed it was uniformly in the direction of reducing latency by breaking existing recursive irq_lock usage. If we've regressed it's a bug and we should fix it.

pabigot · 2019-02-21T17:40:06Z

There are a lot more irq_lock() calls (hidden as spinlock uses; see #12537) throughout Zephyr now

There really are not, and I'm not sure why you'd think so.

I was mixing two things, one being an unsupported assumption that since a lot of material was added/changed since 1.13.0, there would be more locks of any sort. This command:

for d in kernel drivers ; do 
  for h in v1.13.0 HEAD ; do  
    echo -n "$d $h " ; git grep -P 'k_spin_lock|irq_lock' ${h} ${d} | wc ;
   done ;
done

shows the rough number of lock operations has decreased from 106 to 87 in in kernel, and increased from 68 to 98 in drivers. That's consistent with your expectation that the total number has not increased since 1.13.0 (I think kernel is more representative than driver here), so thanks for calling that out.

But I also meant to indicate that if you're looking for an irq_lock you'll undercount them if you don't also look for spin_lock which didn't exist in 1.13.0. Whatever's causing this regression might be due to changes under a lock that's now spelled k_spin_lock().

andyross · 2019-02-21T17:41:35Z

I'm sorry, I misread @cvinayak bisection report. If reverting only the changes to nrf_rtc_timer.c between 37fbf fixes your latency problems, then that points to a pretty tight list of patches mostly from @pabigot (FWIW: these all look fine to me):

75f77db432 include: misc: util.h: Rename min/max to MIN/MAX
ac36886e62 drivers: nrf: timer: add inline qualifier where inlining is intended
084363a0dc drivers: timer: nrf: refactor for speed and correctness
71882ff8c4 drivers: timer: nrf: drop unnecessary counter mask
4b24e88fa4 drivers: timer: nrf: use irq_lock instead of spinlock

If you're reverting your whole tree to that, then I'd guess the more likely culprit is the spinlockification series. Note that most of it should be bisectable independently (though IIRC there's a header dependency that @carlescufi discovered that has to be manually fixed) so you can zero in on the specific subsystem at fault. That series "shouldn't" have changed latencies at all, it's just refactoring.

cvinayak · 2019-02-22T07:13:16Z

@andyross, I only reverted changes to nrf_rtc_timer.c and saw reduced latencies.

@pabigot below are two images of logic analyzer measurements of irq_locking durations (channel 0), this is by visual inspection and I may have missed bigger latency measurements (files attached for perusal).
Old nrf_rtc_timer.c, I could see locking upto ~16us

After the optimizations, latencies upto ~21 us

@andyross not sure why there is so many times CPU wake up (in 10 ms intervals, but not always):

Ideally, there should be no kernel/application code running when there is no data transmissions on Bluetooth. The measurement was taken, when my iPhone was sleeping, only BLE link maintained by the periodic empty packets at the connection intervals.

Next week, I will try to profile all interrupt disables by the kernel/arch, to get a better idea of latencies and CPU use for basically only maintaining empty Bluetooth connection.

pabigot · 2019-02-22T11:42:56Z

@cvinayak @andyross Some of the wakeups are explained in
#12542 (comment); in short even with default tickless the timeslicing infrastructure is default enabled. Which is a mess because drivers like I2C are neither re-entrant nor threadsafe, so a timeslice that suspends a thread with a transfer in progress could introduce some very difficult to find misbehavior. A tickless system by default should have run-to-completion threads. IMO.

Without knowing exactly what channel 0 measures I can't explain how you see 21 us lock durations, but I never saw anything greater than 2.88 us on an nRF51, nor did I see anything greater than 11 us before my changes. Information about timing related to my changes to nrf_rtc_timer is in #12605.

andyross · 2019-02-22T18:04:47Z

even with default tickless the timeslicing infrastructure is default enabled

Timeslicing has always been "default y" in kconfig. But it only affects threads that are at preemptible priorities, which aren't the default. It did indeed get reworked along with the rest of the timing system during the churn last fall, but it's almost entirely orthogonal to tickless.

I'll admit I remain confused, though. There really doesn't seem to be anything in that patch series that should impact latencies. The locking strategy is basically the same.

andrewboie · 2019-02-22T18:52:16Z

Which is a mess because drivers like I2C are neither re-entrant nor threadsafe, so a timeslice that suspends a thread with a transfer in progress could introduce some very difficult to find misbehavior.

If a driver isn't reentrant or thread-safe, that is a serious bug in the driver and not a problem in the kernel...what I2C drivers are like this?

rljordan-zz · 2019-03-07T21:31:14Z

@SavinayDharmappa Please take a look at this.

Per zephyrproject-rtos#13610, recent changes to this driver seem to have introduced unexpected latency regressions. This patch effectively reverts these patches which changed the meat of the driver: ac36886 drivers: nrf: timer: add inline qualifier where inlining is intended 084363a drivers: timer: nrf: refactor for speed and correctness 71882ff drivers: timer: nrf: drop unnecessary counter mask 4b24e88 drivers: timer: nrf: use irq_lock instead of spinlock While backporting these seemingly unrelated hygiene patches: 7cbdb6c drivers/timer: Restore non-tickless tick count behavior d30c9ae drivers: nrf_power_clock: Migrate to DTS. 75f77db include: misc: util.h: Rename min/max to MIN/MAX Signed-off-by: Andy Ross <[email protected]>

andyross · 2019-03-07T23:01:11Z

Let's at least start with a reversion per the comments above, as that seems lowest risk. Can you guys try the linked patch to see if that resolve the issue as @cvinayak reported? If so, then that seems simple. If not we'll have to try something else.

Unfortunately I have neither a nrf51 board or iPhone (though I can probably scrounge up the former) so local reproduction is going to take some work.

pabigot · 2019-03-07T23:27:05Z

FWIW I have a very hard time believing that the irq_lock around that section of nrf_rtc_timer could be held for 105 us. That's an extremely long time; the longest operation in that sequence is an integer division that's less than 5 us, and the longest I saw the lock held was less than 20 us. So I don't think reverting just the changes to that file will resolve the issue completely.

I would believe there are other irq_lock occurrences that could be that long.

andyross · 2019-03-07T23:32:42Z

I don't disagree, but at this point we need to take what we can get. @cvinayak can confirm, if it works we can merge and then fix it at our leisure.

andyross · 2019-03-11T16:06:35Z

Recent discussion in PR #14176 seems to indicate that it addresses this issue. For anyone here who might have missed that, please reveiw.

Per #13610, recent changes to this driver seem to have introduced unexpected latency regressions. This patch effectively reverts these patches which changed the meat of the driver: ac36886 drivers: nrf: timer: add inline qualifier where inlining is intended 084363a drivers: timer: nrf: refactor for speed and correctness 71882ff drivers: timer: nrf: drop unnecessary counter mask 4b24e88 drivers: timer: nrf: use irq_lock instead of spinlock While backporting these seemingly unrelated hygiene patches: 7cbdb6c drivers/timer: Restore non-tickless tick count behavior d30c9ae drivers: nrf_power_clock: Migrate to DTS. 75f77db include: misc: util.h: Rename min/max to MIN/MAX Signed-off-by: Andy Ross <[email protected]>

andyross · 2019-03-15T16:15:46Z

Think we can close this one. I opened #14577 to track the need for performance work in this ISR. Please add notes there as needed.

The system clock on Nordic is based on 32 KiHz timer. Scheduling ticks requires that deadlines be specified with a timer counter that aligns to a system clock. With the Zephyr default 100 clocks-per-sec configuration this results in 100 ticks every 32700 ticks of the cycle timer. This reveals two problems: * The uptime clock misrepresents elapsed time because it runs 0.208% (68/32768) faster than the best available clock; * Calculation of timer counter compare values often requires an integer division and multiply operation to produce a value that's a multiple of clock-ticks-per-second. Integer division on the Cortex-M1 nRF51 is done in software with a (value-dependent) algorithm with a non-constant runtime that can be significant. This can produce missed Bluetooth deadlines as discussed in issue zephyrproject-rtos#14577 and others. By changing the default divisor to one that evenly divides the 2^15 clock rate the time interrupts are disabled to manage timers is significantly reduced. Note that the central_hr configuration described in issue zephyrproject-rtos#13610 does not distinguish latency due to timer management from other irq_block/spinlock regions, and the maximum observed latency will still exceed the nominal 10 us allowed maximum. However this does occur much less frequently than changing the timer deadline which can happen multiple times per tick. Signed-off-by: Peter A. Bigot <[email protected]>

The default system clock on all Nordic devices is based on a 32 KiHz (2^15 Hz) timer. Scheduling ticks requires that deadlines be specified with a timer counter that aligns to a system clock. With the Zephyr default 100 clocks-per-sec configuration this results in 100 ticks every 32700 ticks of the cycle timer. This reveals two problems: * The uptime clock misrepresents elapsed time because it runs 0.208% (68/32768) faster than the best available clock; * Calculation of timer counter compare values often requires an integer division and multiply operation to produce a value that's a multiple of clock-ticks-per-second. Integer division on the Cortex-M1 nRF51 is done in software with a (value-dependent) algorithm with a non-constant runtime that can be significant. This can produce missed Bluetooth deadlines as discussed in upstream zephyrproject-rtos#14577 and others. By changing the default divisor to one that evenly divides the 2^15 clock rate the time interrupts are disabled to manage timers is significantly reduced, as is the error between uptime and real time. Do this at the top level, moving SYS_CLOCK_HW_CYCLES_PER_SEC there as well since the two parameters are related. Note that the central_hr configuration described in upstream zephyrproject-rtos#13610 does not distinguish latency due to timer management from other irq_block/spinlock regions, and the maximum observed latency will still exceed the nominal 10 us allowed maximum. However this does occur much less frequently than changing the timer deadline which can happen multiple times per tick. Signed-off-by: Peter A. Bigot <[email protected]>

The default system clock on all Nordic devices is based on a 32 KiHz (2^15 Hz) timer. Scheduling ticks requires that deadlines be specified with a timer counter that aligns to a system clock. With the Zephyr default 100 clocks-per-sec configuration this results in 100 ticks every 32700 ticks of the cycle timer. This reveals two problems: * The uptime clock misrepresents elapsed time because it runs 0.208% (68/32768) faster than the best available clock; * Calculation of timer counter compare values often requires an integer division and multiply operation to produce a value that's a multiple of clock-ticks-per-second. Integer division on the Cortex-M1 nRF51 is done in software with a (value-dependent) algorithm with a non-constant runtime that can be significant. This can produce missed Bluetooth deadlines as discussed in upstream #14577 and others. By changing the default divisor to one that evenly divides the 2^15 clock rate the time interrupts are disabled to manage timers is significantly reduced, as is the error between uptime and real time. Do this at the top level, moving SYS_CLOCK_HW_CYCLES_PER_SEC there as well since the two parameters are related. Note that the central_hr configuration described in upstream #13610 does not distinguish latency due to timer management from other irq_block/spinlock regions, and the maximum observed latency will still exceed the nominal 10 us allowed maximum. However this does occur much less frequently than changing the timer deadline which can happen multiple times per tick. Signed-off-by: Peter A. Bigot <[email protected]>

lopsided98 · 2019-06-30T21:50:07Z

I am running into this issue on Zephyr master (d045bd7) in my own application on the nrf51_ble400 board. This application is set up as a BLE peripheral and has quite a few features enabled, including BLE security and and custom sensor driver than uses a GPIO interrupt. I have not yet gotten a chance to try to create a minimal example that reproduces the problem. The crash only occurs when a device is connected.

This is the log just before the crash, with CONFIG_BT_CTLR_PROFILE_ISR=y. The latencies do not seem unreasonable compared to #13610 (comment).

[00:07:16.023,956] <inf> bt_ctlr_hci: l: 31, 7, 55; t: 74, 58, 111.
[00:07:16.033,355] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:20.070,220] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 80, 58, 111.
[00:07:20.083,526] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:20.119,049] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 103, 58, 111.
[00:07:20.133,361] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:23.283,477] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 95, 58, 111.
[00:07:23.287,780] <inf> bt_ctlr_hci: l: 8, 7, 55; t: 74, 58, 111.
[00:07:24.994,018] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 61, 58, 111.
[00:07:25.033,386] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:28.333,465] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 95, 58, 111.
[00:07:28.357,818] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:41.812,988] <inf> bt_ctlr_hci: l: 33, 7, 55; t: 74, 58, 111.
[00:07:41.833,343] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:45.078,948] <err> bt_ctlr_llsw_ctrl: assert: '!radio_is_ready()' failed
***** HARD FAULT *****
***** Hardware exception *****
Current thread ID = 0x20000010
Faulting instruction address = 0x12c68
Fatal fault in ISR! Spinning...

cvinayak · 2019-07-01T00:46:30Z

@lopsided98 do you have a master commit which was fine? could you check if v1.14-branch works fine for your application? I would try to bisect to find the commit beyond which the controller is seeing increased latencies. Is this reproducible in the vanilla samples/bluetooth/peripheral sample?

Thinking aloud, assuming the assert happened in an encrypted connection, the Radio ISR CPU time too seems to have increased, maybe changes in controller code too may be contributing to the increased overall CPU use. I should do some GPIO debug pin profiling to confirm the new numbers seem above.

cvinayak added the bug The issue is a bug, or the PR is fixing a bug label Feb 21, 2019

cvinayak added this to the v1.14.0 milestone Feb 21, 2019

cvinayak assigned nashif, andyross, pabigot, cvinayak, carlescufi and anangl Feb 21, 2019

carlescufi added the priority: high High impact/importance bug label Feb 21, 2019

carlescufi assigned pizi-nordic and andrewboie and unassigned pabigot and anangl Feb 21, 2019

nashif pinned this issue Feb 21, 2019

nashif unpinned this issue Feb 21, 2019

nashif removed their assignment Feb 22, 2019

rljordan-zz assigned SavinayDharmappa Mar 7, 2019

andyross mentioned this issue Mar 7, 2019

drivers/timer/nrf_rtc_timer: Revert recent changes #14176

Merged

ManojSubbarao unassigned SavinayDharmappa Mar 11, 2019

andyross assigned andyross and unassigned andyross, cvinayak, carlescufi and pizi-nordic Mar 12, 2019

andyross mentioned this issue Mar 15, 2019

Address latency/performance in nRF51 timer ISR #14577

Closed

andyross closed this as completed Mar 15, 2019

pabigot mentioned this issue Apr 27, 2019

re-resolve broken k_busy_wait on Nordic platforms #15707

Merged

kernel: Non-deterministic and very high ISR latencies #13610

kernel: Non-deterministic and very high ISR latencies #13610

Comments

cvinayak commented Feb 21, 2019

cvinayak commented Feb 21, 2019

Uh oh!

cvinayak commented Feb 21, 2019

Uh oh!

cvinayak commented Feb 21, 2019 • edited by carlescufi Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pabigot commented Feb 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nashif commented Feb 21, 2019

Uh oh!

carlescufi commented Feb 21, 2019

Uh oh!

andyross commented Feb 21, 2019

Uh oh!

andyross commented Feb 21, 2019

Uh oh!

pabigot commented Feb 21, 2019

Uh oh!

andyross commented Feb 21, 2019

Uh oh!

cvinayak commented Feb 22, 2019

Uh oh!

pabigot commented Feb 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andyross commented Feb 22, 2019

Uh oh!

andrewboie commented Feb 22, 2019

Uh oh!

rljordan-zz commented Mar 7, 2019

Uh oh!

andyross commented Mar 7, 2019

Uh oh!

pabigot commented Mar 7, 2019

Uh oh!

andyross commented Mar 7, 2019

Uh oh!

andyross commented Mar 11, 2019

Uh oh!

andyross commented Mar 15, 2019

Uh oh!

lopsided98 commented Jun 30, 2019

Uh oh!

cvinayak commented Jul 1, 2019

Uh oh!

cvinayak commented Feb 21, 2019 •

edited by carlescufi

Loading

pabigot commented Feb 21, 2019 •

edited

Loading

pabigot commented Feb 22, 2019 •

edited

Loading