Skip to content

kernel: Non-deterministic and very high ISR latencies #13610

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cvinayak opened this issue Feb 21, 2019 · 28 comments
Closed

kernel: Non-deterministic and very high ISR latencies #13610

cvinayak opened this issue Feb 21, 2019 · 28 comments
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Milestone

Comments

@cvinayak
Copy link
Collaborator

Describe the bug
Non-deterministic and very high ISR latencies. Bluetooth controller asserts when running samples/bluetooth/central_hr on nrf51_pca10028 board.

[2019-02-21 10:54:43.408] ***** Booting Zephyr OS v1.14.0-rc1-268-g8bd360fa14 *****
[2019-02-21 10:54:43.426] Bluetooth initialized
[2019-02-21 10:54:43.428] [00:00Scanning successfully started
[2019-02-21 10:54:43.434] :00.010,650] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[2019-02-21 10:54:43.443] [00:00:00.010,681] <inf> bt_hci_core: HW Variant: nRF51x (0x0001)
[2019-02-21 10:54:43.453] [00:00:00.010,742] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0x00) Version 1.14 Build 0
[2019-02-21 10:54:43.467] [00:00:00.013,031] <inf> bt_h[DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 0, AD data len 14, RSSI -46
[2019-02-21 10:54:43.496] [AD]: 1 data_len 1
[2019-02-21 10:54:43.499] [AD]: 255 data_len 9
[2019-02-21 10:54:44.029] [DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 4, AD data len 0, RSSI -46
[2019-02-21 10:54:46.277] ci_core: Identity: d2:06:aa:d5:f3:9a (random)
[2019-02-21 10:54:46.281] [00:00:00.013,092] <inf> bt_hci_core: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[2019-02-21 10:54:46.292] [00:00:00.013,122] <inf> bt_hci_core: LMP: version 5.0 (0x09) subver 0xffff
[2019-02-21 10:54:47.015] [00:00:03.627,593] <err> bt_ctlr_llsw_ctrl: assert: '!radio_is_ready()' failed
[2019-02-21 10:54:47.023] ***** HARD FAULT *****
[2019-02-21 10:54:47.025] ***** Hardware exception *****
[2019-02-21 10:54:47.028] Current thread ID = 0x200015dc
[2019-02-21 10:54:47.031] Faulting instruction address = 0x12cbe
[2019-02-21 10:54:47.034] Fatal fault in ISR! Spinning...

If a connection establishes with a peripheral in vicinity, the Max. ISR latencies measured is very high, over 100 us.

Legend for the log message:

<inf> bt_ctlr_hci: l: 9, 8, 111; t: 46, 46, 95.

l: Radio peripheral ISR latency Current value of 9us, Minimum of 8us, Maximum of 111us.
t: Radio peripheral ISR CPU usage Current value of 46, Minimum of 46us, Maximum of 95us.

[2019-02-21 10:47:17.917] ***** Booting Zephyr OS v1.14.0-rc1-268-g8bd360fa14 *****
[2019-02-21 10:47:17.933] Bluetooth �***** Booting Zephyr OS v1.14.0-rc1-268-g8bd360fa14 *****
[2019-02-21 10:47:17.983] Bluetooth initialized
[2019-02-21 10:47:17.986] [00:00:0Scanning successfully started
[2019-02-21 10:47:17.991] 0.010,650] <inf> bt_hci_core: HW Platform: Nordic Semiconductor (0x0002)
[2019-02-21 10:47:18.001] [00:00:00.010,681] <inf> bt_hci_[DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 0, AD data len 21, RSSI -49
[2019-02-21 10:47:18.014] [AD]: 1 data_len 1
[2019-02-21 10:47:18.016] [AD]: 3 data_len 4
[2019-02-21 10:47:18.023] core: HW Variant: nRF51x (0x0001)
[2019-02-21 10:47:18.222] [00:00:00.010,711] <inf> bt_hci_core: Firmware: Standard Bluetooth controller (0Connected: 61:bb:ca:fe:4b:d5 (random)
[2019-02-21 10:47:18.488] [ATTRIBUTE] handle 39
[2019-02-21 10:47:18.838] [ATTRIBUTE] handle 45
[2019-02-21 10:47:18.938] [ATTRIBUTE] handle 47
[2019-02-21 10:47:18.938] [SUBSCRIBED]
[2019-02-21 10:47:21.047] x00) Version 1.14 Build 0
[2019-02-21 10:47:21.051] --- 18 messages dropped ---
[2019-02-21 10:47:21.052] [00:00:00.996,032] <inf> bt_ctlr_hci: l: 9, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.061] [00:00:01.041,198] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 62, 46, 95.
[2019-02-21 10:47:21.068] [00:00:01.091,339] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 52, 46, 95.
[2019-02-21 10:47:21.076] [00:00:01.141,357] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.085] [00:00:01.191,009] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 49, 46, 95.
[2019-02-21 10:47:21.093] [00:00:01.241,333] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.101] [00:00:01.290,954] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 50, 46, 95.
[2019-02-21 10:47:21.109] [00:00:01.341,308] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.116] [00:00:01.441,375] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 62, 46, 95.
[2019-02-21 10:47:21.124] [00:00:01.442,291] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 46, 95.
[2019-02-21 10:47:21.133] [00:00:01.490,875] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 50, 46, 95.
[2019-02-21 10:47:21.141] [00:00:01.540,496] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 44, 44, 95.
[2019-02-21 10:47:21.149] [00:00:01.941,345] <inf> bt_ctlr_hci: l: 8, 8, 111; t: 46, 44, 95.
[2019-02-21 10:47:21.157] [00:00:01.990,966] <inf> bt_ctlr_hci: l: 9, 8, 111; t: 50, 44, 95.
[2019-02-21 10:47:21.165] [00:00:02.090,454] <inf> bt_ctlr_hci: l: 9, 8, 111; t: 44, 44, 95.

To Reproduce
Checkout 8bd360f
Steps to reproduce the behavior:

  1. Update samples/bluetooth/central_hr/prj.conf to append the following:
CONFIG_BT_CTLR_PROFILE_ISR=y
  1. mkdir build; cd build
  2. cmake -GNinja -DBOARD=nrf51_pca10028 ../../samples/bluetooth/central_hr
  3. ninja
  4. ninja flash
  5. Open terminal, minicom -D /dev/ttyACM0, 115k200

central_hr asserts when scanning without any peripheral in vicinity.
Use iPhone 6S, LightBlue application and enable a virtual peripheral with heart rate service, then

Expected behavior
Console output as in v1.13.0, with acceptable ISR latencies of < 30 us.

***** Booting Zephyr OS zephyr-v1.13.0 *****
[2019-02-21 10:29:53.649] [bt] [INF] hci_vs_init: HW Platform: Nordic Semiconductor (0x0002)
[2019-02-21 10:29:53.655] [bt] [INF] hci_vs_init: HW Variant: nRF51x (0x0001)
[2019-02-21 10:29:53.659] [bt] [INF] hci_vs_init: Firmware: Standard Bluetooth controller (0x00) Version 1.13 Build 0
[2019-02-21 10:29:53.670] [bt] [INF] bt_dev_show_info: Identity: d2:06:aa:d5:f3:9a (random)
[2019-02-21 10:29:53.675] [bt] [INF] bt_dev_show_info: HCI: version 5.0 (0x09) revision 0x0000, manufacturer 0x05f1
[2019-02-21 10:29:53.683] [bt] [INF] bt_dev_show_info: LMP: version 5.0 (0x09) subver 0xffff
[2019-02-21 10:29:53.690] Bluetooth initialized
[2019-02-21 10:29:53.696] Scanning successfully started
[2019-02-21 10:29:54.304] [DEVICE]: 66:76:38:cf:3b:a6 (random), AD evt type 3, AD data len 24, RSSI -47
[2019-02-21 10:30:54.235] [DEVICE]: 67:9f:3b:76:38:3e (random), AD evt type 3, AD data len 24, RSSI -46
[2019-02-21 10:31:56.522] [DEVICE]: 61:bb:ca:fe:4b:d5 (random), AD evt type 0, AD data len 21, RSSI -64
[2019-02-21 10:31:56.529] [AD]: 1 data_len 1
[2019-02-21 10:31:56.530] [AD]: 3 data_len 4
[2019-02-21 10:31:56.555] Connected: 61:bb:ca:fe:4b:d5 (random)
[2019-02-21 10:31:56.605] [bt] [INF] encode_control: l: 6, 6, 6; t: 61, 61, 61.
[2019-02-21 10:31:56.655] [bt] [INF] encode_control: l: 6, 6, 6; t: 109, 61, 109.
[2019-02-21 10:31:56.705] [bt] [INF] encode_control: l: 6, 6, 6; t: 60, 60, 109.
[2019-02-21 10:31:56.756] [bt] [INF] encode_control: l: 6, 6, 6; t: 97, 60, 109.
[2019-02-21 10:31:56.761] [bt] [INF] encode_control: l: 6, 6, 6; t: 55, 55, 109.
[2019-02-21 10:31:56.805] [bt] [INF] encode_control: l: 6, 6, 6; t: 59, 55, 109.
[2019-02-21 10:31:56.810] [bt] [INF] encode_control: l: 7, 6, 7; t: 110, 55, 110.
[2019-02-21 10:31:56.855] [bt] [INF] encode_control: l: 6, 6, 7; t: 59, 55, 110.
[2019-02-21 10:31:57.056] [ATTRIBUTE] handle 39
[2019-02-21 10:31:57.058] [bt] [INF] encode_control: l: 6, 6, 7; t: 61, 55, 110.
[2019-02-21 10:31:57.306] [bt] [INF] encode_control: l: 6, 6, 7; t: 51, 51, 110.
[2019-02-21 10:31:57.356] [ATTRIBUTE] handle 45
[2019-02-21 10:31:57.358] [bt] [INF] encode_control: l: 6, 6, 7; t: 53, 51, 110.
[2019-02-21 10:31:57.506] [ATTRIBUTE] handle 47
[2019-02-21 10:31:57.508] [SUBSCRIBED]
[2019-02-21 10:31:57.607] [bt] [INF] encode_control: l: 6, 6, 7; t: 66, 51, 110.
[2019-02-21 10:31:57.612] [bt] [INF] encode_control: l: 16, 6, 16; t: 40, 40, 110.
[2019-02-21 10:31:57.706] [bt] [INF] encode_control: l: 6, 6, 16; t: 66, 40, 110.
[2019-02-21 10:31:57.712] [bt] [INF] encode_control: l: 11, 6, 16; t: 41, 40, 110.
[2019-02-21 10:31:57.755] [bt] [INF] encode_control: l: 6, 6, 16; t: 47, 40, 110.
[2019-02-21 10:31:57.805] [bt] [INF] encode_control: l: 6, 6, 16; t: 51, 40, 110.
[2019-02-21 10:31:57.856] [bt] [INF] encode_control: l: 6, 6, 16; t: 48, 40, 110.
[2019-02-21 10:31:57.905] [bt] [INF] encode_control: l: 6, 6, 16; t: 51, 40, 110.
[2019-02-21 10:31:58.006] [bt] [INF] encode_control: l: 6, 6, 16; t: 48, 40, 110.
[2019-02-21 10:31:58.056] [bt] [INF] encode_control: l: 6, 6, 16; t: 51, 40, 110.
[2019-02-21 10:31:58.155] [bt] [INF] encode_control: l: 6, 6, 16; t: 46, 40, 110.
[2019-02-21 10:31:58.205] [bt] [INF] encode_control: l: 6, 6, 16; t: 45, 40, 110

Impact
Showstopper, Blocker for supporting Bluetooth on nRF51 Series and critical for nRF52 series due to bad ISR latencies.

Screenshots or console output
Refer to the ones added in the description of the bug.

Environment (please complete the following information):

  • OS: Linux (Archlinux distribution) and MacOS with HomeBrew
  • Toolchain: Native Archlinux distributed arm-none-eabi- toolchain, gcc version 8.2.0 (Arch repository)
  • Commit SHA or Version used: 8bd360f

Additional context
None.

@cvinayak cvinayak added the bug The issue is a bug, or the PR is fixing a bug label Feb 21, 2019
@cvinayak cvinayak added this to the v1.14.0 milestone Feb 21, 2019
@cvinayak
Copy link
Collaborator Author

Attention: @andyross and @pabigot towards kernel scheduling/timeout and nrf_rtc_timer.c changes respectively, since v1.13.0.

@carlescufi carlescufi added the priority: high High impact/importance bug label Feb 21, 2019
@carlescufi carlescufi assigned pizi-nordic and andrewboie and unassigned pabigot and anangl Feb 21, 2019
@cvinayak
Copy link
Collaborator Author

@pabigot reverting nrf_rtc_timer.c changes to 37fbff6 reduces the latencies by ~50 us

image

@cvinayak
Copy link
Collaborator Author

cvinayak commented Feb 21, 2019

@andyross and @pabigot, @anangl the nrf_rtc_timer.c was lockless in v1.13.0. Do you think it can be made lockless for v1.14.0 too?

@pabigot
Copy link
Collaborator

pabigot commented Feb 21, 2019

@pabigot reverting nrf_rtc_timer.c changes to 37fbff6 reduces the latencies by ~50 us

@cvinayak To clarify: you're saying that reverting my changes made in #12605 reduces latencies?

Odd. My own experiments on the ISR timing itself showed significant decreases in lock duration with that rework, especially on nRF51. It's not immediately obvious how the timer ISR could be at fault.

There are a lot more irq_lock() calls (hidden as spinlock uses; see #12537) throughout Zephyr now so it may be that the direct cause is something else that's affected by the duration of the clock maintenance code.

I'm not working Zephyr at the moment so will have to leave this to @pizi-nordic.

@nashif
Copy link
Member

nashif commented Feb 21, 2019

@cvinayak any chance you can add a testcase for this to be able to catch such regressions in the future?

@carlescufi
Copy link
Member

@cvinayak are you sure about your commit? The one you reference is from @anangl (37fbff6) and doesn't seem to be related.

@nashif nashif pinned this issue Feb 21, 2019
@nashif nashif unpinned this issue Feb 21, 2019
@andyross
Copy link
Collaborator

@cvinayak A timer ISR was never really lockless -- it has to call into the timeout system and out into arbitrary callbacks, all of which need synchronization. Someone should definitely check the bisection though -- I agree that the referenced commit doesn't seem particularly relevant.

@andyross
Copy link
Collaborator

Oh, and to be clear:

There are a lot more irq_lock() calls (hidden as spinlock uses; see #12537) throughout Zephyr now

There really are not, and I'm not sure why you'd think so. Synchronization in the spinlockification patch was almost entirely 1:1, and where it differed it was uniformly in the direction of reducing latency by breaking existing recursive irq_lock usage. If we've regressed it's a bug and we should fix it.

@pabigot
Copy link
Collaborator

pabigot commented Feb 21, 2019

There are a lot more irq_lock() calls (hidden as spinlock uses; see #12537) throughout Zephyr now

There really are not, and I'm not sure why you'd think so.

I was mixing two things, one being an unsupported assumption that since a lot of material was added/changed since 1.13.0, there would be more locks of any sort. This command:

for d in kernel drivers ; do 
  for h in v1.13.0 HEAD ; do  
    echo -n "$d $h " ; git grep -P 'k_spin_lock|irq_lock' ${h} ${d} | wc ;
   done ;
done

shows the rough number of lock operations has decreased from 106 to 87 in in kernel, and increased from 68 to 98 in drivers. That's consistent with your expectation that the total number has not increased since 1.13.0 (I think kernel is more representative than driver here), so thanks for calling that out.

But I also meant to indicate that if you're looking for an irq_lock you'll undercount them if you don't also look for spin_lock which didn't exist in 1.13.0. Whatever's causing this regression might be due to changes under a lock that's now spelled k_spin_lock().

@andyross
Copy link
Collaborator

I'm sorry, I misread @cvinayak bisection report. If reverting only the changes to nrf_rtc_timer.c between 37fbf fixes your latency problems, then that points to a pretty tight list of patches mostly from @pabigot (FWIW: these all look fine to me):

75f77db432 include: misc: util.h: Rename min/max to MIN/MAX
ac36886e62 drivers: nrf: timer: add inline qualifier where inlining is intended
084363a0dc drivers: timer: nrf: refactor for speed and correctness
71882ff8c4 drivers: timer: nrf: drop unnecessary counter mask
4b24e88fa4 drivers: timer: nrf: use irq_lock instead of spinlock

If you're reverting your whole tree to that, then I'd guess the more likely culprit is the spinlockification series. Note that most of it should be bisectable independently (though IIRC there's a header dependency that @carlescufi discovered that has to be manually fixed) so you can zero in on the specific subsystem at fault. That series "shouldn't" have changed latencies at all, it's just refactoring.

@cvinayak
Copy link
Collaborator Author

@andyross, I only reverted changes to nrf_rtc_timer.c and saw reduced latencies.

@pabigot below are two images of logic analyzer measurements of irq_locking durations (channel 0), this is by visual inspection and I may have missed bigger latency measurements (files attached for perusal).
Old nrf_rtc_timer.c, I could see locking upto ~16us
image

After the optimizations, latencies upto ~21 us
image

@andyross not sure why there is so many times CPU wake up (in 10 ms intervals, but not always):
image

Ideally, there should be no kernel/application code running when there is no data transmissions on Bluetooth. The measurement was taken, when my iPhone was sleeping, only BLE link maintained by the periodic empty packets at the connection intervals.

Next week, I will try to profile all interrupt disables by the kernel/arch, to get a better idea of latencies and CPU use for basically only maintaining empty Bluetooth connection.

@pabigot
Copy link
Collaborator

pabigot commented Feb 22, 2019

@cvinayak @andyross Some of the wakeups are explained in
#12542 (comment); in short even with default tickless the timeslicing infrastructure is default enabled. Which is a mess because drivers like I2C are neither re-entrant nor threadsafe, so a timeslice that suspends a thread with a transfer in progress could introduce some very difficult to find misbehavior. A tickless system by default should have run-to-completion threads. IMO.

Without knowing exactly what channel 0 measures I can't explain how you see 21 us lock durations, but I never saw anything greater than 2.88 us on an nRF51, nor did I see anything greater than 11 us before my changes. Information about timing related to my changes to nrf_rtc_timer is in #12605.

@nashif nashif removed their assignment Feb 22, 2019
@andyross
Copy link
Collaborator

even with default tickless the timeslicing infrastructure is default enabled

Timeslicing has always been "default y" in kconfig. But it only affects threads that are at preemptible priorities, which aren't the default. It did indeed get reworked along with the rest of the timing system during the churn last fall, but it's almost entirely orthogonal to tickless.

I'll admit I remain confused, though. There really doesn't seem to be anything in that patch series that should impact latencies. The locking strategy is basically the same.

@andrewboie
Copy link
Contributor

Which is a mess because drivers like I2C are neither re-entrant nor threadsafe, so a timeslice that suspends a thread with a transfer in progress could introduce some very difficult to find misbehavior.

If a driver isn't reentrant or thread-safe, that is a serious bug in the driver and not a problem in the kernel...what I2C drivers are like this?

@rljordan-zz
Copy link

@SavinayDharmappa Please take a look at this.

andyross pushed a commit to andyross/zephyr that referenced this issue Mar 7, 2019
Per zephyrproject-rtos#13610, recent changes to this driver seem to have introduced
unexpected latency regressions.  This patch effectively reverts these
patches which changed the meat of the driver:

ac36886 drivers: nrf: timer: add inline qualifier where
           inlining is intended
084363a drivers: timer: nrf: refactor for speed and correctness
71882ff drivers: timer: nrf: drop unnecessary counter mask
4b24e88 drivers: timer: nrf: use irq_lock instead of spinlock

While backporting these seemingly unrelated hygiene patches:

7cbdb6c drivers/timer: Restore non-tickless tick count behavior
d30c9ae drivers: nrf_power_clock: Migrate to DTS.
75f77db include: misc: util.h: Rename min/max to MIN/MAX

Signed-off-by: Andy Ross <[email protected]>
@andyross
Copy link
Collaborator

andyross commented Mar 7, 2019

Let's at least start with a reversion per the comments above, as that seems lowest risk. Can you guys try the linked patch to see if that resolve the issue as @cvinayak reported? If so, then that seems simple. If not we'll have to try something else.

Unfortunately I have neither a nrf51 board or iPhone (though I can probably scrounge up the former) so local reproduction is going to take some work.

@pabigot
Copy link
Collaborator

pabigot commented Mar 7, 2019

FWIW I have a very hard time believing that the irq_lock around that section of nrf_rtc_timer could be held for 105 us. That's an extremely long time; the longest operation in that sequence is an integer division that's less than 5 us, and the longest I saw the lock held was less than 20 us. So I don't think reverting just the changes to that file will resolve the issue completely.

I would believe there are other irq_lock occurrences that could be that long.

@andyross
Copy link
Collaborator

andyross commented Mar 7, 2019

I don't disagree, but at this point we need to take what we can get. @cvinayak can confirm, if it works we can merge and then fix it at our leisure.

@andyross
Copy link
Collaborator

Recent discussion in PR #14176 seems to indicate that it addresses this issue. For anyone here who might have missed that, please reveiw.

galak pushed a commit that referenced this issue Mar 15, 2019
Per #13610, recent changes to this driver seem to have introduced
unexpected latency regressions.  This patch effectively reverts these
patches which changed the meat of the driver:

ac36886 drivers: nrf: timer: add inline qualifier where
           inlining is intended
084363a drivers: timer: nrf: refactor for speed and correctness
71882ff drivers: timer: nrf: drop unnecessary counter mask
4b24e88 drivers: timer: nrf: use irq_lock instead of spinlock

While backporting these seemingly unrelated hygiene patches:

7cbdb6c drivers/timer: Restore non-tickless tick count behavior
d30c9ae drivers: nrf_power_clock: Migrate to DTS.
75f77db include: misc: util.h: Rename min/max to MIN/MAX

Signed-off-by: Andy Ross <[email protected]>
@andyross
Copy link
Collaborator

Think we can close this one. I opened #14577 to track the need for performance work in this ISR. Please add notes there as needed.

pabigot added a commit to pabigot/zephyr that referenced this issue Apr 22, 2019
The system clock on Nordic is based on 32 KiHz timer.  Scheduling ticks
requires that deadlines be specified with a timer counter that aligns to
a system clock.  With the Zephyr default 100 clocks-per-sec
configuration this results in 100 ticks every 32700 ticks of the cycle
timer.  This reveals two problems:

* The uptime clock misrepresents elapsed time because it runs 0.208%
  (68/32768) faster than the best available clock;

* Calculation of timer counter compare values often requires an integer
  division and multiply operation to produce a value that's a multiple
  of clock-ticks-per-second.

Integer division on the Cortex-M1 nRF51 is done in software with a
(value-dependent) algorithm with a non-constant runtime that can be
significant.  This can produce missed Bluetooth deadlines as discussed
in issue zephyrproject-rtos#14577 and others.

By changing the default divisor to one that evenly divides the 2^15
clock rate the time interrupts are disabled to manage timers is
significantly reduced.

Note that the central_hr configuration described in issue zephyrproject-rtos#13610 does
not distinguish latency due to timer management from other
irq_block/spinlock regions, and the maximum observed latency will still
exceed the nominal 10 us allowed maximum.  However this does occur
much less frequently than changing the timer deadline which can happen
multiple times per tick.

Signed-off-by: Peter A. Bigot <[email protected]>
pabigot added a commit to pabigot/zephyr that referenced this issue May 7, 2019
The default system clock on all Nordic devices is based on a 32 KiHz
(2^15 Hz) timer.  Scheduling ticks requires that deadlines be specified
with a timer counter that aligns to a system clock.  With the Zephyr
default 100 clocks-per-sec configuration this results in 100 ticks every
32700 ticks of the cycle timer.  This reveals two problems:

* The uptime clock misrepresents elapsed time because it runs 0.208%
  (68/32768) faster than the best available clock;

* Calculation of timer counter compare values often requires an integer
  division and multiply operation to produce a value that's a multiple
  of clock-ticks-per-second.

Integer division on the Cortex-M1 nRF51 is done in software with a
(value-dependent) algorithm with a non-constant runtime that can be
significant.  This can produce missed Bluetooth deadlines as discussed
in upstream zephyrproject-rtos#14577 and others.

By changing the default divisor to one that evenly divides the 2^15
clock rate the time interrupts are disabled to manage timers is
significantly reduced, as is the error between uptime and real time.  Do
this at the top level, moving SYS_CLOCK_HW_CYCLES_PER_SEC there as well
since the two parameters are related.

Note that the central_hr configuration described in upstream zephyrproject-rtos#13610 does
not distinguish latency due to timer management from other
irq_block/spinlock regions, and the maximum observed latency will still
exceed the nominal 10 us allowed maximum.  However this does occur
much less frequently than changing the timer deadline which can happen
multiple times per tick.

Signed-off-by: Peter A. Bigot <[email protected]>
carlescufi pushed a commit that referenced this issue May 7, 2019
The default system clock on all Nordic devices is based on a 32 KiHz
(2^15 Hz) timer.  Scheduling ticks requires that deadlines be specified
with a timer counter that aligns to a system clock.  With the Zephyr
default 100 clocks-per-sec configuration this results in 100 ticks every
32700 ticks of the cycle timer.  This reveals two problems:

* The uptime clock misrepresents elapsed time because it runs 0.208%
  (68/32768) faster than the best available clock;

* Calculation of timer counter compare values often requires an integer
  division and multiply operation to produce a value that's a multiple
  of clock-ticks-per-second.

Integer division on the Cortex-M1 nRF51 is done in software with a
(value-dependent) algorithm with a non-constant runtime that can be
significant.  This can produce missed Bluetooth deadlines as discussed
in upstream #14577 and others.

By changing the default divisor to one that evenly divides the 2^15
clock rate the time interrupts are disabled to manage timers is
significantly reduced, as is the error between uptime and real time.  Do
this at the top level, moving SYS_CLOCK_HW_CYCLES_PER_SEC there as well
since the two parameters are related.

Note that the central_hr configuration described in upstream #13610 does
not distinguish latency due to timer management from other
irq_block/spinlock regions, and the maximum observed latency will still
exceed the nominal 10 us allowed maximum.  However this does occur
much less frequently than changing the timer deadline which can happen
multiple times per tick.

Signed-off-by: Peter A. Bigot <[email protected]>
@lopsided98
Copy link
Contributor

I am running into this issue on Zephyr master (d045bd7) in my own application on the nrf51_ble400 board. This application is set up as a BLE peripheral and has quite a few features enabled, including BLE security and and custom sensor driver than uses a GPIO interrupt. I have not yet gotten a chance to try to create a minimal example that reproduces the problem. The crash only occurs when a device is connected.

This is the log just before the crash, with CONFIG_BT_CTLR_PROFILE_ISR=y. The latencies do not seem unreasonable compared to #13610 (comment).

[00:07:16.023,956] <inf> bt_ctlr_hci: l: 31, 7, 55; t: 74, 58, 111.
[00:07:16.033,355] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:20.070,220] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 80, 58, 111.
[00:07:20.083,526] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:20.119,049] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 103, 58, 111.
[00:07:20.133,361] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:23.283,477] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 95, 58, 111.
[00:07:23.287,780] <inf> bt_ctlr_hci: l: 8, 7, 55; t: 74, 58, 111.
[00:07:24.994,018] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 61, 58, 111.
[00:07:25.033,386] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:28.333,465] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 95, 58, 111.
[00:07:28.357,818] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:41.812,988] <inf> bt_ctlr_hci: l: 33, 7, 55; t: 74, 58, 111.
[00:07:41.833,343] <inf> bt_ctlr_hci: l: 7, 7, 55; t: 74, 58, 111.
[00:07:45.078,948] <err> bt_ctlr_llsw_ctrl: assert: '!radio_is_ready()' failed
***** HARD FAULT *****
***** Hardware exception *****
Current thread ID = 0x20000010
Faulting instruction address = 0x12c68
Fatal fault in ISR! Spinning...

@cvinayak
Copy link
Collaborator Author

cvinayak commented Jul 1, 2019

@lopsided98 do you have a master commit which was fine? could you check if v1.14-branch works fine for your application? I would try to bisect to find the commit beyond which the controller is seeing increased latencies. Is this reproducible in the vanilla samples/bluetooth/peripheral sample?

Thinking aloud, assuming the assert happened in an encrypted connection, the Radio ISR CPU time too seems to have increased, maybe changes in controller code too may be contributing to the increased overall CPU use. I should do some GPIO debug pin profiling to confirm the new numbers seem above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug
Projects
None yet
Development

No branches or pull requests