Skip to content

ASoC: SOF: debug: change runtime PM get/put to correct usuage #237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 14, 2018

Conversation

keyonjie
Copy link

@keyonjie keyonjie commented Nov 1, 2018

this should fix #233

ASoC: SOF: debug: change runtime PM get/put to correct usuage

When doing debugFS entry read, we should make sure it won't enter
suspend during the reading, and the reading won't interact the runtime
PM(e.g. if the idle timeout is expired when should enter runtime suspend
ASSP) neither.

Here change to use get_noresume() to make sure not resuming ADSP for
debugFS reading, and put_sync_autosuspend() to ensure we can enter
runtime suspend after reading finished if needed.

Signed-off-by: Keyon Jie [email protected]

@xiulipan
Copy link

xiulipan commented Nov 1, 2018

@keyonjie
In fail cases, the DSP are not able to power off, so this fixes could work.
when runtime pm power off the DSP, then we try to read from memory window. What will happen?

Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Humm, in the case "For audio suspend case, reading back old value is reasonable", do you mean to say values that may not be up-to-date but weren't previously read? Or that the debugs info will always be consistent but not necessarily up-to-date. Not sure what you meant by 'old' and whether the data can be inconsistent/misleading?

@lgirdwood
Copy link
Member

lgirdwood commented Nov 1, 2018

@plbossart I think "old" means the current values before the FW crash (that prevents runtime PM transitions). Iirc, a FW panic would prevent any D3 entry or core PM so that the SRAM contents should be valid here. @keyonjie can you confirm if this is what you can see.

@keyonjie
Copy link
Author

keyonjie commented Nov 2, 2018

@plbossart @lgirdwood the value depends on what is in DSP MMIO(memory window) at that moment, for DSP crashed or suspended cases, it might be the valid or invalid(e.g. 0xffffffff), but that is what we can try to read back, for analyzing.

The word "old value" is not accurate, now corrected.

@xiulipan
Copy link

xiulipan commented Nov 2, 2018

@keyonjie
Will this have regression about the error trace disappear? If we did not power up the DSP when we try to get etrace, will it show 0xffffffff as it is invalid.

@keyonjie
Copy link
Author

keyonjie commented Nov 5, 2018

@xiulipan when we try to get etrace at suspend, it might show invalid or 0xffffffff, but I don't want to call it regression.
If we want to get etrace, we can do that after ADSP is waked up again, e.g. open alsamixer. So this is only design style choice, if we want(or can bear) to disturb the normal running of ADSP when reading debugFS.

@xiulipan
Copy link

xiulipan commented Nov 5, 2018

@keyonjie
The previous design is to wake up DSP as we want to read the memory window. But every wake up will clean the memory window. So the previous run log will be cleaned after DSP suspend.
But for most debug cases, we want to check what caused the last run to exit or error, the logs are cleared by DSP power on.

For this PR, we can not get the last run log as well.

To achieve the goal to read error trace all the time, we should have more change about trace.

@keyonjie
Copy link
Author

keyonjie commented Nov 5, 2018

@xiulipan agree, this PR can't fix the error trace be cleared after resume issue.

@lgirdwood
Copy link
Member

@keyonjie @xiulipan lets have a debubFS file that can be used to enable/disable runtime PM. That way we can disable D3 entry from cmd line if needed.

@plbossart
Copy link
Member

@lgirdwood @ranj063 don't we already have existing pm_runtime hooks in /sys to disable D3 entry?

@keyonjie
Copy link
Author

keyonjie commented Nov 6, 2018

@lgirdwood @plbossart we already have that in /sys/bus/pci/device/xxx/power/control for that purpose.

@lgirdwood
Copy link
Member

@keyonjie great, lets document that then under debug in public wiki

@plbossart
Copy link
Member

@keyonjie I can't figure out if we need this patch merged, or if it contains a design flaw. Can you clarify what the status is, the discussion is not self-explanatory any longer.

@ranj063
Copy link
Collaborator

ranj063 commented Nov 6, 2018

@lgirdwood @plbossart we already have that in /sys/bus/pci/device/xxx/power/control for that purpose.

its actually /sys/devices/pci*/00*0e/power/control

@ranj063
Copy link
Collaborator

ranj063 commented Nov 6, 2018

@plbossart I think we might still need this patch to solve the problem of not being able to read debugfs when dsp panic occurs.

we do not use get/put while reading trace debugfs, so this seems OK to me.

@keyonjie
Copy link
Author

keyonjie commented Nov 8, 2018

@plbossart as Ranjani stated, I think better to apply this, it helps on dsp panic trace/debug, and no harm(or it actually improve to avoid interaction to DSP running during debugFS reading on suspend) for other cases.

@plbossart
Copy link
Member

maybe a stupid comment but wouldn't pm_runtime_get_noresume and put_autosuspend be appropriate here? Nothing would happen if the device is already sleeping or crashed.

@ranj063
Copy link
Collaborator

ranj063 commented Nov 8, 2018

@plbossart @keyonjie I think put_autosuspend will still cause errors.

But get_noresume() and put_noidle() will just increment and decrement the usage counters and it might be useful to have this to prevent the dsp from entering D3 (if it were in D0 before) during debugfs read.

@keyonjie
Copy link
Author

keyonjie commented Nov 9, 2018

@ranj063 sounds good to use get_noresume() which will help prevent dsp to enter D3,
but not suitable to use put_noidle() fore return as it won't decrease to 0 where we need it actually.

@plbossart comparing to put_autosuspend(), I think we should use put_sync_autosuspend(), the latter one will use sync mode which may lead to suspend ASAP which means our debugFS reading won't influence runtime PM too much.

@lgirdwood I think you are familiar with this, any comments?

@keyonjie keyonjie changed the title ASoC: SOF: debug: don't hold runtime PM for debugFS entries reading ASoC: SOF: debug: change runtime PM get/put to correct usuage Nov 9, 2018
@keyonjie
Copy link
Author

keyonjie commented Nov 9, 2018

@lgirdwood @plbossart @ranj063 updated as stated above, please help review.

@ranj063
Copy link
Collaborator

ranj063 commented Nov 9, 2018

@keyonjie if the dsp has crashed and you request a put_sync_autosuspend(), what happens then?

@keyonjie
Copy link
Author

keyonjie commented Nov 9, 2018

@keyonjie if the dsp has crashed and you request a put_sync_autosuspend(), what happens then?

it just put the count we got above, if it is planed to go suspend, then let it go suspend.
It don't change the behavior, that means, keep what it should do as that without the debugFS read, keep what the driver will do autosuspend or not at DSP panic.

@plbossart
Copy link
Member

There are a couple of points that are completely unclear to me.

using pm_runtime_get_noresume will only not have any effect if the DSP is in D3. If it is in D0, it will prevent entry in D3. What about when the DSP is crashed? What is the effect? Does this solve the problem that was initially reported?

When you use autosuspend you typically use mark_last_busy to provide a reference point. If you use pm_runtime_put_sync_autosuspend, when exactly will the transition to D3 take place. And what's the difference with pm_runtime_put_sync if you haven't used the mark-last-busy (which possibly makes no sense if you are already in D3)?

If we don't have the answers then I'll need experimental evidence that this patch solves the initial problem and has no side effects before merging it.

@keyonjie
Copy link
Author

There are a couple of points that are completely unclear to me.

using pm_runtime_get_noresume will only not have any effect if the DSP is in D3. If it is in D0, it will prevent entry in D3. What about when the DSP is crashed? What is the effect? Does this solve the problem that was initially reported?

Currently, if DSP is crashed, usually we are still in D0 in driver side(we may do improvement in future, e.g. unload/reload modules to try to recover, or disable runtime PM when DSP panic detected..), the patch here will solved the #233, I have tried we can read etrace at this case, we actually treat it as normal D0.

When you use autosuspend you typically use mark_last_busy to provide a reference point. If you use pm_runtime_put_sync_autosuspend, when exactly will the transition to D3 take place. And what's the difference with pm_runtime_put_sync if you haven't used the mark-last-busy (which possibly makes no sense if you are already in D3)?

I am not sure about this, I intent to don't mark-busy here which means don't change the normal audio device runtime suspend/resume per my understanding, just raise question on PM ML.

If we don't have the answers then I'll need experimental evidence that this patch solves the initial problem and has no side effects before merging it.

I had done experiment:
for DSP panic case, issue #233 is fixed, we can read etrace and other memory windows values.
for D0 case, no regression found.
for D3 case, we can also read back entries without resuming DSP.

@plbossart
Copy link
Member

When you use autosuspend you typically use mark_last_busy to provide a reference point. If you use pm_runtime_put_sync_autosuspend, when exactly will the transition to D3 take place. And what's the difference with pm_runtime_put_sync if you haven't used the mark-last-busy (which possibly makes no sense if you are already in D3)?

I am not sure about this, I intent to don't mark-busy here which means don't change the normal audio device runtime suspend/resume per my understanding, just raise question on PM ML.

Can you add a TODO comment in the code to track this is an open point so that we can merge this first and deal with this later.

@keyonjie
Copy link
Author

When you use autosuspend you typically use mark_last_busy to provide a reference point. If you use pm_runtime_put_sync_autosuspend, when exactly will the transition to D3 take place. And what's the difference with pm_runtime_put_sync if you haven't used the mark-last-busy (which possibly makes no sense if you are already in D3)?

I am not sure about this, I intent to don't mark-busy here which means don't change the normal audio device runtime suspend/resume per my understanding, just raise question on PM ML.

Can you add a TODO comment in the code to track this is an open point so that we can merge this first and deal with this later.

Sure, let me add this now.

@keyonjie keyonjie force-pushed the sof-pr branch 2 times, most recently from 68ac58a to 211d0a6 Compare November 14, 2018 01:44
When doing debugFS entry read, we should make sure it won't enter
suspend during the reading, and the reading won't interact the runtime
PM(e.g. if the idle timeout is expired when should enter runtime suspend
ASSP) neither.

Here change to use get_noresume() to make sure not resuming ADSP for
debugFS reading, and put_sync_autosuspend() to ensure we can enter
runtime suspend after reading finished if needed.

Signed-off-by: Keyon Jie <[email protected]>
@keyonjie
Copy link
Author

@plbossart updated.

@plbossart plbossart merged commit 3c1f692 into thesofproject:topic/sof-dev Nov 14, 2018
plbossart pushed a commit that referenced this pull request Mar 22, 2021
Currently without THP being enabled, MAX_ORDER via FORCE_MAX_ZONEORDER gets
reduced to 11, which falls below HUGETLB_PAGE_ORDER for certain 16K and 64K
page size configurations. This is problematic which throws up the following
warning during boot as pageblock_order via HUGETLB_PAGE_ORDER order exceeds
MAX_ORDER.

WARNING: CPU: 7 PID: 127 at mm/vmstat.c:1092 __fragmentation_index+0x58/0x70
Modules linked in:
CPU: 7 PID: 127 Comm: kswapd0 Not tainted 5.12.0-rc1-00005-g0221e3101a1 #237
Hardware name: linux,dummy-virt (DT)
pstate: 20400005 (nzCv daif +PAN -UAO -TCO BTYPE=--)
pc : __fragmentation_index+0x58/0x70
lr : fragmentation_index+0x88/0xa8
sp : ffff800016ccfc00
x29: ffff800016ccfc00 x28: 0000000000000000
x27: ffff800011fd4000 x26: 0000000000000002
x25: ffff800016ccfda0 x24: 0000000000000002
x23: 0000000000000640 x22: ffff0005ffcb5b18
x21: 0000000000000002 x20: 000000000000000d
x19: ffff0005ffcb3980 x18: 0000000000000004
x17: 0000000000000001 x16: 0000000000000019
x15: ffff800011ca7fb8 x14: 00000000000002b3
x13: 0000000000000000 x12: 00000000000005e0
x11: 0000000000000003 x10: 0000000000000080
x9 : ffff800011c93948 x8 : 0000000000000000
x7 : 0000000000000000 x6 : 0000000000007000
x5 : 0000000000007944 x4 : 0000000000000032
x3 : 000000000000001c x2 : 000000000000000b
x1 : ffff800016ccfc10 x0 : 000000000000000d
Call trace:
__fragmentation_index+0x58/0x70
compaction_suitable+0x58/0x78
wakeup_kcompactd+0x8c/0xd8
balance_pgdat+0x570/0x5d0
kswapd+0x1e0/0x388
kthread+0x154/0x158
ret_from_fork+0x10/0x30

This solves the problem via keeping FORCE_MAX_ZONEORDER unchanged with or
without THP on 16K and 64K page size configurations, making sure that the
HUGETLB_PAGE_ORDER (and pageblock_order) would never exceed MAX_ORDER.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
aiChaoSONG pushed a commit to aiChaoSONG/linux that referenced this pull request May 6, 2021
plbossart pushed a commit that referenced this pull request Jul 31, 2023
Currently there is no synchronisation between finalize_pkvm() and
kvm_arm_init() initcalls. The finalize_pkvm() proceeds happily even if
kvm_arm_init() fails resulting in the following warning on all the CPUs
and eventually a HYP panic:

  | kvm [1]: IPA Size Limit: 48 bits
  | kvm [1]: Failed to init hyp memory protection
  | kvm [1]: error initializing Hyp mode: -22
  |
  | <snip>
  |
  | WARNING: CPU: 0 PID: 0 at arch/arm64/kvm/pkvm.c:226 _kvm_host_prot_finalize+0x30/0x50
  | Modules linked in:
  | CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0 #237
  | Hardware name: FVP Base RevC (DT)
  | pstate: 634020c5 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  | pc : _kvm_host_prot_finalize+0x30/0x50
  | lr : __flush_smp_call_function_queue+0xd8/0x230
  |
  | Call trace:
  |  _kvm_host_prot_finalize+0x3c/0x50
  |  on_each_cpu_cond_mask+0x3c/0x6c
  |  pkvm_drop_host_privileges+0x4c/0x78
  |  finalize_pkvm+0x3c/0x5c
  |  do_one_initcall+0xcc/0x240
  |  do_initcall_level+0x8c/0xac
  |  do_initcalls+0x54/0x94
  |  do_basic_setup+0x1c/0x28
  |  kernel_init_freeable+0x100/0x16c
  |  kernel_init+0x20/0x1a0
  |  ret_from_fork+0x10/0x20
  | Failed to finalize Hyp protection: -22
  |     dtb=fvp-base-revc.dtb
  | kvm [95]: nVHE hyp BUG at: arch/arm64/kvm/hyp/nvhe/mem_protect.c:540!
  | kvm [95]: nVHE call trace:
  | kvm [95]:  [<ffff800081052984>] __kvm_nvhe_hyp_panic+0xac/0xf8
  | kvm [95]:  [<ffff800081059644>] __kvm_nvhe_handle_host_mem_abort+0x1a0/0x2ac
  | kvm [95]:  [<ffff80008105511c>] __kvm_nvhe_handle_trap+0x4c/0x160
  | kvm [95]:  [<ffff8000810540fc>] __kvm_nvhe___skip_pauth_save+0x4/0x4
  | kvm [95]: ---[ end nVHE call trace ]---
  | kvm [95]: Hyp Offset: 0xfffe8db00ffa0000
  | Kernel panic - not syncing: HYP panic:
  | PS:a34023c9 PC:0000f250710b973c ESR:00000000f2000800
  | FAR:ffff000800cb00d0 HPFAR:000000000880cb00 PAR:0000000000000000
  | VCPU:0000000000000000
  | CPU: 3 PID: 95 Comm: kworker/u16:2 Tainted: G        W          6.4.0 #237
  | Hardware name: FVP Base RevC (DT)
  | Workqueue: rpciod rpc_async_schedule
  | Call trace:
  |  dump_backtrace+0xec/0x108
  |  show_stack+0x18/0x2c
  |  dump_stack_lvl+0x50/0x68
  |  dump_stack+0x18/0x24
  |  panic+0x138/0x33c
  |  nvhe_hyp_panic_handler+0x100/0x184
  |  new_slab+0x23c/0x54c
  |  ___slab_alloc+0x3e4/0x770
  |  kmem_cache_alloc_node+0x1f0/0x278
  |  __alloc_skb+0xdc/0x294
  |  tcp_stream_alloc_skb+0x2c/0xf0
  |  tcp_sendmsg_locked+0x3d0/0xda4
  |  tcp_sendmsg+0x38/0x5c
  |  inet_sendmsg+0x44/0x60
  |  sock_sendmsg+0x1c/0x34
  |  xprt_sock_sendmsg+0xdc/0x274
  |  xs_tcp_send_request+0x1ac/0x28c
  |  xprt_transmit+0xcc/0x300
  |  call_transmit+0x78/0x90
  |  __rpc_execute+0x114/0x3d8
  |  rpc_async_schedule+0x28/0x48
  |  process_one_work+0x1d8/0x314
  |  worker_thread+0x248/0x474
  |  kthread+0xfc/0x184
  |  ret_from_fork+0x10/0x20
  | SMP: stopping secondary CPUs
  | Kernel Offset: 0x57c5cb460000 from 0xffff800080000000
  | PHYS_OFFSET: 0x80000000
  | CPU features: 0x00000000,1035b7a3,ccfe773f
  | Memory Limit: none
  | ---[ end Kernel panic - not syncing: HYP panic:
  | PS:a34023c9 PC:0000f250710b973c ESR:00000000f2000800
  | FAR:ffff000800cb00d0 HPFAR:000000000880cb00 PAR:0000000000000000
  | VCPU:0000000000000000 ]---

Fix it by checking for the successfull initialisation of kvm_arm_init()
in finalize_pkvm() before proceeding any futher.

Fixes: 87727ba ("KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege")
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: James Morse <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Zenghui Yu <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

could not access debugfs when DSP panic or IPC failed
5 participants