Skip to content

pmac32_defconfig boot failure after LLVM commit 1bd0b82e508d049efdb07f4f8a342f35818df341 #1770

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nathanchance opened this issue Dec 13, 2022 · 17 comments
Labels
[ARCH] powerpc This bug impacts ARCH=powerpc [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) [BUG] Untriaged Something isn't working

Comments

@nathanchance
Copy link
Member

I just bisected a boot failure with ARCH=powerpc pmac32_defconfig to commit 1bd0b82e508d ("[SimplifyCFG] FoldBranchToCommonDest(): deal with mismatched IV's in PHI's in common successor block") in LLVM:

$ make -skj$(nproc) ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 LLVM_IAS=0 mrproper pmac32_defconfig

# For proper serial console output
$ scripts/config -e SERIAL_PMACZILOG -e SERIAL_PMACZILOG_CONSOLE

$ make -skj$(nproc) ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 LLVM_IAS=0 olddefconfig vmlinux

$ .../boot-qemu.py -a ppc32_mac -k vmlinux -t 1m

I do not see any console output from the kernel so it appears to be hanging before that point.

@nathanchance nathanchance added [BUG] Untriaged Something isn't working [ARCH] powerpc This bug impacts ARCH=powerpc [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) labels Dec 13, 2022
@nickdesaulniers
Copy link
Member

Thanks for the report. I'm able to repro.'
I'm having a bit of trouble with GDB_SCRIPTS ATM.

But we are not hitting a breakpoint set on start_kernel after llvm/llvm-project@1bd0b82. So we probably have a failure in the arch specific startup code before we jump to the shared common starting point. Though I also don't hit a breakpoint on _start, and gdb crashes when I pause execution.

@nickdesaulniers
Copy link
Member

working around this, I get the following stack trace out of GDB:

(gdb) bt
#0  0xfff0b33c in ?? ()
#1  0xfff0b3a0 in ?? ()
#2  0xfff0cbf0 in ?? ()
#3  0xfff0d180 in ?? ()
#4  0xfff0e824 in ?? ()
#5  0xfff0268c in ?? ()

There are literally no symbols in that range in vmlinux. cc @mpe in case @mpe has any insights to 32b ppc early boot

@mpe
Copy link

mpe commented Dec 15, 2022

Those 0xfff addresses look like they're still in OpenBIOS, not the kernel.

If you look at a good boot log you should be able to see that the kernel is loaded somewhere in memory, for me it's 0x1000000. The kernel initially runs at that address, before moving itself down to 0x0.

You can tell gdb about that and get symbols to resolve at those addresses.

The early boot is a bit hairy, it jumps to the kernel, then back into the bios and back to the kernel. You probably want to skip that, try setting a breakpoint at early_init.

eg:

(gdb) target remote :1234
Remote debugging using :1234
0xfff00100 in ?? ()
(gdb) add-symbol-file vmlinux -s .head.text 0x1000000  -s .text 0x1004000
add symbol table from file "vmlinux" at
        .head.text_addr = 0x1000000
        .text_addr = 0x1004000
(y or n) y
Reading symbols from vmlinux...
(gdb) b early_init
Breakpoint 1 at 0x1a20c48: early_init. (2 locations)
(gdb) c
Continuing.

Breakpoint 1, early_init (dt_ptr=32202752) at /home/michael/linux/arch/powerpc/kernel/early_32.c:21
21              unsigned long kva, offset = reloc_offset();
(gdb)

If you're not even getting that far you can set a breakpoint at 0x1000000 (b *0x1000000), to confirm if you're getting out of the bios at all.

@nickdesaulniers
Copy link
Member

Haha, thanks @mpe ! I figured you'd be the one to ask!

output from a known good build:
https://paste.debian.net/1264308/
In particular:
[ppc] Kernel already loaded (0x01000000 + 0x00c196c0) (initrd 0x01d1a000 + 0x0046a600)

gdb from a known good build:

(gdb) add-symbol-file vmlinux -s .head.text 0x1000000  -s .text 0x1004000
add symbol table from file "vmlinux" at
        .head.text_addr = 0x1000000
        .text_addr = 0x1004000
(y or n) y
Reading symbols from vmlinux...
(gdb) b early_init
Breakpoint 1 at 0x1a09a04: early_init. (2 locations)
(gdb) info b
Num     Type           Disp Enb Address    What
1       breakpoint     keep y   <MULTIPLE> 
1.1                         y   0x01a09a04 in early_init at arch/powerpc/kernel/early_32.c:21
1.2                         y   0xc0a09a04 in early_init at arch/powerpc/kernel/early_32.c:21
(gdb) c
Continuing.

Breakpoint 1, early_init (dt_ptr=35147776) at arch/powerpc/kernel/early_32.c:21
21              unsigned long kva, offset = reloc_offset();
(gdb) bt
#0  early_init (dt_ptr=35147776) at arch/powerpc/kernel/early_32.c:21
#1  0x01000050 in __start () at arch/powerpc/kernel/head_book3s_32.S:144
#2  0x01000028 in __start () at arch/powerpc/kernel/head_book3s_32.S:116
(gdb) info register pc
pc             0x1a09a04           0x1a09a04 <early_init+24>

For a bad build, I never hit early_init

(gdb) add-symbol-file vmlinux -s .head.text 0x1000000  -s .text 0x1004000
add symbol table from file "vmlinux" at
        .head.text_addr = 0x1000000
        .text_addr = 0x1004000
(y or n) y
Reading symbols from vmlinux...
(gdb) b early_init
Breakpoint 1 at 0x1a0da64: early_init. (2 locations)
(gdb) c
Continuing.
<inf loop>

If you're not even getting that far you can set a breakpoint at 0x1000000 (b *0x1000000), to confirm if you're getting out of the bios at all.

(gdb) add-symbol-file vmlinux -s .head.text 0x1000000  -s .text 0x1004000
add symbol table from file "vmlinux" at
        .head.text_addr = 0x1000000
        .text_addr = 0x1004000
(y or n) y
Reading symbols from vmlinux...
(gdb) b *0x1000000
Breakpoint 1 at 0x1000000: file arch/powerpc/kernel/head_book3s_32.S, line 66.
(gdb) c
Continuing.

Breakpoint 1, _text () at arch/powerpc/kernel/head_book3s_32.S:66
66              nop     /* used by __secondary_hold on prep (mtx) and chrp smp */
(gdb) bt
#0  _text () at arch/powerpc/kernel/head_book3s_32.S:66
#1  0xfff0817c in ?? ()
(gdb) info address _text
Symbol "_text" is at 0xc0000000 in a file compiled without debugging.
(gdb) si
Cannot access memory at address 0x1000000
(gdb) info register pc
pc             0x1000000           0x1000000 <_text>

(I can use ctrl+x,a to view source (CONFIG_DEBUG_INFO enabled))

I think that address remapping is throwing gdb off. IIUC si (step instruction) is trying to access 0x1000000 rather than 0xc0000000?

If I set a breakpoint on __start and si, then I get to bl prom_init before things go haywire (Cannot access memory at address 0x1a04004). Looks like I can set breakpoints in the destination and hit them, it's just si that's having issues with bl.

The offending commit got reverted in llvm, but it's likely to be back soon, and might still affect us.
https://reviews.llvm.org/rG37b8f09a4b61bf9bf9d0b9017d790c8b82be2e17

@nickdesaulniers
Copy link
Member

Cannot access memory at address 0x1000000

It seems if I exit TUI I can step...

prom_init in arch/powerpc/kernel/prom_init.c seems curious. __start in arch/powerpc/kernel/head_book3s_32.S seems to jump to prom_init which does work then jumps back to __start, IIUC. I only see __start hit the second time, eventually (qemu really slows to a crawl...)

Comparing disassembly of arch/powerpc/kernel/prom_init.c, the following symbols have non-trivial changes:

  • prom_init_mem
  • call_prom
  • alloc_up
  • scan_dt_build_struct

Where the differences in call_prom seem the most significant. call_prom is variadic, I wonder if this va_start/va_end handling is going awry?

If I swap in a bad

$ make -kj$(nproc) ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 LLVM_IAS=0 CC=clang-bad arch/powerpc/kernel/prom_init.o
$ cp arch/powerpc/kernel/prom_init.o prom_init.o.bad
$ make -kj$(nproc) ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 LLVM_IAS=0 CC=cla
ng-good clean all
$ cp prom_init.o.bad arch/powerpc/kernel/prom_init.o
$ make -kj$(nproc) ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 LLVM_IAS=0 CC=cla
ng-good all

the kernel fails to boot.

Yeah, just takes a long time to hit these breakpoints.

Waiting to see if I ever hit a breakpoint in bootx_init or early_init.

@LebedevRI
Copy link

Please retry with current LLVM main.
I have reasons to believe that llvm/llvm-project@07ecdd9 squashed the miscompile,
so reland (llvm/llvm-project@428f364) should be seamless.
If not, i'll need another repro :)

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Dec 17, 2022

Sorry for the bad news, @LebedevRI, I can still repro the boot failure.

Linux version 6.1.0-rc8+ ([email protected]) (clang version 16.0.0 ([email protected]:llvm/llvm-project.git 428f36401b1b695fd501ebfdc8773bed8ced8d4e), LLD 16.0.0) #20 Fri Dec 16 19:52:00 PST 2022


If I swap in a bad

Doing the same experiment in reverse (use clang-bad for whole kernel, clang-good for prom_init.o), the kernel boots!

$ make -kj$(nproc) ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 LLVM_IAS=0 CC=clang-bad clean all
$ cp prom_init.o.good arch/powerpc/kernel/prom_init.o
$ make -kj$(nproc) ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 LLVM_IAS=0 CC=clang-bad all

so I think that isolates the failure down to arch/powerpc/kernel/prom_init.c

diffing IR, I see diffs in

  • prom_init_mem
  • call_prom (seems to drop a call to @prom_entry IIUC?)
  • alloc_up
  • scan_dt_build_struct (seems to add a call to @call_prom IICU)

@LebedevRI
Copy link

@nickdesaulniers ok.
Can you do a few more experiments please: on top of the current main,

  1. revert simplifycfg patch llvm/llvm-project@428f364, AND revert instcombine patch llvm/llvm-project@9ddff66
  2. revert only the simplifycfg patch llvm/llvm-project@428f364
  3. revert only the instcombine patch llvm/llvm-project@9ddff66
    Which ones lead to broken kernel?

@nickdesaulniers
Copy link
Member

prom_init.i.txt attaching preprocessed source from 428f36401b1b695fd501ebfdc8773bed8ced8d4e. The first line has a comment that should be removed and used as the command line. Running above tests now.

@nickdesaulniers
Copy link
Member

requested tests' results:

  1. 428f36401b1b + revert 428f36401b1b + revert 9ddff66: boots
  2. 428f36401b1b + revert 428f36401b1b: boots
  3. 428f36401b1b + revert 9ddff66: does not boot

@mpe
Copy link

mpe commented Dec 18, 2022

prom_init in arch/powerpc/kernel/prom_init.c seems curious. __start in arch/powerpc/kernel/head_book3s_32.S seems to jump to prom_init which does work then jumps back to __start, IIUC.

That's expected.

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Dec 20, 2022

I've tried moving call_prom to a different translation unit, so that I can drop in object files compiled with different compiler revisions. The result fails to boot though (with a known good compiler rev, which is bad+unexpected).

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 9b6146056e48..45a606099703 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -11,6 +11,8 @@ CFLAGS_prom_init.o      += -fPIC
 CFLAGS_btext.o         += -fPIC
 endif
 
+CFLAGS_call_prom.o +=-fPIC -fno-stack-protector -ffreestanding -ftrivial-auto-var-init=uninitialized
+
 CFLAGS_early_32.o += $(DISABLE_LATENT_ENTROPY_PLUGIN)
 CFLAGS_cputable.o += $(DISABLE_LATENT_ENTROPY_PLUGIN)
 CFLAGS_prom_init.o += $(DISABLE_LATENT_ENTROPY_PLUGIN)
@@ -200,14 +202,17 @@ CFLAGS_paca.o                     += -fno-stack-protector
 obj-$(CONFIG_PPC_FPU)          += fpu.o
 obj-$(CONFIG_ALTIVEC)          += vector.o
 obj-$(CONFIG_PPC64)            += entry_64.o
-obj-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE)   += prom_init.o
+obj-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE)   += prom_init.o call_prom.o
 
 extra-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init_check
 
+$(obj)/prom_hybrid.o: $(obj)/prom_init.o $(obj)/call_prom.o
+       $(LD) -r $? -o $@
+
 quiet_cmd_prom_init_check = PROMCHK $@
-      cmd_prom_init_check = $(CONFIG_SHELL) $< "$(NM)" $(obj)/prom_init.o; touch $@
+      cmd_prom_init_check = $(CONFIG_SHELL) $< "$(NM)" $(obj)/prom_hybrid.o; touch $@
 
-$(obj)/prom_init_check: $(src)/prom_init_check.sh $(obj)/prom_init.o FORCE
+$(obj)/prom_init_check: $(src)/prom_init_check.sh $(obj)/prom_hybrid.o FORCE
        $(call if_changed,prom_init_check)
 targets += prom_init_check

(then the definition of call_prom moved from arch/powerpc/kernel/prom_init.c to arch/powerpc/kernel/call_prom.c)
🤷‍♂️

In qemu+gdb, we get to the va_start() processing in prom_printf(), then jump off to 0x300 somehow which I don't understand). FWICT in gdb stepping instructions at a time, we go from lbz 3, 0(29) to 0x300...doesn't make sense...

(curiously, both prom_printf() and it's callee call_prom are variadic, and operate on va_lists in a loop. I wonder if that's relevant or a red herring)

@mpe
Copy link

mpe commented Dec 21, 2022

I've tried moving call_prom to a different translation unit, so that I can drop in object files compiled with different compiler revisions. The result fails to boot though (with a known good compiler rev, which is bad+unexpected).

I can boot OK with call_prom in a separate compilation unit using this patch and building with clang 13/binutils 2.37.

In qemu+gdb, we get to the va_start() processing in prom_printf(), then jump off to 0x300 somehow which I don't understand). FWICT in gdb stepping instructions at a time, we go from lbz 3, 0(29) to 0x300...doesn't make sense...

0x300 is the page fault handler, so presumably you're accessing some bad address, contained in r29.

@LebedevRI
Copy link

In qemu+gdb, we get to the va_start() processing in prom_printf(), then jump off to 0x300 somehow which I don't understand). FWICT in gdb stepping instructions at a time, we go from lbz 3, 0(29) to 0x300...doesn't make sense...

0x300 is the page fault handler, so presumably you're accessing some bad address, contained in r29.

Can you show a call stack for this? From which function did we call prom_printf()?
IIRC it's not being changed by the patches in question?

@nickdesaulniers
Copy link
Member

Sorry for the delay here...took two weeks off for the holidays.

So I tried llvm @ 9e08b083a09ef4e02fb0a4de2c0d3ddc0eccadde with 3a8e009f97d2ed4b751365e514e0bf7131c7ab11 reverted (i.e. restoring [SimplifyCFG] FoldBranchToCommonDest(): deal with mismatched IV's in PHI's in common successor block). I still hit the boot failure. :(

Next, I will try @mpe 's patch (thanks!) to see if I can isolate which function perhaps is problematic. (I will try the functions listed in #1770 (comment)).

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Jan 13, 2023

I can boot OK with call_prom in a separate compilation unit using this patch and building with clang 13/binutils 2.37.

Hmm...that patch fails to boot for me with clang 16/binutils 2.39. That toolchain combo seems to work otherwise without that kernel patch applied. I don't see anything obviously wrong with the kernel patch...

In qemu+gdb, we get to the va_start() processing in prom_printf(), then jump off to 0x300 somehow which I don't understand).

The same thing happens in fact.

0x300 is the page fault handler, so presumably you're accessing some bad address, contained in r29.

FWICT, this seems to be the format parameter to prom_printf, which should be the string literal "OF stdout device is: %s\n". We seem to hit the page fault handler for both the booting and non booting case. In the non-booting case, it seems that we never hit prom_setprop after the initial call to prom_printf.

In the booting case, in prom_printf we hit breakpoints in call_prom twice. In the non-booting case, we hit it only once.

In the non booting case, we enter call_prom from call_prom from prom_printf. We have:

mtctr r3
mr r3,r29
bctrl
cmpwi r3,0
...

once we execute bctrl we jump off to somewhere else and a breakpoint on the cmpwi is never hit; in the booting case it is. Backtrace looks like:

(gdb) bt
#0  0x01a2f9b8 in enter_prom (args=0x7de7f08, entry=4293928400) at arch/powerpc/kernel/call_prom.c:21
#1  call_prom (service=<optimized out>, nargs=<optimized out>, nret=1) at arch/powerpc/kernel/call_prom.c:46
#2  0x01a25784 in prom_printf (format=<optimized out>) at arch/powerpc/kernel/prom_init.c:464
#3  0x01a25508 in prom_init_stdout () at arch/powerpc/kernel/prom_init.c:2232
#4  0x01a25078 in prom_init (r3=30515200, r4=4630016, pp=<optimized out>, r6=<optimized out>, r7=<optimized out>, 
    kbase=16777216) at arch/powerpc/kernel/prom_init.c:3283
#5  0x01000028 in __start () at arch/powerpc/kernel/head_book3s_32.S:118

TIL PPC has a branch "control register" (CTR). bctrl is "unconditionally branch and link to the control register." I didn't look up the neumonic for mtctr but it seems to be moving its register operand's contents into ctr IIUC.

In the booting case before bctrl:

(gdb) info register ctr
ctr            0xfff025d0          4293928400

(Which has the same value later at the breakpoint on cmpwi). The same value is in ctr for the non booting case.

So it seems like whatever is at 0xfff025d0 we never "return" from. I'm not sure that symbol is in vmlinux:

$ llvm-addr2line 0xfff025d0 --obj=vmlinux
??:0
$ llvm-readelf -s vmlinux| grep fff025d0
$

non-booting case:

(gdb) p args
$5 = {service = 26547945, nargs = 3, nret = 1, args = {0, 26429198, 21, 0, 0, 26740056, 3, 1, 0, 29241388}}

booting case:

(gdb) p args
$2 = {service = 26547945, nargs = 3, nret = 1, args = {0, 26429198, 21, 0, 0, 26740056, 3, 1, 0, 29241388}}

So the same values. In hex:

(gdb) p/x args.service
$9 = 0x19516e9
(gdb) p/x args.args[1]
$10 = 0x193470e
(gdb) p/x args.args[3]
$11 = 0xffffffff
(gdb) p/x args.args[5]
$12 = 0x1980558

In the booting case, I noticed a slight difference of assembler:

  mtctr r3
-  mr r3,r29
  bctrl
  cmpwi r3,0
...

(but we have the same value in ctr when we execute bctrl, so perhaps that's a red herring).

@mpe any idea what's going on? It looks like with a known good compiler rev + your patch that one of the calls into the prom via bctrl with the same arguments, but we never return to the subsequent instruction after the bctrl. Perhaps something is going wrong within the prom?

In gdb I have no idea how you print what variable is in a register, or given a variable what register is it in. DWARF definitely has this info encoded. Let's read DWARF by hand!

$ llvm-objdump -Dr --no-show-raw-insn --disassemble-symbols=call_prom arch/powerpc/kernel/call_prom.o    
...
     110:       mtctr 3               ; address is offset 0x110
     114:       mr      3, 29
     118:       bctrl
     11c:       cmpwi   3, 0
...
$ llvm-dwarfdump arch/powerpc/kernel/call_prom.o
...
0x00000111:       DW_TAG_formal_parameter
                    DW_AT_location      (indexed (0x5) loclist = 0x00000094: 
                       [0x00000110, 0x00000118): DW_OP_reg3 R3      ; starting at address offset 0x110
                       [0x00000140, 0x0000014c): DW_OP_reg4 R4)
                    DW_AT_abstract_origin       (0x000000af "entry")   ; is variable "entry"
...

So if IIUC we:

  1. move entry into ctr via mtctr r3
  2. branch to it via bctrl.

Seems right to me. So why don't we ever return from the prom?


Copy+paste these gdb commands again for myself, so I don't have to keep scrolling up to get them:

add-symbol-file vmlinux -s .head.text 0x1000000 -s .text 0x1004000
b *0x1000000
b prom_init
b prom_init_stdout
b prom_printf
layout asm
layout src
b *0x300
b call_prom
b prom_setprop

@LebedevRI
Copy link

@alexfh FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[ARCH] powerpc This bug impacts ARCH=powerpc [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) [BUG] Untriaged Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants