Adding push2/pop2 #116035

Open
wants to merge 5 commits into
base: main

DeepakRajendrakumaran
Contributor

@DeepakRajendrakumaran DeepakRajendrakumaran commented May 27, 2025

PR Overview

This PR does the following

  1. Enable Push2 and Pop2 instructions
  2. Enable PPX features for Push/Pop/Push2/Pop2
  3. Modify function epilog/prolog to use push2/pop2 and PPX

APX and PPX

As part of Intel APX (Intel Advanced Performance Extensions), two new features are available for working with the stack:

  • PUSH2/POP2 instructions that transfer two register values within a single memory operation.
  • PPX (push-pop acceleration): a hint that helps the processor track these new instructions internally and fast-forward register data between matching PUSH2 and POP2 instructions without going through memory. It also applies to PUSH/POP with REX2 encoding.

This write-up focuses on push2/pop2.

PUSH2/POP2

PUSH2 and POP2 are two new instructions for pushing and popping, respectively, two GPRs at a time to/from the stack. These instructions use extended EVEX encoding. The data pushed or popped by PUSH2/POP2 must be 16B-aligned on the stack.
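
The 16-byte alignment requirement shapes how callee-saved registers can be grouped into pairs. Below is a minimal sketch of one possible grouping strategy, assuming RSP is either 16-byte aligned or off by exactly 8 before the first push. The function `planPrologPushes` and its parameters are hypothetical names for illustration, not the JIT's actual implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the JIT's actual code): groups callee-saved
// registers into push2p pairs so that every pair lands on a 16-byte boundary.
// Assumes RSP is either 16-byte aligned or misaligned by exactly 8 bytes.
std::vector<std::string> planPrologPushes(const std::vector<std::string>& regs,
                                          bool rspIs16ByteAligned)
{
    std::vector<std::string> plan;
    std::size_t i = 0;

    // A single 8-byte push restores 16-byte alignment when RSP is off by 8.
    if (!rspIs16ByteAligned && !regs.empty())
    {
        plan.push_back("pushp " + regs[i++]);
    }

    // Starting from an aligned RSP, each 16-byte push2p keeps RSP aligned.
    for (; i + 1 < regs.size(); i += 2)
    {
        plan.push_back("push2p " + regs[i] + ", " + regs[i + 1]);
    }

    // An odd register left over is pushed on its own.
    if (i < regs.size())
    {
        plan.push_back("pushp " + regs[i]);
    }
    return plan;
}
```

Calling `planPrologPushes({"rbp", "r14", "rdi", "rsi", "rbx"}, false)` yields one `pushp` followed by two `push2p` instructions, which is the shape of the second sample diff in this description.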

Guidance from Intel

This is not part of the spec, but in current implementations push2/pop2 should really only be used with PPX hints, and thus only in matching “pairs”, i.e.

push2.p
…
pop2.p
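
One way to keep pushes and pops in matching pairs is to derive the epilog mechanically from the prolog. The sketch below reverses the prolog sequence and swaps the operands of each pair, which is the pattern visible in the sample diffs in this description (for example `push2p rdi, rsi` is matched by `pop2p rsi, rdi`). The `StackOp` type and `mirrorForEpilog` function are hypothetical names, not part of the JIT.

```cpp
#include <string>
#include <vector>

// Hypothetical model of one prolog or epilog stack instruction.
struct StackOp
{
    bool        isPair; // true for push2p/pop2p, false for pushp/popp
    std::string first;  // first operand
    std::string second; // second operand; empty when isPair is false
};

// Derives the epilog pops from the prolog pushes: the sequence is reversed
// and the operands of each pair are swapped, so every pop2p undoes exactly
// one push2p.
std::vector<StackOp> mirrorForEpilog(const std::vector<StackOp>& prolog)
{
    std::vector<StackOp> epilog;
    for (auto it = prolog.rbegin(); it != prolog.rend(); ++it)
    {
        if (it->isPair)
        {
            epilog.push_back({true, it->second, it->first});
        }
        else
        {
            epilog.push_back({false, it->first, ""});
        }
    }
    return epilog;
}
```

Generating the epilog from the same list that produced the prolog makes it hard to end up with a `push2p` that has no corresponding `pop2p`.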

Unwind code

Windows does not currently support unwinding for push2. After discussion with Kunal, I decided to use two unwindPush() calls to simulate push2. This will need to be updated once support is available.
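
A minimal sketch of that workaround: a push2 is recorded as two ordinary push codes that share one code offset. `UnwindRecorder` and its members are hypothetical names used only for illustration; the value 0 for `UWOP_PUSH_NONVOL` and the reverse-prolog ordering of the table follow the Windows x64 unwind format and the unwind dumps in this description.

```cpp
#include <cstdint>
#include <vector>

// UWOP_PUSH_NONVOL has the value 0 in the Windows x64 unwind format.
constexpr std::uint8_t kUwopPushNonvol = 0;

struct UnwindCode
{
    std::uint8_t codeOffset; // prolog offset just past the instruction
    std::uint8_t unwindOp;   // unwind operation code
    std::uint8_t reg;        // register number (OpInfo)
};

// Hypothetical recorder, not the JIT's actual unwind code.
class UnwindRecorder
{
public:
    void push(std::uint8_t codeOffset, std::uint8_t reg)
    {
        m_codes.push_back({codeOffset, kUwopPushNonvol, reg});
    }

    // Simulates push2 as two single pushes that share one code offset,
    // because both registers are saved by the same instruction.
    void push2(std::uint8_t codeOffset, std::uint8_t reg1, std::uint8_t reg2)
    {
        push(codeOffset, reg1);
        push(codeOffset, reg2);
    }

    // Unwind codes are listed in reverse prolog order.
    std::vector<UnwindCode> table() const
    {
        return std::vector<UnwindCode>(m_codes.rbegin(), m_codes.rend());
    }

private:
    std::vector<UnwindCode> m_codes;
};
```

Recording `push2(0x13, 7, 6)` followed by `push(0x16, 3)` produces the entries rbx at 0x16, then rsi and rdi both at 0x13, the same shape as the unwind info in the first sample diff.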

Testing done

  1. Emitter unit tests added and checked to verify encoding
  2. SuperPMI run with APX enabled

SuperPMI results with/without the PPX feature

Diffs are based on 2,602,472 contexts (1,012,864 MinOpts, 1,589,608 FullOpts).

MISSED contexts: 17 (0.00%)

Base JIT options: JitBypassApxCheck=1

Diff JIT options: EnableApxPPX=1;JitBypassApxCheck=1

Overall (+15,581,843 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs Base Instruction Count Diff Instruction Count
benchmarks.run.windows.x64.checked.mch 12,265,429 +449,098 -1.75% 3080529 -86,249(-2.98%)(-3.32%)
benchmarks.run_pgo.windows.x64.checked.mch 65,974,296 +1,060,625 -2.64% 15365972 -224,793(-2.16%)(-2.36%)
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 12,617,955 +461,447 -1.74% 3167977 -88,298(-2.97%)(-3.31%)
coreclr_tests.run.windows.x64.checked.mch 410,382,821 +3,285,229 -1.94% 85035916 -676,431(-2.26%)(-2.46%)
libraries.pmi.windows.x64.checked.mch 57,910,005 +2,456,153 -2.00% 14674111 -420,750(-3.09%)(-3.57%)
libraries_tests.run.windows.x64.Release.mch 350,075,962 +3,750,526 -2.46% 76683341 -772,586(-1.97%)(-2.06%)
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 153,338,819 +3,692,459 -1.64% 35431157 -672,354(-2.09%)(-2.31%)
realworld.run.windows.x64.checked.mch 11,691,157 +378,585 -1.94% 2834926 -74,551(-2.77%)(-3.00%)
smoke_tests.nativeaot.windows.x64.checked.mch 5,507,705 +47,721 -1.89% 1544803 -8,336(-3.58%)(-4.29%)
MinOpts (+527,206 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs Base Instruction Count Diff Instruction Count
benchmarks.run_pgo.windows.x64.checked.mch 18,993,001 +33,556 -2.73% 4248074 -9,369(-2.75%)(-2.75%)
coreclr_tests.run.windows.x64.checked.mch 280,800,379 +190,604 -0.29% 56348876 -28,091(-0.91%)(-1.15%)
libraries_tests.run.windows.x64.Release.mch 198,087,087 +267,130 -1.01% 41226776 -59,783(-1.05%)(-1.08%)
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 10,686,322 +35,916 -0.07% 2487522 -403(-0.13%)(-0.29%)
FullOpts (+15,054,637 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs Base Instruction Count Diff Instruction Count
benchmarks.run.windows.x64.checked.mch 12,264,751 +449,098 -1.75% 3080340 -86,249(-2.98%)(-3.32%)
benchmarks.run_pgo.windows.x64.checked.mch 46,981,295 +1,027,069 -2.64% 11117898 -215,424(-2.14%)(-2.34%)
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 12,617,253 +461,447 -1.74% 3167770 -88,298(-2.97%)(-3.31%)
coreclr_tests.run.windows.x64.checked.mch 129,582,442 +3,094,625 -2.22% 28687040 -648,340(-2.41%)(-2.59%)
libraries.pmi.windows.x64.checked.mch 57,797,220 +2,456,153 -2.00% 14653829 -420,750(-3.09%)(-3.57%)
libraries_tests.run.windows.x64.Release.mch 151,988,875 +3,483,396 -2.75% 35456565 -712,803(-2.13%)(-2.23%)
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 142,652,497 +3,656,543 -1.70% 32943635 -671,951(-2.11%)(-2.32%)
realworld.run.windows.x64.checked.mch 11,466,278 +378,585 -1.94% 2799763 -74,551(-2.77%)(-3.00%)
smoke_tests.nativeaot.windows.x64.checked.mch 5,506,552 +47,721 -1.89% 1544498 -8,336(-3.58%)(-4.29%)

Sample diff

-6 (-17.65%) : 17249.dasm - System.Globalization.CalendarData:LoadCalendarDataFromSystemCore(System.String,ushort):bool:this (FullOpts)
@@ -42,13 +42,10 @@
 
 G_M54418_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     rbp
-       push     r15
-       push     r14
-       push     r13
-       push     r12
-       push     rdi
-       push     rsi
-       push     rbx
+       push2p   r15, r14
+       push2p   r13, r12
+       push2p   rdi, rsi
+       pushp    rbx
        sub      rsp, 104
        lea      rbp, [rsp+0xA0]
        mov      rbx, rcx
@@ -56,7 +53,7 @@ G_M54418_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
        mov      rsi, rdx
        ; gcrRegs +[rsi]
        mov      edi, r8d
-						;; size=33 bbWeight=1 PerfScore 9.50
+						;; size=43 bbWeight=1 PerfScore 6.50
 G_M54418_IG02:        ; bbWeight=1, gcrefRegs=0048 {rbx rsi}, byrefRegs=0000 {}, byref
        lea      rcx, [rbp-0x78]
        call     CORINFO_HELP_INIT_PINVOKE_FRAME
@@ -75,18 +72,15 @@ G_M54418_IG02:        ; bbWeight=1, gcrefRegs=0048 {rbx rsi}, byrefRegs=0000 {},
 						;; size=42 bbWeight=1 PerfScore 8.00
 G_M54418_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      rsp, 104
-       pop      rbx
-       pop      rsi
-       pop      rdi
-       pop      r12
-       pop      r13
-       pop      r14
-       pop      r15
+       popp     rbx
+       pop2p    rsi, rdi
+       pop2p    r12, r13
+       pop2p    r14, r15
        pop      rbp
        ret      
-						;; size=17 bbWeight=1 PerfScore 5.25
+						;; size=27 bbWeight=1 PerfScore 5.25
 
-; Total bytes of code 92, prolog size 24, PerfScore 22.75, instruction count 34, allocated bytes for code 92 (MethodHash=8e392b6d) for method System.Globalization.CalendarData:LoadCalendarDataFromSystemCore(System.String,ushort):bool:this (FullOpts)
+; Total bytes of code 112, prolog size 34, PerfScore 19.75, instruction count 28, allocated bytes for code 112 (MethodHash=8e392b6d) for method System.Globalization.CalendarData:LoadCalendarDataFromSystemCore(System.String,ushort):bool:this (FullOpts)
 ; ============================================================
 
 Unwind Info:
@@ -94,17 +88,17 @@ Unwind Info:
   >>   End offset   : 0xd1ffab1e (not in unwind data)
   Version           : 1
   Flags             : 0x00
-  SizeOfProlog      : 0x10
+  SizeOfProlog      : 0x1A
   CountOfUnwindCodes: 9
   FrameRegister     : none (0)
   FrameOffset       : N/A (no FrameRegister) (Value=0)
   UnwindCodes       :
-    CodeOffset: 0x10 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 12 * 8 + 8 = 104 = 0x68
-    CodeOffset: 0x0C UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbx (3)
-    CodeOffset: 0x0B UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
-    CodeOffset: 0x0A UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
-    CodeOffset: 0x09 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r12 (12)
-    CodeOffset: 0x07 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r13 (13)
-    CodeOffset: 0x05 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r14 (14)
-    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r15 (15)
+    CodeOffset: 0x1A UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 12 * 8 + 8 = 104 = 0x68
+    CodeOffset: 0x16 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbx (3)
+    CodeOffset: 0x13 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
+    CodeOffset: 0x13 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
+    CodeOffset: 0x0D UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r12 (12)
+    CodeOffset: 0x0D UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r13 (13)
+    CodeOffset: 0x07 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r14 (14)
+    CodeOffset: 0x07 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r15 (15)
     CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)
-6 (-16.67%) : 7885.dasm - Microsoft.Extensions.Logging.LoggerMessage+<>c__DisplayClass12_0`2[int,System.__Canon]:b__1(Microsoft.Extensions.Logging.ILogger,int,System.__Canon,System.Exception):this (FullOpts)
@@ -17,11 +17,9 @@
 ; Lcl frame size = 32
 
 G_M15345_IG01:        ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
-       push     r14
-       push     rdi
-       push     rsi
-       push     rbp
-       push     rbx
+       pushp    rbp
+       push2p   r14, rdi
+       push2p   rsi, rbx
        sub      rsp, 32
        mov      rbx, rcx
        ; gcrRegs +[rbx]
@@ -30,7 +28,7 @@ G_M15345_IG01:        ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {
        mov      ebp, r8d
        mov      rdi, r9
        ; gcrRegs +[rdi]
-						;; size=22 bbWeight=1 PerfScore 6.25
+						;; size=31 bbWeight=1 PerfScore 4.25
 G_M15345_IG02:        ; bbWeight=1, gcrefRegs=00C8 {rbx rsi rdi}, byrefRegs=0000 {}, byref, isz
        mov      edx, dword ptr [rbx+0x10]
        mov      rcx, rsi
@@ -45,13 +43,11 @@ G_M15345_IG02:        ; bbWeight=1, gcrefRegs=00C8 {rbx rsi rdi}, byrefRegs=0000
 G_M15345_IG03:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc
        ; gcrRegs -[rbx rsi rdi]
        add      rsp, 32
-       pop      rbx
-       pop      rbp
-       pop      rsi
-       pop      rdi
-       pop      r14
+       pop2p    rbx, rsi
+       pop2p    rdi, r14
+       popp     rbp
        ret      
-						;; size=11 bbWeight=0.50 PerfScore 1.88
+						;; size=20 bbWeight=0.50 PerfScore 1.88
 G_M15345_IG04:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=00C8 {rbx rsi rdi}, byrefRegs=0000 {}, gcvars, byref, nogc
        ; gcrRegs +[rbx rsi rdi]
        mov      r14, gword ptr [rsp+0x70]
@@ -67,15 +63,13 @@ G_M15345_IG04:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=00C
 						;; size=22 bbWeight=0.50 PerfScore 1.50
 G_M15345_IG05:        ; bbWeight=0.50, epilog, nogc, extend
        add      rsp, 32
-       pop      rbx
-       pop      rbp
-       pop      rsi
-       pop      rdi
-       pop      r14
+       pop2p    rbx, rsi
+       pop2p    rdi, r14
+       popp     rbp
        tail.jmp [Microsoft.Extensions.Logging.LoggerMessage+<>c__DisplayClass12_0`2[int,System.__Canon]:<Define>g__Log|0(Microsoft.Extensions.Logging.ILogger,int,System.__Canon,System.Exception):this]
-						;; size=16 bbWeight=0.50 PerfScore 2.38
+						;; size=25 bbWeight=0.50 PerfScore 2.38
 
-; Total bytes of code 94, prolog size 10, PerfScore 18.75, instruction count 36, allocated bytes for code 94 (MethodHash=e9a4c40e) for method Microsoft.Extensions.Logging.LoggerMessage+<>c__DisplayClass12_0`2[int,System.__Canon]:<Define>b__1(Microsoft.Extensions.Logging.ILogger,int,System.__Canon,System.Exception):this (FullOpts)
+; Total bytes of code 121, prolog size 19, PerfScore 16.75, instruction count 30, allocated bytes for code 121 (MethodHash=e9a4c40e) for method Microsoft.Extensions.Logging.LoggerMessage+<>c__DisplayClass12_0`2[int,System.__Canon]:<Define>b__1(Microsoft.Extensions.Logging.ILogger,int,System.__Canon,System.Exception):this (FullOpts)
 ; ============================================================
 
 Unwind Info:
@@ -83,14 +77,14 @@ Unwind Info:
   >>   End offset   : 0xd1ffab1e (not in unwind data)
   Version           : 1
   Flags             : 0x00
-  SizeOfProlog      : 0x0A
+  SizeOfProlog      : 0x13
   CountOfUnwindCodes: 6
   FrameRegister     : none (0)
   FrameOffset       : N/A (no FrameRegister) (Value=0)
   UnwindCodes       :
-    CodeOffset: 0x0A UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
-    CodeOffset: 0x06 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbx (3)
-    CodeOffset: 0x05 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)
-    CodeOffset: 0x04 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
-    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
-    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r14 (14)
+    CodeOffset: 0x13 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
+    CodeOffset: 0x0F UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbx (3)
+    CodeOffset: 0x0F UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
+    CodeOffset: 0x09 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
+    CodeOffset: 0x09 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r14 (14)
+    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)
-9 (-16.67%) : 13852.dasm - System.Linq.Enumerable+ArrayWhereSelectIterator`2[System.__Canon,System.__Canon]:GetCount(bool,System.ReadOnlySpan`1[System.__Canon],System.Func`2[System.__Canon,bool],System.Func`2[System.__Canon,System.__Canon]):int (FullOpts)
@@ -28,19 +28,16 @@
 ; Lcl frame size = 32
 
 G_M19035_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-       push     r15
-       push     r14
-       push     r13
-       push     rdi
-       push     rsi
-       push     rbp
-       push     rbx
+       pushp    rbp
+       push2p   r15, r14
+       push2p   r13, rdi
+       push2p   rsi, rbx
        sub      rsp, 32
        mov      rbx, r9
        ; gcrRegs +[rbx]
        mov      rsi, gword ptr [rsp+0x80]
        ; gcrRegs +[rsi]
-						;; size=25 bbWeight=1 PerfScore 8.50
+						;; size=36 bbWeight=1 PerfScore 5.50
 G_M19035_IG02:        ; bbWeight=1, gcrefRegs=0048 {rbx rsi}, byrefRegs=0100 {r8}, byref, isz
        ; byrRegs +[r8]
        test     dl, dl
@@ -94,36 +91,30 @@ G_M19035_IG08:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byr
 						;; size=2 bbWeight=0.50 PerfScore 0.12
 G_M19035_IG09:        ; bbWeight=0.50, epilog, nogc, extend
        add      rsp, 32
-       pop      rbx
-       pop      rbp
-       pop      rsi
-       pop      rdi
-       pop      r13
-       pop      r14
-       pop      r15
+       pop2p    rbx, rsi
+       pop2p    rdi, r13
+       pop2p    r14, r15
+       popp     rbp
        ret      
-						;; size=15 bbWeight=0.50 PerfScore 2.38
+						;; size=26 bbWeight=0.50 PerfScore 2.38
 G_M19035_IG10:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
        mov      eax, -1
 						;; size=5 bbWeight=0.50 PerfScore 0.12
 G_M19035_IG11:        ; bbWeight=0.50, epilog, nogc, extend
        add      rsp, 32
-       pop      rbx
-       pop      rbp
-       pop      rsi
-       pop      rdi
-       pop      r13
-       pop      r14
-       pop      r15
+       pop2p    rbx, rsi
+       pop2p    rdi, r13
+       pop2p    r14, r15
+       popp     rbp
        ret      
-						;; size=15 bbWeight=0.50 PerfScore 2.38
+						;; size=26 bbWeight=0.50 PerfScore 2.38
 G_M19035_IG12:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
        call     CORINFO_HELP_OVERFLOW
        ; gcr arg pop 0
        int3     
 						;; size=6 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 131, prolog size 14, PerfScore 70.03, instruction count 54, allocated bytes for code 131 (MethodHash=2c44b5a4) for method System.Linq.Enumerable+ArrayWhereSelectIterator`2[System.__Canon,System.__Canon]:GetCount(bool,System.ReadOnlySpan`1[System.__Canon],System.Func`2[System.__Canon,bool],System.Func`2[System.__Canon,System.__Canon]):int (FullOpts)
+; Total bytes of code 164, prolog size 25, PerfScore 67.03, instruction count 45, allocated bytes for code 164 (MethodHash=2c44b5a4) for method System.Linq.Enumerable+ArrayWhereSelectIterator`2[System.__Canon,System.__Canon]:GetCount(bool,System.ReadOnlySpan`1[System.__Canon],System.Func`2[System.__Canon,bool],System.Func`2[System.__Canon,System.__Canon]):int (FullOpts)
 ; ============================================================
 
 Unwind Info:
@@ -131,16 +122,16 @@ Unwind Info:
   >>   End offset   : 0xd1ffab1e (not in unwind data)
   Version           : 1
   Flags             : 0x00
-  SizeOfProlog      : 0x0E
+  SizeOfProlog      : 0x19
   CountOfUnwindCodes: 8
   FrameRegister     : none (0)
   FrameOffset       : N/A (no FrameRegister) (Value=0)
   UnwindCodes       :
-    CodeOffset: 0x0E UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
-    CodeOffset: 0x0A UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbx (3)
-    CodeOffset: 0x09 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)
-    CodeOffset: 0x08 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
-    CodeOffset: 0x07 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
-    CodeOffset: 0x06 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r13 (13)
-    CodeOffset: 0x04 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r14 (14)
-    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r15 (15)
+    CodeOffset: 0x19 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 3 * 8 + 8 = 32 = 0x20
+    CodeOffset: 0x15 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbx (3)
+    CodeOffset: 0x15 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
+    CodeOffset: 0x0F UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
+    CodeOffset: 0x0F UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r13 (13)
+    CodeOffset: 0x09 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r14 (14)
+    CodeOffset: 0x09 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: r15 (15)
+    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)
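
The code-size growth in these diffs can be checked against the unwind info. In the first sample (LoadCalendarDataFromSystemCore), the differences between consecutive CodeOffset values imply encoded lengths of 6 bytes for each `push2p`, 3 bytes for `pushp rbx`, 1 byte for `push rbp`, and 4 bytes for `sub rsp, 104`. The sketch below accumulates such lengths into end offsets; `endOffsets` is a hypothetical helper, and the lengths are inferred from the dump rather than taken from an encoding reference.

```cpp
#include <vector>

// Returns the offset just past each instruction, given the encoded length
// of each instruction in bytes (in prolog order).
std::vector<unsigned> endOffsets(const std::vector<unsigned>& lengths)
{
    std::vector<unsigned> offsets;
    unsigned pos = 0;
    for (unsigned len : lengths)
    {
        pos += len;
        offsets.push_back(pos);
    }
    return offsets;
}
```

With the new prolog lengths `{1, 6, 6, 6, 3, 4}` the last offset is 0x1A, and with the old lengths `{1, 2, 2, 2, 2, 1, 1, 1, 4}` it is 0x10, matching the SizeOfProlog values before and after the change.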

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 27, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 27, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@risc-vv

risc-vv commented Jun 10, 2025

@dotnet/samsung Could you please take a look? These changes may be related to riscv64.

@risc-vv

risc-vv commented Jun 10, 2025

RISC-V Release-CLR-QEMU: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 35h 15min 56s 106ms
   REAL time: 36min 1s 374ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9083 / 9113 (99.67%)
=======================
      passed: 9083
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9712
VIRTUAL time: 10h 29min 58s 721ms
   REAL time: 42min 53s 335ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 283571 / 284631 (99.63%)
=======================
      passed: 283571
      failed: 1054
     skipped: 38
      killed: 6
------------------------
 TOTAL tests: 284669
VIRTUAL time: 29h 43min 24s 559ms
   REAL time: 1h 11min 41s 861ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-VF2: 305223 / 306944 (99.44%)
=======================
      passed: 305223
      failed: 1711
     skipped: 38
      killed: 10
------------------------
 TOTAL tests: 306982
VIRTUAL time: 20h 34min 21s 342ms
   REAL time: 2h 13min 5s 272ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 55928e12d732d9b877f2eb289899c3327ab54c6e
CI: 985b32219c9d1164ea1a09421e4f004672ee8c85
REPO: DeepakRajendrakumaran/runtime
BRANCH: push2
CONFIG: Release
LIB_CONFIG: Release

@risc-vv

risc-vv commented Jun 10, 2025

RISC-V Release-CLR-QEMU: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 35h 17min 30s 631ms
   REAL time: 36min 8s 592ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9083 / 9113 (99.67%)
=======================
      passed: 9083
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9712
VIRTUAL time: 10h 42min 44s 749ms
   REAL time: 43min 40s 937ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 283315 / 284395 (99.62%)
=======================
      passed: 283315
      failed: 1075
     skipped: 38
      killed: 5
------------------------
 TOTAL tests: 284433
VIRTUAL time: 29h 40min 46s 127ms
   REAL time: 1h 12min 16s 450ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-VF2: 299368 / 301100 (99.42%)
=======================
      passed: 299368
      failed: 1722
     skipped: 38
      killed: 10
------------------------
 TOTAL tests: 301138
VIRTUAL time: 20h 36min 26s 576ms
   REAL time: 2h 6min 6s 96ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 585402f9ce53e2b8439107d618faaa7beb7e5c83
CI: 8fec31ecd11a1e652234ca5056ed447368abd635
REPO: DeepakRajendrakumaran/runtime
BRANCH: push2
CONFIG: Release
LIB_CONFIG: Release

@DeepakRajendrakumaran DeepakRajendrakumaran force-pushed the push2 branch 3 times, most recently from 66245b1 to 68903ee Compare June 12, 2025 15:42
@risc-vv

risc-vv commented Jun 13, 2025

RISC-V Release-CLR-QEMU: 9084 / 9114 (99.67%)
=======================
      passed: 9084
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9713
VIRTUAL time: 35h 15min 0s 697ms
   REAL time: 36min 1s 908ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 598
      killed: 28
------------------------
 TOTAL tests: 9710
VIRTUAL time: 10h 54min 26s 26ms
   REAL time: 44min 42s 847ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 68903ee1b404ee05645075c19df6b0bbd85650b6
CI: ad349d0f0dd61055dacdfc98d8ad42963a159890
REPO: DeepakRajendrakumaran/runtime
BRANCH: push2
CONFIG: Release
LIB_CONFIG: Release

@risc-vv

risc-vv commented Jun 16, 2025

RISC-V Release-CLR-QEMU: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 35h 24min 9s 599ms
   REAL time: 36min 17s 85ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 10h 30min 43s 763ms
   REAL time: 42min 54s 850ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 260891 / 261953 (99.59%)
=======================
      passed: 260891
      failed: 1055
     skipped: 38
      killed: 7
------------------------
 TOTAL tests: 261991
VIRTUAL time: 28h 56min 56s 487ms
   REAL time: 1h 10min 16s 90ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-VF2: 299817 / 301551 (99.42%)
=======================
      passed: 299817
      failed: 1725
     skipped: 38
      killed: 9
------------------------
 TOTAL tests: 301589
VIRTUAL time: 20h 34min 38s 342ms
   REAL time: 2h 13min 37s 522ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 908cedbd2ca658404f1dee3e67c9e2dd6d09bff1
CI: ff51305a244b14413d8afd782debae409d7468b8
REPO: DeepakRajendrakumaran/runtime
BRANCH: push2
CONFIG: Release
LIB_CONFIG: Release

@risc-vv

risc-vv commented Jun 16, 2025

RISC-V Release-CLR-QEMU: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 35h 19min 32s 281ms
   REAL time: 36min 11s 159ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 283658 / 284736 (99.62%)
=======================
      passed: 283658
      failed: 1072
     skipped: 38
      killed: 6
------------------------
 TOTAL tests: 284774
VIRTUAL time: 29h 14min 48s 566ms
   REAL time: 1h 10min 39s 555ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 10h 38min 31s 107ms
   REAL time: 43min 26s 788ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: bc308aa4f12df2ae3ef31e47f06127fdbbb6c005
CI: ff51305a244b14413d8afd782debae409d7468b8
REPO: DeepakRajendrakumaran/runtime
BRANCH: push2
CONFIG: Release
LIB_CONFIG: Release

@risc-vv

risc-vv commented Jun 16, 2025

6b2c687 is being scheduled for building and testing

GIT: 6b2c687e251543c10e560dc72a5c03721faabb32
REPO: DeepakRajendrakumaran/runtime
BRANCH: push2

@risc-vv

risc-vv commented Jun 16, 2025

RISC-V Release-CLR-QEMU: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 35h 22min 17s 113ms
   REAL time: 36min 17s 533ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 6b2c687e251543c10e560dc72a5c03721faabb32
CI: ff51305a244b14413d8afd782debae409d7468b8
REPO: DeepakRajendrakumaran/runtime
BRANCH: push2
CONFIG: Release
LIB_CONFIG: Release

@DeepakRajendrakumaran DeepakRajendrakumaran changed the title from "Draft : Adding push2/pop2" to "Adding push2/pop2" Jun 16, 2025
@DeepakRajendrakumaran DeepakRajendrakumaran marked this pull request as ready for review June 16, 2025 18:53
@Copilot Copilot AI review requested due to automatic review settings June 16, 2025 18:53
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces support for the new PUSH2/POP2 instructions and PPX features, updating the unwind, instruction, and code generation infrastructure accordingly.

  • Adds new unwind functions (unwindPush2, unwindPush2Pop2CFI, etc.) and APX PPX handling for various architectures.
  • Updates configuration, instruction definitions, emitter logic, and unit tests to integrate the new instructions.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/coreclr/jit/unwindx86.cpp Added an empty unwindPush2() function; may need implementation to simulate push2.
src/coreclr/jit/unwindriscv64.cpp Added unwindPush2() with an unreached() call as a placeholder for RISCV64.
src/coreclr/jit/unwindloongarch64.cpp Added unwindPush2() with unreached() for LoongArch64.
src/coreclr/jit/unwindarm64.cpp Added unwindPush2() with unreached() for ARM64.
src/coreclr/jit/unwindamd64.cpp Implements unwindPush2() for AMD64, including a Windows-specific path using two pushes.
src/coreclr/jit/unwind.cpp Added unwindPush2Pop2CFI() as a placeholder for CFI unwind handling for push2/pop2.
src/coreclr/jit/jitconfigvalues.h Introduces the EnableApxPPX configuration integer.
src/coreclr/jit/instrsxarch.h Defines push2 and pop2 instructions with APX-related flags.
src/coreclr/jit/instr.h Adds new insOpts related to APX PPX hint.
src/coreclr/jit/instr.cpp Updates instruction display and generation routines to handle new APX/PPX context.
src/coreclr/jit/emitxarch.h & .cpp Adds helper functions for setting APX PPX context during instruction emission.
src/coreclr/jit/emit.h & .cpp Adds new member functions and the HasApxPpx() helper for APX PPX functionality.
src/coreclr/jit/compiler.h Declares new unwindPush2 and unwindPush2Windows functions.
src/coreclr/jit/codegenxarch.cpp Integrates push2/pop2 emission for unit tests and updates prolog/epilog register handling.
src/coreclr/jit/codegen.h Declares the instGen_Push2Pop2Ppx function and APX-specific pop routines.

@@ -50,6 +50,10 @@ void Compiler::unwindPush(regNumber reg)
{
}

void Compiler::unwindPush2(regNumber reg1, regNumber reg2)
{

Copilot AI Jun 16, 2025

The implementation of unwindPush2 in unwindx86.cpp is empty; consider simulating push2 by calling unwindPush() for each register to match the intent stated in the PR description.

Suggested change
-{
+{
+    unwindPush(reg1);
+    unwindPush(reg2);

@DeepakRajendrakumaran
Contributor Author

@kunalspathak @EgorBo This PR is ready for review

@risc-vv

risc-vv commented Jun 16, 2025

6b2c687 is being scheduled for building and testing

GIT: 6b2c687e251543c10e560dc72a5c03721faabb32
REPO: dotnet/runtime
BRANCH: main

@risc-vv

risc-vv commented Jun 16, 2025

RISC-V Release-FX-QEMU: 261199 / 262260 (99.60%)
=======================
      passed: 261199
      failed: 1052
     skipped: 39
      killed: 9
------------------------
 TOTAL tests: 262299
VIRTUAL time: 30h 35min 6s 851ms
   REAL time: 1h 10min 6s 981ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 6b2c687e251543c10e560dc72a5c03721faabb32
CI: ff51305a244b14413d8afd782debae409d7468b8
REPO: dotnet/runtime
BRANCH: main
CONFIG: Release
LIB_CONFIG: Release

@risc-vv

risc-vv commented Jun 16, 2025

RISC-V Release-CLR-QEMU: 9082 / 9112 (99.67%)
=======================
      passed: 9082
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9711
VIRTUAL time: 35h 19min 17s 575ms
   REAL time: 36min 9s 812ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-QEMU: 283661 / 284755 (99.62%)
=======================
      passed: 283661
      failed: 1087
     skipped: 39
      killed: 7
------------------------
 TOTAL tests: 284794
VIRTUAL time: 31h 13min 22s 447ms
   REAL time: 1h 10min 55s 87ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-CLR-VF2: 9083 / 9113 (99.67%)
=======================
      passed: 9083
      failed: 2
     skipped: 599
      killed: 28
------------------------
 TOTAL tests: 9712
VIRTUAL time: 10h 42min 10s 797ms
   REAL time: 43min 38s 61ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

RISC-V Release-FX-VF2: 307376 / 309101 (99.44%)
=======================
      passed: 307376
      failed: 1715
     skipped: 39
      killed: 10
------------------------
 TOTAL tests: 309140
VIRTUAL time: 20h 12min 29s 691ms
   REAL time: 2h 7min 2s 527ms
=======================

report.xml, report.md, failures.xml, testclr_details.tar.zst

Build information and commands

GIT: 6b2c687e251543c10e560dc72a5c03721faabb32
CI: ff51305a244b14413d8afd782debae409d7468b8
REPO: dotnet/runtime
BRANCH: main
CONFIG: Release
LIB_CONFIG: Release

{
switch (ins)
{
case INS_push:
Member

Can you explain why INS_push and INS_pop are also in this? I was expecting just push2/pop2.

Contributor Author

INS_push and INS_pop have a PPX feature that's part of APX:

PPX (push-pop acceleration): a PPX hint helps the processor track these new instructions internally and fast-forward register data between matching PUSH2 and POP2 instructions without going through memory. This also applies to PUSH/POP with REX2 encoding.

@@ -3782,6 +3782,20 @@ const IS_INFO emitter::emitGetSchedInfo(insFormat insFmt)
assert(!"Unsupported insFmt");
return IS_NONE;
}

bool emitter::HasApxPpx(instruction ins)
Member

can you add summary docs explaining what is the scenario in which this method is used and what inference we can make from it?

Contributor Author

Done

@@ -230,6 +230,11 @@ void Compiler::unwindPush(regNumber reg)
unreached(); // use one of the unwindSaveReg* functions instead.
}

void Compiler::unwindPush2(regNumber reg1, regNumber reg2)
Member

instead of touching other platform files, can you just add unwindPush2 for TARGET_AMD64 itself?

Contributor Author

Made the change

// This is not a funclet or an On-Stack Replacement.
assert((compiler->funCurrentFunc()->funKind == FuncKind::FUNC_ROOT) && !compiler->opts.IsOSR());
// PUSH2 doesn't work for ESP.
assert((rsPushRegs & RBM_SPBASE) == 0);
Member

any reason why this code is not in its own method genPushCalleeSavedRegistersFromMaskAPX similar to genPopCalleeSavedRegistersFromMaskAPX?

Contributor Author

Created a genPushCalleeSavedRegistersFromMaskAPX and moved code there

}

int index = 0;
if (regStack.Height() % 2 == 1)
Member

@kunalspathak kunalspathak Jun 17, 2025

if (regStack.Height() % 2 == 1)

I think the answer is NO, but just want to double check -

Does it matter (mainly perf-wise) whether push2/pop2 are used for the same pairs of registers, or is it OK to use a different pattern?

push2 rax, rbx
push rcx
...
..
pop2 rcx, rbx
pop rax

instead of

push2 rax, rbx
push rcx
...
..
pop rcx
pop2 rbx, rax

Contributor Author

In this case, it does matter:

  1. The data being pushed/popped by PUSH2/POP2 must be 16B-aligned on the stack. This is why I always emit a single `push` before any `push2`, and a single `pop` after any `pop2`, just before the function end. `alignReg` is used to determine which register is used for this.
  2. I use the PPX feature whenever I use push2/pop2, because the current guidance is to use push2/pop2 only with PPX:

A PUSH and its corresponding POP may be marked with a 1-bit Push-Pop Acceleration (PPX) hint to indicate that the POP reads the value written by the PUSH from the stack. The processor tracks these marked instructions internally and fast-forwards register data between matching PUSH and POP instructions, without going through memory or through the training loop of the Fast Store Forwarding Predictor (FSFP).

When applying the PPX hint, the compiler needs to make sure that it always marks both the PUSH and its matching POP (i.e., the POP which reads from the same stack memory address that the PUSH writes to). This balancing rule naturally applies to PUSH/POP sequences in function prologs/epilogs, respectively. It does not apply to standalone PUSH sequences, such as function argument pushes onto the stack. Such sequences should not be marked with the PPX hint.
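
The alignment and pairing rules in this reply can be sketched as a small planner. This is only an illustration, not the code in this PR: `StackOp`, `planProlog`, `planEpilog`, and `pairsAreAligned` are hypothetical names, and the sketch assumes RSP % 16 == 8 on function entry (the return address was just pushed), so one single `push` brings RSP to a 16B boundary before any `push2`. The `pop2` operand order is the reverse of the matching `push2`, following the order used in the review example above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration only; not the JIT's actual data structures.
struct StackOp
{
    std::string              mnemonic; // "push", "push2", "pop", or "pop2"
    std::vector<std::string> regs;     // one or two register names
};

// Plan the prolog pushes. Assumption: RSP % 16 == 8 on entry, so the first
// single push aligns RSP and every following push2 writes an aligned slot.
std::vector<StackOp> planProlog(const std::vector<std::string>& regs)
{
    std::vector<StackOp> ops;
    size_t               i = 0;
    if (!regs.empty())
    {
        ops.push_back({"push", {regs[i++]}}); // alignment push
    }
    for (; i + 1 < regs.size(); i += 2)
    {
        ops.push_back({"push2", {regs[i], regs[i + 1]}});
    }
    if (i < regs.size())
    {
        ops.push_back({"push", {regs[i]}}); // leftover register
    }
    return ops;
}

// Mirror the prolog so each pop/pop2 reads exactly the slot its push/push2
// wrote, which is the balancing rule the PPX hint relies on.
std::vector<StackOp> planEpilog(const std::vector<StackOp>& prolog)
{
    std::vector<StackOp> ops;
    for (auto it = prolog.rbegin(); it != prolog.rend(); ++it)
    {
        if (it->regs.size() == 2)
        {
            ops.push_back({"pop2", {it->regs[1], it->regs[0]}});
        }
        else
        {
            ops.push_back({"pop", it->regs});
        }
    }
    return ops;
}

// Simulate RSP across a sequence and check that every push2/pop2 touches a
// 16B-aligned slot.
bool pairsAreAligned(const std::vector<StackOp>& ops, uint64_t rsp)
{
    for (const StackOp& op : ops)
    {
        const bool     isPush = op.mnemonic.compare(0, 4, "push") == 0;
        const uint64_t bytes  = 8 * op.regs.size();
        if (isPush)
        {
            rsp -= bytes; // a push writes at the new RSP
        }
        if (op.regs.size() == 2 && rsp % 16 != 0)
        {
            return false;
        }
        if (!isPush)
        {
            rsp += bytes; // a pop reads at the old RSP
        }
    }
    return true;
}
```

For four registers `rbx, rsi, rdi, r12`, this plan yields `push rbx`, `push2 rsi, rdi`, `push r12` in the prolog and the mirrored `pop r12`, `pop2 rdi, rsi`, `pop rbx` in the epilog.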

}
else
{
assert((instOptions & INS_OPTS_APX_ppx_MASK) == 0);
Member

this is always true because it is inside else of the opposite condition.

Contributor Author

Removed else

@@ -20309,6 +20341,15 @@ emitter::insExecutionCharacteristics emitter::getInsExecutionCharacteristics(ins
}
break;

case INS_push2:
// TODO-XArch-APX: to be verified.
Member

can we verify this?

Contributor Author

I checked internally and this is the guidance I received. https://uops.info/table.html doesn't have these new instructions yet. That's why I marked it as a ToDo

// id - instruction descriptor
// instOptions - emit options
//
void SetApxPpxIfNeeded(instrDesc* id, insOpts instOptions)
Member

SetApxPpxIfNeeded

since there are just 4 instructions that will have this set, can we also assert on them to make sure that we do not call this method for different instructions or at least we hit the assert when we call it for newer instructions?

Contributor Author

assert(HasApxPpx(id->idIns())) asserts if it's not any of the 4 instructions. Did you mean a different check?

// Setting EVEX.W = 1 bit indicates a push-pop acceleration (PPX) hint
// The current recommendation is to use PUSH2/POP2 only with PPX hint
// So, it is used only in Epilog/Prolog code generation
if (id->idIsApxPpxContextSet())
Member

isn't it always true for push2 and pop2?

Contributor Author

`push2`/`pop2` are new instructions. PPX is a feature that `push`/`pop`/`push2`/`pop2` can use. So theoretically, we might have a `push2`/`pop2` without PPX enabled. We currently use `push2`/`pop2` only with PPX since that's the guidance (given for performance reasons), but that doesn't mean `push2`/`pop2` cannot be used without PPX.
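
The distinction drawn here, where PPX is an optional hint that only four instructions may carry, can be sketched as follows. This is a hedged, self-contained illustration rather than the emitter code in this PR: `Ins`, `InsOpts`, `InstrDesc`, `hasApxPpx`, and `setApxPpxIfNeeded` are hypothetical stand-ins loosely modelled on the names discussed in this review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the JIT's instruction enum and option flags.
enum class Ins
{
    push,
    pop,
    push2,
    pop2,
    mov
};

enum InsOpts : uint32_t
{
    OPTS_NONE = 0,
    OPTS_PPX  = 1u << 0 // caller requests the push-pop acceleration hint
};

struct InstrDesc
{
    Ins  ins;
    bool ppx; // whether the PPX hint will be encoded for this instruction
};

// True only for the four instructions that can carry a PPX hint.
inline bool hasApxPpx(Ins ins)
{
    switch (ins)
    {
        case Ins::push:
        case Ins::pop:
        case Ins::push2:
        case Ins::pop2:
            return true;
        default:
            return false;
    }
}

// Record the hint only when the caller asked for it. The instruction itself
// does not imply PPX, so push2/pop2 without the option stay unmarked.
inline void setApxPpxIfNeeded(InstrDesc& id, uint32_t opts)
{
    if ((opts & OPTS_PPX) != 0)
    {
        assert(hasApxPpx(id.ins)); // catches PPX requested on other instructions
        id.ppx = true;
    }
}
```

Keeping the hint as a separate flag means the encoder has to check it explicitly, even though the prolog/epilog path currently always sets it for `push2`/`pop2`.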

Member

@kunalspathak kunalspathak left a comment

Added some comments.

Superpmi result with/without PPX feature

Did you get a chance to check whether all of the perfscore improvement is coming just from the prolog and epilog? Also, is it because we have not set up the perf latency/throughput numbers accurately (I see a TODO there and added a comment about it), and hence the perfscore change is just due to the reduced number of instructions?

@DeepakRajendrakumaran
Contributor Author

Added some comments.

Superpmi result with/without PPX feature

Did you get a chance to check whether all of the perfscore improvement is coming just from the prolog and epilog? Also, is it because we have not set up the perf latency/throughput numbers accurately (I see a TODO there and added a comment about it), and hence the perfscore change is just due to the reduced number of instructions?

I did check, and it's entirely coming from the prolog/epilog (see examples here). The perfscore improvement is due to the reduced number of instructions.

The guidance I got was that TP/latency should be the same as for regular push and pop, as long as the PPX hint is used. I'll update the TP/latency numbers per the TODO once Agner/uops adds them, if they're different.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
3 participants