
[NVPTX] add combiner rule for final packed op in reduction #143943

Open

Prince781 wants to merge 5 commits into main from dev/pferro/combine-packed-reductions

Conversation

Prince781
Contributor

For vector reductions, the final result needs to be a scalar. The default
expansion will use packed ops (ex. fadd.f16x2) even for the final operation.
This requires a packed operation where one of the lanes is undef.

ex: lowering of vecreduce_fadd(V) where V = v4f16<a b c d>

v1: v2f16 = fadd reassoc v2f16<a b>, v2f16<c d>  (== <a+c     b+d>)
v2: v2f16 = vector_shuffle<1,u> v1, undef:v2f16  (== <b+d     undef>)
v3: v2f16 = fadd reassoc v2, v1                  (== <b+d+a+c undef>)
vR: f16   = extractelt v3, 0

We wish to replace vR, v3, and v2 with:

vR: f16 = fadd reassoc (extractelt v1, 1) (extractelt v1, 0)

...so that we get:

v1: v2f16 = fadd reassoc v2f16<a b>, v2f16<c d>  (== <a+c     b+d>)
s1: f16   = extractelt v1, 1
s2: f16   = extractelt v1, 0
vR: f16   = fadd reassoc s1, s2                  (== b+d+a+c)

So for this example, this rule will replace v3 and v2, returning a vector
with the result in lane 0 and an undef in lane 1, which we expect will be
folded into the extractelt in vR.
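
For reference, a minimal IR input that produces this pattern is the v4f16 analogue of the <8 x half> tests updated by this patch (the function name is illustrative only):

define half @reduce_fadd_v4f16(<4 x half> %v) {
  %res = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> %v)
  ret half %res
}

ExpandReductions turns this into a shufflevector/fadd ladder, and after type legalization its tail is exactly the packed fadd with one undef lane shown above.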

@llvmbot
Member

llvmbot commented Jun 12, 2025

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-nvptx

Author: Princeton Ferro (Prince781)

Patch is 27.57 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/143943.diff

2 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+108-6)
  • (modified) llvm/test/CodeGen/NVPTX/reduction-intrinsics.ll (+102-238)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index d6a134d9abafd..7e36e5b526932 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -843,6 +843,13 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   if (STI.allowFP16Math() || STI.hasBF16Math())
     setTargetDAGCombine(ISD::SETCC);
 
+  // Combine reduction operations on packed types (e.g. fadd.f16x2) with vector
+  // shuffles when one of their lanes is a no-op.
+  if (STI.allowFP16Math() || STI.hasBF16Math())
+    // already added above: FADD, ADD, AND
+    setTargetDAGCombine({ISD::FMUL, ISD::FMINIMUM, ISD::FMAXIMUM, ISD::UMIN,
+                         ISD::UMAX, ISD::SMIN, ISD::SMAX, ISD::OR, ISD::XOR});
+
   // Promote fp16 arithmetic if fp16 hardware isn't available or the
   // user passed --nvptx-no-fp16-math. The flag is useful because,
   // although sm_53+ GPUs have some sort of FP16 support in
@@ -5059,20 +5066,102 @@ static SDValue PerformStoreRetvalCombine(SDNode *N) {
   return PerformStoreCombineHelper(N, 2, 0);
 }
 
+/// For vector reductions, the final result needs to be a scalar. The default
+/// expansion will use packed ops (ex. fadd.f16x2) even for the final operation.
+/// This requires a packed operation where one of the lanes is undef.
+///
+/// ex: lowering of vecreduce_fadd(V) where V = v4f16<a b c d>
+///
+/// v1: v2f16 = fadd reassoc v2f16<a b>, v2f16<c d>  (== <a+c     b+d>)
+/// v2: v2f16 = vector_shuffle<1,u> v1, undef:v2f16  (== <b+d     undef>)
+/// v3: v2f16 = fadd reassoc v2, v1                  (== <b+d+a+c undef>)
+/// vR: f16   = extractelt v3, 1
+///
+/// We wish to replace vR, v3, and v2 with:
+/// vR: f16 = fadd reassoc (extractelt v1, 1) (extractelt v1, 0)
+///
+/// ...so that we get:
+/// v1: v2f16 = fadd reassoc v2f16<a b>, v2f16<c d>  (== <a+c     b+d>)
+/// s1: f16   = extractelt v1, 1
+/// s2: f16   = extractelt v1, 0
+/// vR: f16   = fadd reassoc s1, s2                  (== a+c+b+d)
+///
+/// So for this example, this rule will replace v3 and v2, returning a vector
+/// with the result in lane 0 and an undef in lane 1, which we expect will be
+/// folded into the extractelt in vR.
+static SDValue PerformPackedOpCombine(SDNode *N,
+                                      TargetLowering::DAGCombinerInfo &DCI) {
+  // Convert:
+  // (fop.x2 (vector_shuffle<i,u> A), B) -> ((fop A:i, B:0), undef)
+  // ...or...
+  // (fop.x2 (vector_shuffle<u,i> A), B) -> (undef, (fop A:i, B:1))
+  // ...where i is a valid index and u is poison.
+  const EVT VectorVT = N->getValueType(0);
+  if (!Isv2x16VT(VectorVT))
+    return SDValue();
+
+  SDLoc DL(N);
+
+  SDValue ShufOp = N->getOperand(0);
+  SDValue VectOp = N->getOperand(1);
+  bool Swapped = false;
+
+  // canonicalize shuffle to op0
+  if (VectOp.getOpcode() == ISD::VECTOR_SHUFFLE) {
+    std::swap(ShufOp, VectOp);
+    Swapped = true;
+  }
+
+  if (ShufOp.getOpcode() != ISD::VECTOR_SHUFFLE)
+    return SDValue();
+
+  auto *ShuffleOp = cast<ShuffleVectorSDNode>(ShufOp);
+  int LiveLane; // exclusively live lane
+  for (LiveLane = 0; LiveLane < 2; ++LiveLane) {
+    // check if the current lane is live and the other lane is dead
+    if (ShuffleOp->getMaskElt(LiveLane) != PoisonMaskElem &&
+        ShuffleOp->getMaskElt(!LiveLane) == PoisonMaskElem)
+      break;
+  }
+  if (LiveLane == 2)
+    return SDValue();
+
+  int ElementIdx = ShuffleOp->getMaskElt(LiveLane);
+  const EVT ScalarVT = VectorVT.getScalarType();
+  SDValue Lanes[2] = {};
+  for (auto [LaneID, LaneVal] : enumerate(Lanes)) {
+    if (LaneID == (unsigned)LiveLane) {
+      SDValue Operands[2] = {
+          DCI.DAG.getExtractVectorElt(DL, ScalarVT, ShufOp.getOperand(0),
+                                      ElementIdx),
+          DCI.DAG.getExtractVectorElt(DL, ScalarVT, VectOp, LiveLane)};
+      // preserve the order of operands
+      if (Swapped)
+        std::swap(Operands[0], Operands[1]);
+      LaneVal = DCI.DAG.getNode(N->getOpcode(), DL, ScalarVT, Operands);
+    } else {
+      LaneVal = DCI.DAG.getUNDEF(ScalarVT);
+    }
+  }
+  return DCI.DAG.getBuildVector(VectorVT, DL, Lanes);
+}
+
 /// PerformADDCombine - Target-specific dag combine xforms for ISD::ADD.
 ///
 static SDValue PerformADDCombine(SDNode *N,
                                  TargetLowering::DAGCombinerInfo &DCI,
                                  CodeGenOptLevel OptLevel) {
-  if (OptLevel == CodeGenOptLevel::None)
-    return SDValue();
-
   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
 
   // Skip non-integer, non-scalar case
   EVT VT = N0.getValueType();
-  if (VT.isVector() || VT != MVT::i32)
+  if (VT.isVector())
+    return PerformPackedOpCombine(N, DCI);
+  if (VT != MVT::i32)
+    return SDValue();
+
+  if (OptLevel == CodeGenOptLevel::None)
     return SDValue();
 
   // First try with the default operand order.
@@ -5092,7 +5181,10 @@ static SDValue PerformFADDCombine(SDNode *N,
   SDValue N1 = N->getOperand(1);
 
   EVT VT = N0.getValueType();
-  if (VT.isVector() || !(VT == MVT::f32 || VT == MVT::f64))
+  if (VT.isVector())
+    return PerformPackedOpCombine(N, DCI);
+
+  if (!(VT == MVT::f32 || VT == MVT::f64))
     return SDValue();
 
   // First try with the default operand order.
@@ -5195,7 +5287,7 @@ static SDValue PerformANDCombine(SDNode *N,
     DCI.CombineTo(N, Val, AddTo);
   }
 
-  return SDValue();
+  return PerformPackedOpCombine(N, DCI);
 }
 
 static SDValue PerformREMCombine(SDNode *N,
@@ -5676,6 +5768,16 @@ SDValue NVPTXTargetLowering::PerformDAGCombine(SDNode *N,
       return PerformADDCombine(N, DCI, OptLevel);
     case ISD::FADD:
       return PerformFADDCombine(N, DCI, OptLevel);
+    case ISD::FMUL:
+    case ISD::FMINNUM:
+    case ISD::FMAXIMUM:
+    case ISD::UMIN:
+    case ISD::UMAX:
+    case ISD::SMIN:
+    case ISD::SMAX:
+    case ISD::OR:
+    case ISD::XOR:
+      return PerformPackedOpCombine(N, DCI);
     case ISD::MUL:
       return PerformMULCombine(N, DCI, OptLevel);
     case ISD::SHL:
diff --git a/llvm/test/CodeGen/NVPTX/reduction-intrinsics.ll b/llvm/test/CodeGen/NVPTX/reduction-intrinsics.ll
index d5b451dad7bc3..ca03550bdefcd 100644
--- a/llvm/test/CodeGen/NVPTX/reduction-intrinsics.ll
+++ b/llvm/test/CodeGen/NVPTX/reduction-intrinsics.ll
@@ -5,10 +5,10 @@
 ; RUN: %if ptxas-12.8 %{ llc < %s -mcpu=sm_80 -mattr=+ptx70 -O0 \
 ; RUN:      -disable-post-ra -verify-machineinstrs \
 ; RUN: | %ptxas-verify -arch=sm_80 %}
-; RUN: llc < %s -mcpu=sm_100 -mattr=+ptx87 -O0 \
+; RUN: llc < %s -mcpu=sm_100 -mattr=+ptx86 -O0 \
 ; RUN:      -disable-post-ra -verify-machineinstrs \
 ; RUN: | FileCheck -check-prefixes CHECK,CHECK-SM100 %s
-; RUN: %if ptxas-12.8 %{ llc < %s -mcpu=sm_100 -mattr=+ptx87 -O0 \
+; RUN: %if ptxas-12.8 %{ llc < %s -mcpu=sm_100 -mattr=+ptx86 -O0 \
 ; RUN:      -disable-post-ra -verify-machineinstrs \
 ; RUN: | %ptxas-verify -arch=sm_100 %}
 target triple = "nvptx64-nvidia-cuda"
@@ -43,45 +43,22 @@ define half @reduce_fadd_half(<8 x half> %in) {
 }
 
 define half @reduce_fadd_half_reassoc(<8 x half> %in) {
-; CHECK-SM80-LABEL: reduce_fadd_half_reassoc(
-; CHECK-SM80:       {
-; CHECK-SM80-NEXT:    .reg .b16 %rs<6>;
-; CHECK-SM80-NEXT:    .reg .b32 %r<10>;
-; CHECK-SM80-EMPTY:
-; CHECK-SM80-NEXT:  // %bb.0:
-; CHECK-SM80-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_fadd_half_reassoc_param_0];
-; CHECK-SM80-NEXT:    add.rn.f16x2 %r5, %r2, %r4;
-; CHECK-SM80-NEXT:    add.rn.f16x2 %r6, %r1, %r3;
-; CHECK-SM80-NEXT:    add.rn.f16x2 %r7, %r6, %r5;
-; CHECK-SM80-NEXT:    { .reg .b16 tmp; mov.b32 {tmp, %rs1}, %r7; }
-; CHECK-SM80-NEXT:    // implicit-def: %rs2
-; CHECK-SM80-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM80-NEXT:    add.rn.f16x2 %r9, %r7, %r8;
-; CHECK-SM80-NEXT:    { .reg .b16 tmp; mov.b32 {%rs3, tmp}, %r9; }
-; CHECK-SM80-NEXT:    mov.b16 %rs4, 0x0000;
-; CHECK-SM80-NEXT:    add.rn.f16 %rs5, %rs3, %rs4;
-; CHECK-SM80-NEXT:    st.param.b16 [func_retval0], %rs5;
-; CHECK-SM80-NEXT:    ret;
-;
-; CHECK-SM100-LABEL: reduce_fadd_half_reassoc(
-; CHECK-SM100:       {
-; CHECK-SM100-NEXT:    .reg .b16 %rs<6>;
-; CHECK-SM100-NEXT:    .reg .b32 %r<10>;
-; CHECK-SM100-EMPTY:
-; CHECK-SM100-NEXT:  // %bb.0:
-; CHECK-SM100-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_fadd_half_reassoc_param_0];
-; CHECK-SM100-NEXT:    add.rn.f16x2 %r5, %r2, %r4;
-; CHECK-SM100-NEXT:    add.rn.f16x2 %r6, %r1, %r3;
-; CHECK-SM100-NEXT:    add.rn.f16x2 %r7, %r6, %r5;
-; CHECK-SM100-NEXT:    mov.b32 {_, %rs1}, %r7;
-; CHECK-SM100-NEXT:    // implicit-def: %rs2
-; CHECK-SM100-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM100-NEXT:    add.rn.f16x2 %r9, %r7, %r8;
-; CHECK-SM100-NEXT:    mov.b32 {%rs3, _}, %r9;
-; CHECK-SM100-NEXT:    mov.b16 %rs4, 0x0000;
-; CHECK-SM100-NEXT:    add.rn.f16 %rs5, %rs3, %rs4;
-; CHECK-SM100-NEXT:    st.param.b16 [func_retval0], %rs5;
-; CHECK-SM100-NEXT:    ret;
+; CHECK-LABEL: reduce_fadd_half_reassoc(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<6>;
+; CHECK-NEXT:    .reg .b32 %r<8>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_fadd_half_reassoc_param_0];
+; CHECK-NEXT:    add.rn.f16x2 %r5, %r2, %r4;
+; CHECK-NEXT:    add.rn.f16x2 %r6, %r1, %r3;
+; CHECK-NEXT:    add.rn.f16x2 %r7, %r6, %r5;
+; CHECK-NEXT:    mov.b32 {%rs1, %rs2}, %r7;
+; CHECK-NEXT:    add.rn.f16 %rs3, %rs1, %rs2;
+; CHECK-NEXT:    mov.b16 %rs4, 0x0000;
+; CHECK-NEXT:    add.rn.f16 %rs5, %rs3, %rs4;
+; CHECK-NEXT:    st.param.b16 [func_retval0], %rs5;
+; CHECK-NEXT:    ret;
   %res = call reassoc half @llvm.vector.reduce.fadd(half 0.0, <8 x half> %in)
   ret half %res
 }
@@ -205,41 +182,20 @@ define half @reduce_fmul_half(<8 x half> %in) {
 }
 
 define half @reduce_fmul_half_reassoc(<8 x half> %in) {
-; CHECK-SM80-LABEL: reduce_fmul_half_reassoc(
-; CHECK-SM80:       {
-; CHECK-SM80-NEXT:    .reg .b16 %rs<4>;
-; CHECK-SM80-NEXT:    .reg .b32 %r<10>;
-; CHECK-SM80-EMPTY:
-; CHECK-SM80-NEXT:  // %bb.0:
-; CHECK-SM80-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_fmul_half_reassoc_param_0];
-; CHECK-SM80-NEXT:    mul.rn.f16x2 %r5, %r2, %r4;
-; CHECK-SM80-NEXT:    mul.rn.f16x2 %r6, %r1, %r3;
-; CHECK-SM80-NEXT:    mul.rn.f16x2 %r7, %r6, %r5;
-; CHECK-SM80-NEXT:    { .reg .b16 tmp; mov.b32 {tmp, %rs1}, %r7; }
-; CHECK-SM80-NEXT:    // implicit-def: %rs2
-; CHECK-SM80-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM80-NEXT:    mul.rn.f16x2 %r9, %r7, %r8;
-; CHECK-SM80-NEXT:    { .reg .b16 tmp; mov.b32 {%rs3, tmp}, %r9; }
-; CHECK-SM80-NEXT:    st.param.b16 [func_retval0], %rs3;
-; CHECK-SM80-NEXT:    ret;
-;
-; CHECK-SM100-LABEL: reduce_fmul_half_reassoc(
-; CHECK-SM100:       {
-; CHECK-SM100-NEXT:    .reg .b16 %rs<4>;
-; CHECK-SM100-NEXT:    .reg .b32 %r<10>;
-; CHECK-SM100-EMPTY:
-; CHECK-SM100-NEXT:  // %bb.0:
-; CHECK-SM100-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_fmul_half_reassoc_param_0];
-; CHECK-SM100-NEXT:    mul.rn.f16x2 %r5, %r2, %r4;
-; CHECK-SM100-NEXT:    mul.rn.f16x2 %r6, %r1, %r3;
-; CHECK-SM100-NEXT:    mul.rn.f16x2 %r7, %r6, %r5;
-; CHECK-SM100-NEXT:    mov.b32 {_, %rs1}, %r7;
-; CHECK-SM100-NEXT:    // implicit-def: %rs2
-; CHECK-SM100-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM100-NEXT:    mul.rn.f16x2 %r9, %r7, %r8;
-; CHECK-SM100-NEXT:    mov.b32 {%rs3, _}, %r9;
-; CHECK-SM100-NEXT:    st.param.b16 [func_retval0], %rs3;
-; CHECK-SM100-NEXT:    ret;
+; CHECK-LABEL: reduce_fmul_half_reassoc(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<4>;
+; CHECK-NEXT:    .reg .b32 %r<8>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_fmul_half_reassoc_param_0];
+; CHECK-NEXT:    mul.rn.f16x2 %r5, %r2, %r4;
+; CHECK-NEXT:    mul.rn.f16x2 %r6, %r1, %r3;
+; CHECK-NEXT:    mul.rn.f16x2 %r7, %r6, %r5;
+; CHECK-NEXT:    mov.b32 {%rs1, %rs2}, %r7;
+; CHECK-NEXT:    mul.rn.f16 %rs3, %rs1, %rs2;
+; CHECK-NEXT:    st.param.b16 [func_retval0], %rs3;
+; CHECK-NEXT:    ret;
   %res = call reassoc half @llvm.vector.reduce.fmul(half 1.0, <8 x half> %in)
   ret half %res
 }
@@ -401,7 +357,6 @@ define half @reduce_fmax_half_reassoc_nonpow2(<7 x half> %in) {
 
 ; Check straight-line reduction.
 define float @reduce_fmax_float(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmax_float(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -423,7 +378,6 @@ define float @reduce_fmax_float(<8 x float> %in) {
 }
 
 define float @reduce_fmax_float_reassoc(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmax_float_reassoc(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -445,7 +399,6 @@ define float @reduce_fmax_float_reassoc(<8 x float> %in) {
 }
 
 define float @reduce_fmax_float_reassoc_nonpow2(<7 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmax_float_reassoc_nonpow2(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<14>;
@@ -533,7 +486,6 @@ define half @reduce_fmin_half_reassoc_nonpow2(<7 x half> %in) {
 
 ; Check straight-line reduction.
 define float @reduce_fmin_float(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmin_float(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -555,7 +507,6 @@ define float @reduce_fmin_float(<8 x float> %in) {
 }
 
 define float @reduce_fmin_float_reassoc(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmin_float_reassoc(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -665,7 +616,6 @@ define half @reduce_fmaximum_half_reassoc_nonpow2(<7 x half> %in) {
 
 ; Check straight-line reduction.
 define float @reduce_fmaximum_float(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmaximum_float(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -687,7 +637,6 @@ define float @reduce_fmaximum_float(<8 x float> %in) {
 }
 
 define float @reduce_fmaximum_float_reassoc(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmaximum_float_reassoc(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -709,7 +658,6 @@ define float @reduce_fmaximum_float_reassoc(<8 x float> %in) {
 }
 
 define float @reduce_fmaximum_float_reassoc_nonpow2(<7 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fmaximum_float_reassoc_nonpow2(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<14>;
@@ -797,7 +745,6 @@ define half @reduce_fminimum_half_reassoc_nonpow2(<7 x half> %in) {
 
 ; Check straight-line reduction.
 define float @reduce_fminimum_float(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fminimum_float(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -819,7 +766,6 @@ define float @reduce_fminimum_float(<8 x float> %in) {
 }
 
 define float @reduce_fminimum_float_reassoc(<8 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fminimum_float_reassoc(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<16>;
@@ -841,7 +787,6 @@ define float @reduce_fminimum_float_reassoc(<8 x float> %in) {
 }
 
 define float @reduce_fminimum_float_reassoc_nonpow2(<7 x float> %in) {
-;
 ; CHECK-LABEL: reduce_fminimum_float_reassoc_nonpow2(
 ; CHECK:       {
 ; CHECK-NEXT:    .reg .b32 %r<14>;
@@ -888,20 +833,17 @@ define i16 @reduce_add_i16(<8 x i16> %in) {
 ; CHECK-SM100-LABEL: reduce_add_i16(
 ; CHECK-SM100:       {
 ; CHECK-SM100-NEXT:    .reg .b16 %rs<4>;
-; CHECK-SM100-NEXT:    .reg .b32 %r<11>;
+; CHECK-SM100-NEXT:    .reg .b32 %r<9>;
 ; CHECK-SM100-EMPTY:
 ; CHECK-SM100-NEXT:  // %bb.0:
 ; CHECK-SM100-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_add_i16_param_0];
 ; CHECK-SM100-NEXT:    add.s16x2 %r5, %r2, %r4;
 ; CHECK-SM100-NEXT:    add.s16x2 %r6, %r1, %r3;
 ; CHECK-SM100-NEXT:    add.s16x2 %r7, %r6, %r5;
-; CHECK-SM100-NEXT:    mov.b32 {_, %rs1}, %r7;
-; CHECK-SM100-NEXT:    // implicit-def: %rs2
-; CHECK-SM100-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM100-NEXT:    add.s16x2 %r9, %r7, %r8;
-; CHECK-SM100-NEXT:    mov.b32 {%rs3, _}, %r9;
-; CHECK-SM100-NEXT:    cvt.u32.u16 %r10, %rs3;
-; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r10;
+; CHECK-SM100-NEXT:    mov.b32 {%rs1, %rs2}, %r7;
+; CHECK-SM100-NEXT:    add.s16 %rs3, %rs1, %rs2;
+; CHECK-SM100-NEXT:    cvt.u32.u16 %r8, %rs3;
+; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r8;
 ; CHECK-SM100-NEXT:    ret;
   %res = call i16 @llvm.vector.reduce.add(<8 x i16> %in)
   ret i16 %res
@@ -1114,20 +1056,17 @@ define i16 @reduce_umax_i16(<8 x i16> %in) {
 ; CHECK-SM100-LABEL: reduce_umax_i16(
 ; CHECK-SM100:       {
 ; CHECK-SM100-NEXT:    .reg .b16 %rs<4>;
-; CHECK-SM100-NEXT:    .reg .b32 %r<11>;
+; CHECK-SM100-NEXT:    .reg .b32 %r<9>;
 ; CHECK-SM100-EMPTY:
 ; CHECK-SM100-NEXT:  // %bb.0:
 ; CHECK-SM100-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_umax_i16_param_0];
 ; CHECK-SM100-NEXT:    max.u16x2 %r5, %r2, %r4;
 ; CHECK-SM100-NEXT:    max.u16x2 %r6, %r1, %r3;
 ; CHECK-SM100-NEXT:    max.u16x2 %r7, %r6, %r5;
-; CHECK-SM100-NEXT:    mov.b32 {_, %rs1}, %r7;
-; CHECK-SM100-NEXT:    // implicit-def: %rs2
-; CHECK-SM100-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM100-NEXT:    max.u16x2 %r9, %r7, %r8;
-; CHECK-SM100-NEXT:    mov.b32 {%rs3, _}, %r9;
-; CHECK-SM100-NEXT:    cvt.u32.u16 %r10, %rs3;
-; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r10;
+; CHECK-SM100-NEXT:    mov.b32 {%rs1, %rs2}, %r7;
+; CHECK-SM100-NEXT:    max.u16 %rs3, %rs1, %rs2;
+; CHECK-SM100-NEXT:    cvt.u32.u16 %r8, %rs3;
+; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r8;
 ; CHECK-SM100-NEXT:    ret;
   %res = call i16 @llvm.vector.reduce.umax(<8 x i16> %in)
   ret i16 %res
@@ -1248,20 +1187,17 @@ define i16 @reduce_umin_i16(<8 x i16> %in) {
 ; CHECK-SM100-LABEL: reduce_umin_i16(
 ; CHECK-SM100:       {
 ; CHECK-SM100-NEXT:    .reg .b16 %rs<4>;
-; CHECK-SM100-NEXT:    .reg .b32 %r<11>;
+; CHECK-SM100-NEXT:    .reg .b32 %r<9>;
 ; CHECK-SM100-EMPTY:
 ; CHECK-SM100-NEXT:  // %bb.0:
 ; CHECK-SM100-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_umin_i16_param_0];
 ; CHECK-SM100-NEXT:    min.u16x2 %r5, %r2, %r4;
 ; CHECK-SM100-NEXT:    min.u16x2 %r6, %r1, %r3;
 ; CHECK-SM100-NEXT:    min.u16x2 %r7, %r6, %r5;
-; CHECK-SM100-NEXT:    mov.b32 {_, %rs1}, %r7;
-; CHECK-SM100-NEXT:    // implicit-def: %rs2
-; CHECK-SM100-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM100-NEXT:    min.u16x2 %r9, %r7, %r8;
-; CHECK-SM100-NEXT:    mov.b32 {%rs3, _}, %r9;
-; CHECK-SM100-NEXT:    cvt.u32.u16 %r10, %rs3;
-; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r10;
+; CHECK-SM100-NEXT:    mov.b32 {%rs1, %rs2}, %r7;
+; CHECK-SM100-NEXT:    min.u16 %rs3, %rs1, %rs2;
+; CHECK-SM100-NEXT:    cvt.u32.u16 %r8, %rs3;
+; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r8;
 ; CHECK-SM100-NEXT:    ret;
   %res = call i16 @llvm.vector.reduce.umin(<8 x i16> %in)
   ret i16 %res
@@ -1382,20 +1318,17 @@ define i16 @reduce_smax_i16(<8 x i16> %in) {
 ; CHECK-SM100-LABEL: reduce_smax_i16(
 ; CHECK-SM100:       {
 ; CHECK-SM100-NEXT:    .reg .b16 %rs<4>;
-; CHECK-SM100-NEXT:    .reg .b32 %r<11>;
+; CHECK-SM100-NEXT:    .reg .b32 %r<9>;
 ; CHECK-SM100-EMPTY:
 ; CHECK-SM100-NEXT:  // %bb.0:
 ; CHECK-SM100-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_smax_i16_param_0];
 ; CHECK-SM100-NEXT:    max.s16x2 %r5, %r2, %r4;
 ; CHECK-SM100-NEXT:    max.s16x2 %r6, %r1, %r3;
 ; CHECK-SM100-NEXT:    max.s16x2 %r7, %r6, %r5;
-; CHECK-SM100-NEXT:    mov.b32 {_, %rs1}, %r7;
-; CHECK-SM100-NEXT:    // implicit-def: %rs2
-; CHECK-SM100-NEXT:    mov.b32 %r8, {%rs1, %rs2};
-; CHECK-SM100-NEXT:    max.s16x2 %r9, %r7, %r8;
-; CHECK-SM100-NEXT:    mov.b32 {%rs3, _}, %r9;
-; CHECK-SM100-NEXT:    cvt.u32.u16 %r10, %rs3;
-; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r10;
+; CHECK-SM100-NEXT:    mov.b32 {%rs1, %rs2}, %r7;
+; CHECK-SM100-NEXT:    max.s16 %rs3, %rs1, %rs2;
+; CHECK-SM100-NEXT:    cvt.u32.u16 %r8, %rs3;
+; CHECK-SM100-NEXT:    st.param.b32 [func_retval0], %r8;
 ; CHECK-SM100-NEXT:    ret;
   %res = call i16 @llvm.vector.reduce.smax(<8 x i16> %in)
   ret i16 %res
@@ -1516,20 +1449,17 @@ define i16 @reduce_smin_i16(<8 x i16> %in) {
 ; CHECK-SM100-LABEL: reduce_smin_i16(
 ; CHECK-SM100:       {
 ; CHECK-SM100-NEXT:    .reg .b16 %rs<4>;
-; CHECK-SM100-NEXT:    .reg .b32 %r<11>;
+; CHECK-SM100-NEXT:    .reg .b32 %r<9>;
 ; CHECK-SM100-EMPTY:
 ; CHECK-SM100-NEXT:  // %bb.0:
 ; CHECK-SM100-NEXT:    ld.param.v4.b32 {%r1, %r2, %r3, %r4}, [reduce_smin_i16_param_0];
 ; CHECK-SM100-NEXT:    min.s16x2 %r5, %r2, %r4;
 ; CHECK-SM100-NEXT:    min.s16x2 %r6, %r1, %r3;
 ; CHECK-SM100-NEXT:    min.s16x2 %r7, %r6, %r5;
-; CHECK-SM100-NEXT:    mov.b32 {_, %rs1}, %r7;
-; CHECK-SM100-NEXT:    // implicit-def: %rs2
-; CHECK-SM100-NEXT:    mov.b32...
[truncated]

Contributor

@justinfargnoli left a comment

This seems to be a fix for an inefficient lowering of vecreduce_fadd.

NVPTX isn't custom lowering vecreduce_fadd, correct? If that is the case, then this is an issue other targets could/do face.

Given that, why not improve the vecreduce_fadd lowering or implement this in DAGCombiner?

@AlexMaclean
Member

I too wonder if this can be made more general. Is there a reason this optimization needs to be NVPTX-specific? Maybe we could update the lowering of these operations based on whether vector shuffle is legal. Or maybe we could add a DAG combiner rule that looks for BUILD_VECTOR <- [BIN_OP] <- extract_vector_elt and converts it to extract_vector_elt <- [BIN_OP].
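
A rough sketch of that generic rewrite, in the same DAG notation as the traces later in this thread (node numbers are illustrative, lifted from the legalized DAG for reduce_smin_i16; t67 is a hypothetical new node):

  before:
    t64: i16   = extract_vector_elt t58, Constant:i64<1>
    t66: v2i16 = BUILD_VECTOR t64, undef:i16
    t61: v2i16 = smin t58, t66
    t20: i16   = extract_vector_elt t61, Constant:i64<0>

  after:
    t64: i16 = extract_vector_elt t58, Constant:i64<1>
    t67: i16 = extract_vector_elt t58, Constant:i64<0>
    t20: i16 = smin t64, t67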

@Prince781
Contributor Author

The default expansion of reduction intrinsics happens in ExpandReductions, which is an IR-level pass. The pass generates a shuffle reduction sequence, which uses shufflevectors to iteratively fold the vector. This sequence is not a problem unless the target holds the vector type in a single register. I will investigate whether we can do this fixup in DAGCombiner when the final shuffle is legal, as @AlexMaclean suggests.

Here's a trace of the old behavior on this IR:
target triple = "nvptx64-nvidia-cuda"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"

define i16 @reduce_smin_i16(<8 x i16> %in) {
  %res = call i16 @llvm.vector.reduce.smin(<8 x i16> %in)
  ret i16 %res
}
% llc < reduction-intrinsics.ll -debug-only=isel -mcpu=sm_100 -print-before=expand-reductions -print-after=expand-reductions
*** IR Dump Before Expand reduction intrinsics (expand-reductions) ***
define i16 @reduce_smin_i16(<8 x i16> %in) #0 {
  %res = call i16 @llvm.vector.reduce.smin.v8i16(<8 x i16> %in)
  ret i16 %res
}
*** IR Dump After Expand reduction intrinsics (expand-reductions) ***
define i16 @reduce_smin_i16(<8 x i16> %in) #0 {
  %rdx.shuf = shufflevector <8 x i16> %in, <8 x i16> poison, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
  %rdx.minmax = call <8 x i16> @llvm.smin.v8i16(<8 x i16> %in, <8 x i16> %rdx.shuf)
  %rdx.shuf1 = shufflevector <8 x i16> %rdx.minmax, <8 x i16> poison, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %rdx.minmax2 = call <8 x i16> @llvm.smin.v8i16(<8 x i16> %rdx.minmax, <8 x i16> %rdx.shuf1)
  %rdx.shuf3 = shufflevector <8 x i16> %rdx.minmax2, <8 x i16> poison, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %rdx.minmax4 = call <8 x i16> @llvm.smin.v8i16(<8 x i16> %rdx.minmax2, <8 x i16> %rdx.shuf3)
  %1 = extractelement <8 x i16> %rdx.minmax4, i32 0
  ret i16 %1
}
	FastISel is disabled



=== reduce_smin_i16

Initial selection DAG: %bb.0 'reduce_smin_i16:'
SelectionDAG has 24 nodes:
  t0: ch,glue = EntryToken
  t4: v8i16,ch = load<(dereferenceable invariant load (s128), addrspace 101)> t0, TargetExternalSymbol:i64'reduce_smin_i16_param_0', undef:i64
  t5: v2i16 = extract_subvector t4, Constant:i64<0>
  t7: v2i16 = extract_subvector t4, Constant:i64<2>
  t9: v2i16 = extract_subvector t4, Constant:i64<4>
  t11: v2i16 = extract_subvector t4, Constant:i64<6>
    t13: v8i16 = vector_shuffle<4,5,6,7,u,u,u,u> t4, poison:v8i16
  t14: v8i16 = smin t4, t13
    t15: v8i16 = vector_shuffle<2,3,u,u,u,u,u,u> t14, poison:v8i16
  t16: v8i16 = smin t14, t15
            t17: v8i16 = vector_shuffle<1,u,u,u,u,u,u,u> t16, poison:v8i16
          t18: v8i16 = smin t16, t17
        t20: i16 = extract_vector_elt t18, Constant:i64<0>
      t21: i32 = zero_extend t20
    t22: ch = NVPTXISD::StoreRetval<(store (s32), align 2)> t0, Constant:i32<0>, t21
  t23: ch = NVPTXISD::RET_GLUE t22



Optimized lowered selection DAG: %bb.0 'reduce_smin_i16:'
SelectionDAG has 17 nodes:
  t0: ch,glue = EntryToken
  t4: v8i16,ch = load<(dereferenceable invariant load (s128), addrspace 101)> t0, TargetExternalSymbol:i64'reduce_smin_i16_param_0', undef:i64
    t13: v8i16 = vector_shuffle<4,5,6,7,u,u,u,u> t4, poison:v8i16
  t14: v8i16 = smin t4, t13
    t15: v8i16 = vector_shuffle<2,3,u,u,u,u,u,u> t14, poison:v8i16
  t16: v8i16 = smin t14, t15
            t17: v8i16 = vector_shuffle<1,u,u,u,u,u,u,u> t16, poison:v8i16
          t18: v8i16 = smin t16, t17
        t20: i16 = extract_vector_elt t18, Constant:i64<0>
      t21: i32 = zero_extend t20
    t22: ch = NVPTXISD::StoreRetval<(store (s32), align 2)> t0, Constant:i32<0>, t21
  t23: ch = NVPTXISD::RET_GLUE t22



Type-legalized selection DAG: %bb.0 'reduce_smin_i16:'
SelectionDAG has 16 nodes:
  t0: ch,glue = EntryToken
  t26: v2i16,v2i16,v2i16,v2i16,ch = NVPTXISD::LoadV4<(dereferenceable invariant load (s128), addrspace 101)> t0, TargetExternalSymbol:i64'reduce_smin_i16_param_0', undef:i64, Constant:i64<0>
    t56: v2i16 = smin t26, t26:2
    t57: v2i16 = smin t26:1, t26:3
  t58: v2i16 = smin t56, t57
            t60: v2i16 = vector_shuffle<1,u> t58, undef:v2i16
          t61: v2i16 = smin t58, t60
        t20: i16 = extract_vector_elt t61, Constant:i64<0>
      t21: i32 = zero_extend t20
    t22: ch = NVPTXISD::StoreRetval<(store (s32), align 2)> t0, Constant:i32<0>, t21
  t23: ch = NVPTXISD::RET_GLUE t22



Optimized type-legalized selection DAG: %bb.0 'reduce_smin_i16:'
SelectionDAG has 16 nodes:
  t0: ch,glue = EntryToken
  t26: v2i16,v2i16,v2i16,v2i16,ch = NVPTXISD::LoadV4<(dereferenceable invariant load (s128), addrspace 101)> t0, TargetExternalSymbol:i64'reduce_smin_i16_param_0', undef:i64, Constant:i64<0>
    t56: v2i16 = smin t26, t26:2
    t57: v2i16 = smin t26:1, t26:3
  t58: v2i16 = smin t56, t57
            t60: v2i16 = vector_shuffle<1,u> t58, undef:v2i16
          t61: v2i16 = smin t58, t60
        t20: i16 = extract_vector_elt t61, Constant:i64<0>
      t21: i32 = zero_extend t20
    t22: ch = NVPTXISD::StoreRetval<(store (s32), align 2)> t0, Constant:i32<0>, t21
  t23: ch = NVPTXISD::RET_GLUE t22



Legalized selection DAG: %bb.0 'reduce_smin_i16:'
SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
  t26: v2i16,v2i16,v2i16,v2i16,ch = NVPTXISD::LoadV4<(dereferenceable invariant load (s128), addrspace 101)> t0, TargetExternalSymbol:i64'reduce_smin_i16_param_0', undef:i64, Constant:i64<0>
    t56: v2i16 = smin t26, t26:2
    t57: v2i16 = smin t26:1, t26:3
  t58: v2i16 = smin t56, t57
              t64: i16 = extract_vector_elt t58, Constant:i64<1>
            t66: v2i16 = BUILD_VECTOR t64, undef:i16
          t61: v2i16 = smin t58, t66
        t20: i16 = extract_vector_elt t61, Constant:i64<0>
      t21: i32 = zero_extend t20
    t22: ch = NVPTXISD::StoreRetval<(store (s32), align 2)> t0, Constant:i32<0>, t21
  t23: ch = NVPTXISD::RET_GLUE t22



Optimized legalized selection DAG: %bb.0 'reduce_smin_i16:'
SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
  t26: v2i16,v2i16,v2i16,v2i16,ch = NVPTXISD::LoadV4<(dereferenceable invariant load (s128), addrspace 101)> t0, TargetExternalSymbol:i64'reduce_smin_i16_param_0', undef:i64, Constant:i64<0>
    t56: v2i16 = smin t26, t26:2
    t57: v2i16 = smin t26:1, t26:3
  t58: v2i16 = smin t56, t57
              t64: i16 = extract_vector_elt t58, Constant:i64<1>
            t66: v2i16 = BUILD_VECTOR t64, undef:i16
          t61: v2i16 = smin t58, t66
        t20: i16 = extract_vector_elt t61, Constant:i64<0>
      t21: i32 = zero_extend t20
    t22: ch = NVPTXISD::StoreRetval<(store (s32), align 2)> t0, Constant:i32<0>, t21
  t23: ch = NVPTXISD::RET_GLUE t22


===== Instruction selection begins: %bb.0 ''

ISEL: Starting selection on root node: t23: ch = NVPTXISD::RET_GLUE t22
ISEL: Starting pattern match
  Morphed node: t23: ch = Return t22
ISEL: Match complete!

ISEL: Starting selection on root node: t22: ch = NVPTXISD::StoreRetval<(store (s32), align 2)> t0, Constant:i32<0>, t21

ISEL: Starting selection on root node: t21: i32 = zero_extend t20
ISEL: Starting pattern match
  Initial Opcode index to 93850
  TypeSwitch[i32] from 93851 to 93868
  Skipped scope entry (due to false predicate) at index 93870, continuing at 93884
  Morphed node: t21: i32 = CVT_u32_u16 t20, TargetConstant:i32<0>
ISEL: Match complete!

ISEL: Starting selection on root node: t20: i16 = extract_vector_elt t61, Constant:i64<0>
ISEL: Starting pattern match
  Initial Opcode index to 90513
  TypeSwitch[i16] from 90518 to 90521
  Morphed node: t20: i16 = I32toI16L_Sink t61
ISEL: Match complete!

ISEL: Starting selection on root node: t61: v2i16 = smin t58, t66
ISEL: Starting pattern match
  Initial Opcode index to 91831
  Match failed at index 91836
  Continuing at 91872
  Match failed at index 91873
  Continuing at 91882
  Match failed at index 91883
  Continuing at 91891
  Match failed at index 91892
  Continuing at 91900
  Morphed node: t61: v2i16 = SMIN16x2 t58, t66
ISEL: Match complete!

ISEL: Starting selection on root node: t66: v2i16 = BUILD_VECTOR t64, undef:i16
ISEL: Starting pattern match
  Initial Opcode index to 99457
  Morphed node: t66: v2i16 = V2I16toI32 t64, undef:i16
ISEL: Match complete!

ISEL: Starting selection on root node: t64: i16 = extract_vector_elt t58, Constant:i64<1>
ISEL: Starting pattern match
  Initial Opcode index to 90513
  Skipped scope entry (due to false predicate) at index 90516, continuing at 90580
  TypeSwitch[i16] from 90583 to 90586
  Morphed node: t64: i16 = I32toI16H_Sink t58
ISEL: Match complete!

ISEL: Starting selection on root node: t58: v2i16 = smin t56, t57
ISEL: Starting pattern match
  Initial Opcode index to 91831
  Match failed at index 91836
  Continuing at 91872
  Match failed at index 91873
  Continuing at 91882
  Match failed at index 91883
  Continuing at 91891
  Match failed at index 91892
  Continuing at 91900
  Morphed node: t58: v2i16 = SMIN16x2 t56, t57
ISEL: Match complete!

ISEL: Starting selection on root node: t56: v2i16 = smin t26, t26:2
ISEL: Starting pattern match
  Initial Opcode index to 91831
  Match failed at index 91836
  Continuing at 91872
  Match failed at index 91873
  Continuing at 91882
  Match failed at index 91883
  Continuing at 91891
  Match failed at index 91892
  Continuing at 91900
  Morphed node: t56: v2i16 = SMIN16x2 t26, t26:2
ISEL: Match complete!

ISEL: Starting selection on root node: t57: v2i16 = smin t26:1, t26:3
ISEL: Starting pattern match
  Initial Opcode index to 91831
  Match failed at index 91836
  Continuing at 91872
  Match failed at index 91873
  Continuing at 91882
  Match failed at index 91883
  Continuing at 91891
  Match failed at index 91892
  Continuing at 91900
  Morphed node: t57: v2i16 = SMIN16x2 t26:1, t26:3
ISEL: Match complete!

ISEL: Starting selection on root node: t26: v2i16,v2i16,v2i16,v2i16,ch = NVPTXISD::LoadV4<(dereferenceable invariant load (s128), addrspace 101)> t0, TargetExternalSymbol:i64'reduce_smin_i16_param_0', undef:i64, Constant:i64<0>

ISEL: Starting selection on root node: t65: i16 = undef

ISEL: Starting selection on root node: t1: i64 = TargetExternalSymbol'reduce_smin_i16_param_0'

ISEL: Starting selection on root node: t0: ch,glue = EntryToken

===== Instruction selection ends:

Selected selection DAG: %bb.0 'reduce_smin_i16:'
SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
    t56: v2i16 = SMIN16x2 t72, t72:2
    t57: v2i16 = SMIN16x2 t72:1, t72:3
  t58: v2i16 = SMIN16x2 t56, t57
  t72: v2i16,v2i16,v2i16,v2i16,ch = LDV_i32_v4<Mem:(dereferenceable invariant load (s128), addrspace 101)> TargetConstant:i32<0>, TargetConstant:i32<0>, TargetConstant:i32<101>, TargetConstant:i32<3>, TargetConstant:i32<32>, TargetExternalSymbol:i64'reduce_smin_i16_param_0', TargetConstant:i32<0>, t0
              t64: i16 = I32toI16H_Sink t58
            t66: v2i16 = V2I16toI32 t64, IMPLICIT_DEF:i16
          t61: v2i16 = SMIN16x2 t58, t66
        t20: i16 = I32toI16L_Sink t61
      t21: i32 = CVT_u32_u16 t20, TargetConstant:i32<0>
    t68: ch = StoreRetvalI32<Mem:(store (s32), align 2)> t21, TargetConstant:i32<0>, t0
  t23: ch = Return t68


Total amount of phi nodes to update: 0
*** MachineFunction at end of ISel ***
# Machine code for function reduce_smin_i16: IsSSA, TracksLiveness

bb.0 (%ir-block.0):
  %0:int32regs, %1:int32regs, %2:int32regs, %3:int32regs = LDV_i32_v4 0, 0, 101, 3, 32, &reduce_smin_i16_param_0, 0 :: (dereferenceable invariant load (s128), addrspace 101)
  %4:int32regs = SMIN16x2 killed %1:int32regs, killed %3:int32regs
  %5:int32regs = SMIN16x2 killed %0:int32regs, killed %2:int32regs
  %6:int32regs = SMIN16x2 killed %5:int32regs, killed %4:int32regs
  %7:int16regs = I32toI16H_Sink %6:int32regs
  %9:int16regs = IMPLICIT_DEF
  %8:int32regs = V2I16toI32 killed %7:int16regs, killed %9:int16regs
  %10:int32regs = SMIN16x2 %6:int32regs, killed %8:int32regs
  %11:int16regs = I32toI16L_Sink killed %10:int32regs
  %12:int32regs = CVT_u32_u16 killed %11:int16regs, 0
  StoreRetvalI32 killed %12:int32regs, 0 :: (store (s32), align 2)
  Return

# End machine code for function reduce_smin_i16.

//
// Generated by LLVM NVPTX Back-End
//

.version 8.6
.target sm_100
.address_size 64

	// .globl	reduce_smin_i16         // -- Begin function reduce_smin_i16
                                        // @reduce_smin_i16
.visible .func  (.param .b32 func_retval0) reduce_smin_i16(
	.param .align 16 .b8 reduce_smin_i16_param_0[16]
)
{
	.reg .b16 	%rs<4>;
	.reg .b32 	%r<11>;

// %bb.0:
	ld.param.v4.b32 	{%r1, %r2, %r3, %r4}, [reduce_smin_i16_param_0];
	min.s16x2 	%r5, %r2, %r4;
	min.s16x2 	%r6, %r1, %r3;
	min.s16x2 	%r7, %r6, %r5;
	mov.b32 	{_, %rs1}, %r7;
	mov.b32 	%r8, {%rs1, %rs2};
	min.s16x2 	%r9, %r7, %r8;
	mov.b32 	{%rs3, _}, %r9;
	cvt.u32.u16 	%r10, %rs3;
	st.param.b32 	[func_retval0], %r10;
	ret;
                                        // -- End function
}

@Prince781
Contributor Author

I ported this to DAGCombiner, but the optimization only winds up benefiting NVPTX. What do you guys think? Should we keep it in NVPTXISelLowering?

@Prince781 force-pushed the dev/pferro/combine-packed-reductions branch from f879526 to 371a3ad on June 14, 2025 at 01:06
@llvmbot added the llvm:SelectionDAG label on Jun 14, 2025
@Prince781
Contributor Author

Some targets like x86 seem to handle this sequence just fine.


github-actions bot commented Jun 14, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

The result of a reduction needs to be a scalar. When expressed as a
sequence of vector ops, the final op needs to reduce two (or more)
lanes belonging to the same register. On some targets a shuffle is not
available, and this results in two extra movs to set up another vector
register with the lanes swapped. This pattern is now handled better by
turning it into a scalar op.
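
For instance, in the updated reduce_smin_i16 test the tail of the reduction goes from roughly

	min.s16x2 	%r7, %r6, %r5;
	mov.b32 	{_, %rs1}, %r7;
	mov.b32 	%r8, {%rs1, %rs2};
	min.s16x2 	%r9, %r7, %r8;
	mov.b32 	{%rs3, _}, %r9;

to a single scalar op (mirroring the updated umin/smax CHECK lines, since the smin hunk is truncated above):

	min.s16x2 	%r7, %r6, %r5;
	mov.b32 	{%rs1, %rs2}, %r7;
	min.s16 	%rs3, %rs1, %rs2;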
@Prince781 force-pushed the dev/pferro/combine-packed-reductions branch from 371a3ad to dfa1a0f on June 18, 2025 at 16:54
@Prince781
Contributor Author

So I've decided to merge this back into NVPTXISelLowering.

Labels
backend:NVPTX, llvm:SelectionDAG

4 participants