Skip to content

arm64: Change EQ/NE node to SETCC if the operand supports the zero flag #112235

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 19, 2025

Conversation

jonathandavies-arm
Copy link
Contributor

* Add one line tests for adds, ands, bics & subs
* Add tests for negs
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 6, 2025
@jonathandavies-arm
Copy link
Contributor Author

@a74nh @kunalspathak @dotnet/arm64-contrib

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 6, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Comment on lines 4465 to 4501
#ifdef TARGET_ARM64
// Check to see if we have a relopOp1 that supports setting flags e.g. AND & relopOp2 is 0.
// We can then set flags on relopOp1, remove relopOp2 & cmp and insert a SETCC node after Op1.
GenTreeOp* relop = cmp->AsOp();
GenTree* relopOp1 = relop->gtGetOp1();
GenTree* relopOp2 = relop->gtGetOp2();
GenTree* parent = relop->gtGetParent(nullptr);
if (parent && parent->OperIsSimple() && comp->opts.OptimizationEnabled() && relop->OperIs(GT_EQ, GT_NE) &&
relopOp2->IsIntegralConst(0) && relopOp1->SupportsSettingZeroFlag() && IsInvariantInRange(relopOp1, parent))
{
// Update relopOp1 flags and remove cmp & relopOp2
relopOp1->gtFlags |= GTF_SET_FLAGS;
relopOp1->SetUnusedValue();
BlockRange().Remove(relopOp1);
BlockRange().InsertBefore(relop, relopOp1);
BlockRange().Remove(relop);
BlockRange().Remove(relopOp2);

// Insert a SETCC node after Op1
GenCondition cmpCondition = GenCondition::FromRelop(relop);
GenTreeCC* setcc = comp->gtNewCC(GT_SETCC, TYP_INT, cmpCondition);
BlockRange().InsertAfter(relopOp1, setcc);

// Update the parent op with the new SETCC node
GenTreeOp* parentOp = parent->AsOp();
if (parentOp->gtOp1 == relop)
{
parentOp->gtOp1 = setcc;
}
else
{
assert(parentOp->gtOp2 == relop);
parentOp->gtOp2 = setcc;
}
return setcc;
}
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just repeating the optimization that happens in TryLowerConditionToFlagsNode. Can you factor the logic out, or change this logic to use that function instead?

Also, I suspect this change does not need to be arm64 specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between TryLowerConditionToFlagsNode and this code is inserting the relopOp1.
In TryLowerConditionToFlagsNode it has BlockRange().InsertBefore(parent, relopOp1);
This block of code has BlockRange().InsertBefore(relop, relopOp1);

Changing TryLowerConditionToFlagsNode to the before relop breaks when LowerJTrue() calls it.

A couple of options are:

  • Add bool arg to TryLowerConditionToFlagsNode to select which insert we do.
  • Investigate why LowerJTrue() doesn't work with the change.

Any other ideas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked into trying to use TryLowerConditionToFlagsNode in LowerCompare and to use it we will need to make 2 changes.

  1. Instead of inserting relopOp1 back in before the parent we need to insert it before relop This is so that if you have a parent node such as OR with operands EQ & EQ each with a ADD operand we can lower both EQ nodes to CSET and output ands in both cases.
    Before
N003 (  3,  3) [000002] -----+-----                    t2 = *  ADD       int    $100
N004 (  1,  2) [000003] -----+-----                    t3 =    CNS_INT   int    0 $40
                                                            /--*  t2     int    
                                                            +--*  t3     int    
N005 (  8,  6) [000004] -----+-----                    t4 = *  EQ        int    $101
N006 (  1,  1) [000005] -----+-----                    t5 =    LCL_VAR   int    V02 arg2         u:1 (last use) $82
N007 (  1,  1) [000006] -----+-----                    t6 =    LCL_VAR   int    V03 arg3         u:1 (last use) $83
                                                            /--*  t5     int    
                                                            +--*  t6     int    
N008 (  3,  3) [000007] -----+-----                    t7 = *  ADD       int    $102
N009 (  1,  2) [000008] -----+-----                    t8 =    CNS_INT   int    0 $40
                                                            /--*  t7     int    
                                                            +--*  t8     int    
N010 (  8,  6) [000009] -----+-----                    t9 = *  EQ        int    $103
                                                            /--*  t4     int    
                                                            +--*  t9     int    
N011 ( 17, 13) [000010] -----+-----                   t10 = *  OR        int    $104

After

N003 (  3,  3) [000002] -----+-----                    u2 = *  ADD       int    $100
               [000014] -----------                   t14 =    SETCC     int    cond=UEQ
N006 (  1,  1) [000005] -----+-----                    t5 =    LCL_VAR   int    V02 arg2         u:1 (last use) $82
N007 (  1,  1) [000006] -----+-----                    t6 =    LCL_VAR   int    V03 arg3         u:1 (last use) $83
                                                            /--*  t5     int    
                                                            +--*  t6     int    
N008 (  3,  3) [000007] -----+-----                    u7 = *  ADD       int    $102
               [000015] -----------                   t15 =    SETCC     int    cond=UEQ
                                                            /--*  t14    int    
                                                            +--*  t15    int    
N011 ( 17, 13) [000010] -----+-----                   t10 = *  OR        int    $104
  1. If we call TryLowerConditionToFlagsNode in LowerCompare and we fail these if conditions.
if (optimizing && relop->OperIs(GT_EQ, GT_NE) && relopOp2->IsIntegralConst(0) &&
            relopOp1->SupportsSettingZeroFlag() && IsInvariantInRange(relopOp1, parent))
        {

we don't want to go into the else case. It messes up the relop node and we get asserts.

I've made the changes that would be needed to TryLowerConditionToFlagsNode if we still want to call it here.
jonathandavies-arm@f34b9ba

I think I that we should either not call TryLowerConditionToFlagsNode in LowerCompare or extract out the if statement and 6 common lines into another function that can be called from TryLowerConditionToFlagsNode and LowerCompare

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check what the arm64 and x64 diffs look like if you completely move the optimization from TryLowerConditionToFlagsNode to LowerCompare? That is, what you had before except enabled for all platform, and then remove the corresponding optimization in TryLowerConditionToFlagsNode.
Given that the optimization here produces SETCC which is removable by TryLowerConditionToFlagsNode then it might not come with any significant regressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done these 2 changes in these commits jonathandavies-arm@3d4c241 & jonathandavies-arm@7858064.
I've tried using superpmi these commits as the base & diff but I get the following asserts in lir.cpp.

[09:03:08] ================ Logging to /home/jondav01/projects/dotnet/runtime/artifacts/spmi/superpmi.7.log
[09:03:08] Using JIT/EE Version from jiteeversionguid.h: cc0e7adf-e397-40b6-9d14-a7149815c991
[09:03:08] Found download cache directory "/home/jondav01/projects/dotnet/runtime/artifacts/spmi/mch/cc0e7adf-e397-40b6-9d14-a7149815c991.linux.arm64" and --force_download not set; skipping download
[09:03:08] SuperPMI replay
[09:03:08] JIT Path: /home/jondav01/projects/dotnet/runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/libclrjit.so
[09:03:08] Using MCH files:
[09:03:08]   /home/jondav01/projects/dotnet/runtime/artifacts/spmi/mch/cc0e7adf-e397-40b6-9d14-a7149815c991.linux.arm64/benchmarks.run.linux.arm64.checked.mch
[09:03:08] Running SuperPMI replay of /home/jondav01/projects/dotnet/runtime/artifacts/spmi/mch/cc0e7adf-e397-40b6-9d14-a7149815c991.linux.arm64/benchmarks.run.linux.arm64.checked.mch
[09:03:09] Compilation failures
[09:03:10] ============================== Assertions:
[09:03:10] node->IsUnusedValue() && "found an unmarked unused value" (/home/jondav01/projects/dotnet/runtime/src/coreclr/jit/lir.cpp:1706)
[09:03:10]   /home/jondav01/projects/dotnet/runtime/artifacts/spmi/mch/cc0e7adf-e397-40b6-9d14-a7149815c991.linux.arm64/benchmarks.run.linux.arm64.checked.mch # 82 : IL size 7
[09:03:10]   /home/jondav01/projects/dotnet/runtime/artifacts/spmi/mch/cc0e7adf-e397-40b6-9d14-a7149815c991.linux.arm64/benchmarks.run.linux.arm64.checked.mch # 29020 : IL size 7
[09:03:10]   /home/jondav01/projects/dotnet/runtime/artifacts/spmi/mch/cc0e7adf-e397-40b6-9d14-a7149815c991.linux.arm64/benchmarks.run.linux.arm64.checked.mch # 4461 : IL size 8
[09:03:10]   /home/jondav01/projects/dotnet/runtime/artifacts/spmi/mch/cc0e7adf-e397-40b6-9d14-a7149815c991.linux.arm64/benchmarks.run.linux.arm64.checked.mch # 29845 : IL size 9
...

Please can you check if the changes I made are what you were thinking of?
I think atm we could remove the ARM64 ifdef but removing that if case from TryLowerConditionToFlagsNode seems to be a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've remove the ARM64 ifdef and fixed a test.
I don't think we should remove that section from TryLowerConditionToFlagsNode. On x64 when it lowers a SELECT node it adds in a test to get the flags instead of setting the flags on a OR node (in the example I was looking at).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On x64 when it lowers a SELECT node it adds in a test to get the flags instead of setting the flags on a OR node (in the example I was looking at).

Can you share the jitdump of this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MethodImpl(MethodImplOptions.NoInlining)]
static void AndAndInner(int op1, int op2, int op3) {
    if (op1 % 2 == 0 && op2 % 2 == 0 && op3 % 2 == 0) {
        op1 = 5;
    }
    Consume(op1, op2, op3, 0);
}
G_M49537_IG01:  ;; offset=0x0000
       push     rax
						;; size=1 bbWeight=1 PerfScore 1.00
G_M49537_IG02:  ;; offset=0x0001
       mov      ecx, edi
       and      ecx, 1
       mov      eax, esi
       and      eax, 1
       or       ecx, eax
       mov      eax, edx
       and      eax, 1
       or       ecx, eax
       mov      eax, 5
       test     ecx, ecx
       cmove    edi, eax
       xor      ecx, ecx
       call     [TestAdd.Program:Consume(int,int,int,int)]
       nop      
						;; size=38 bbWeight=1 PerfScore 6.25
G_M49537_IG03:  ;; offset=0x0027
       add      rsp, 8
       ret      

diff_Add_AndAndInner_dump.txt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why the opt would not be kicking in for this case. I tried the following change locally and it definitely kicks in for me: https://github.com/dotnet/runtime/compare/main...jakobbotsch:move-opt?expand=1
I see negligible diffs for x64 with that change, so I would suggest switching to it. It also cleans up some stuff around the transformation (e.g. gtGetParent should not be used in LIR, instead we have BlockRange::TryGetUse) and moves it into OptimizeConstCompare where other similar optimizations are happening.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit removing my changes to lower.cpp and added yours.

parentOp->gtOp2 = setcc;
}
return setcc;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@jakobbotsch
Copy link
Member

Looks like there are some conflicts in the tests

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jakobbotsch jakobbotsch merged commit 610ece3 into dotnet:main Feb 19, 2025
118 checks passed
grendello pushed a commit to grendello/runtime that referenced this pull request Feb 19, 2025
…ag (dotnet#112235)

* arm64: Change EQ/NE node to SETCC if the operand supports the zero flag
* Add one line tests for adds, ands, bics & subs
* Add tests for negs
@jonathandavies-arm jonathandavies-arm deleted the upstream/ce-add branch February 20, 2025 07:55
@github-actions github-actions bot locked and limited conversation to collaborators Mar 22, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants