ARM64-SVE: Allow LCLs to be of type MASK #109286
Early draft version. Some TODOs and failures on other code I've run it on. The pass probably needs renaming / moving to a different file. |
Added some preliminary questions; I would love to see the asmdiffs for the code.
@kunalspathak I generally wait with reviews until the PR is out of draft, unless @a74nh wants me to review it now? |
It seems like this would be better implemented later by making use of SSA. This is currently doing multiple IR walks which is unnecessary, and it is also not correct since nothing is verifying that the |
These early comments are useful in helping shape the direction of the work.
If that makes finding all the uses (and the parents of the uses) easier, then happy to switch.
My theory was that for most use cases (outside of Fuzzlyn), when a variable of As a first version, simply making all LCLs store as A later PR could analyse all the uses and decide which is the dominating use and optimise that way. Maybe only turn on for AVX512 at this point. |
I'm more concerned about the correctness. For example, what happens for a case like
? If I'm reading the code right, strange things will happen that do not properly reflect the possibility of the "else" case. The transformation needs to behave correctly for cases like this. If you do it by making use of SSA, then you can easily know whether a use of a local that is going into |
I think I understand a bit better now after reading the code closer. For my case above, you will end up inserting I would probably suggest to shape it like this:
|
That's the typical practice, but in this case we wanted to seek early feedback before more progress is made (potentially in the wrong direction). |
Can this be combined with any of the existing walks of IR? How early or late can this be done?
At what point is it ideal to do this insertion and removal? |
Probably, but I don't see a need to: these intrinsics are going to exist in very very few functions the JIT encounters, so a separate pass is going to run very rarely now that @a74nh added
Not sure that there is any one point that is strictly speaking better than others. The current position after local morph seems fine to me. |
Latest version uses hashtable as suggested. Value in the table is just a Needs a lot more commenting and tidying. |
Change-Id: Ic18f575e266d63db38f95601d374441cdbf28b44
@jakobbotsch : Consider... {
Vector<ushort> vr19 = Sve.CompareLessThanOrEqual(vr12, vr18);
var vr20 = Sve.TestAnyTrue(vr19, vr19);
Runtime_109286.M7(s_14, vr20, ref s_12, vr23, vr19);
}
[method: MethodImpl(MethodImplOptions.NoInlining)]
private static void M7(C2 argThis, bool arg0, ref Vector128<int> arg1, bool[] arg2, Vector<ushort> arg3)
{
}
Using a
And the user:
From those two, what's the generic way to parse When I have that, I want to call |
The first arg to the visit function is the edge (
It sounds like this transformation cannot be done in a local way after all: it needs to know information from the operations of the reaching definitions. The simple way would be to ensure in pass 1 that everyone agrees on the type of mask-to-vector conversion that was dropped so that you can use it when reinserting the vector-to-mask conversions in the second pass. |
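The "everyone agrees on the type" idea from the comment above can be sketched roughly as follows. This is a minimal illustration, not the JIT's code: `ElemType`, `LocalConversionInfo`, `ConversionTable`, `RecordConversion` and `CanReinsert` are hypothetical names, and a `std::unordered_map` stands in for the JIT's own hash table.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical element types for the dropped conversion (assumption: the real
// JIT tracks a SIMD base type rather than this toy enum).
enum class ElemType { Unknown, UShort, Int };

struct LocalConversionInfo
{
    ElemType elemType = ElemType::Unknown; // type agreed on so far
    bool     invalid  = false;             // set once two accesses disagree
};

// Keyed by local variable number.
using ConversionTable = std::unordered_map<unsigned, LocalConversionInfo>;

// Pass 1: record the element type of a conversion seen on local `lclNum`.
void RecordConversion(ConversionTable& table, unsigned lclNum, ElemType seen)
{
    LocalConversionInfo& info = table[lclNum];
    if (info.elemType == ElemType::Unknown)
    {
        info.elemType = seen;
    }
    else if (info.elemType != seen)
    {
        info.invalid = true; // accesses disagree, so leave this local alone
    }
}

// Pass 2: conversions may be reinserted only when every access agreed.
bool CanReinsert(const ConversionTable& table, unsigned lclNum)
{
    auto it = table.find(lclNum);
    return (it != table.end()) && !it->second.invalid;
}
```

With this shape, the second pass never has to guess which conversion was dropped: it either finds a single agreed type for the local or skips the local entirely.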
Agreed. In the example I have a Is there a generic way of parsing a GenTree to look at all the args? |
....All review comments resolved again. However, looks like I have some Fuzzlyn issues. Will investigate. |
Looks like there is a problem:
public static void TestEntryPoint()
{
Vector<ushort> vr0 = Vector.Create<ushort>(65534);
bool x = Sve.TestLastTrue(vr0, vr0); // Use vr0 as a mask
Consume(x);
System.Console.WriteLine(vr0); // Use vr0 as a vector
}
Which is essentially:
public static void TestEntryPoint()
{
Vector<ushort> vr0 = Vector.Create<ushort>(65534);
bool x = Sve.TestLastTrue(ConvertVectorToMask(vr0), ConvertVectorToMask(vr0));
Consume(x);
System.Console.WriteLine(vr0);
}
With optimisations off, this outputs
With optimisations on, it optimises to:
public static void TestEntryPoint()
{
mask<ushort> vr0 = ConvertVectorToMask(Vector.Create<ushort>(65534));
bool x = Sve.TestLastTrue(vr0, vr0);
Consume(x);
System.Console.WriteLine(ConvertMaskToVector(vr0));
}
The I think what the pass needs to do is: if a vector is used as a vector (i.e. is used without a ConvertVectorToMask() attached), then it cannot be converted to store as a mask. The major use case this pass is trying to optimize is when a variable is created and then only used as a mask. This is still safe to do. The other cases can be handled in the same way as the suggestions for parameters: create a new store and update uses accordingly. Given that we expect uses switching between masks and vectors to be the uncommon case, I'm still happy to leave that as a later piece of work, probably in the spring. |
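The rule proposed above (only switch a local to mask storage when every use is a mask use) can be sketched as below. This is a simplified model under stated assumptions, not the pass itself: `UseKind` and `CanStoreAsMask` are hypothetical names, and each use is reduced to whether it has a ConvertVectorToMask() attached.

```cpp
#include <cassert>
#include <vector>

// Hypothetical classification of one use of a local variable.
enum class UseKind
{
    AsMask,  // the use is wrapped in ConvertVectorToMask()
    AsVector // the use reads the full vector value directly
};

// A local may be stored as a mask only if no use needs the full vector value.
// Assumption: a local with no uses at all is left untouched (nothing to gain).
bool CanStoreAsMask(const std::vector<UseKind>& uses)
{
    if (uses.empty())
    {
        return false;
    }
    for (UseKind use : uses)
    {
        if (use == UseKind::AsVector)
        {
            return false; // a vector use would observe a round-tripped value
        }
    }
    return true;
}
```

In the TestEntryPoint() example, vr0 has two mask uses (the TestLastTrue arguments) and one vector use (the WriteLine call), so this check rejects it and vr0 keeps its vector storage.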
Fixed to not convert if used as vector. Added additional testing. I'll keep all the old tests that don't convert because they'll be useful later. |
LGTM!
Looks like there is a conflict, can you resolve it?
@kunalspathak can you take another look? (You are still marked as changes requested) |
Some performance figures. This was run on a Graviton 3 with the vector length reduced to 128 bits, so figures will be a little different compared to Cobalt 100, but the magnitude of change should be similar. These routines are taken from my blog, which should be published this week; I can point to a source repo then.
|
Thank you for sharing this. I will take a look later today. |
Added a few comments.
MaskConversionsWeight defaultWeight;
MaskConversionsWeight* weight = weightsTable->LookupPointerOrAdd(lclOp->GetLclNum(), defaultWeight);

JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "var", lclOp->GetLclNum(),
nit:
- JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "var", lclOp->GetLclNum(),
+ JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "use", lclOp->GetLclNum(),
// cannot be stored as a mask as data will be lost.
// For all of these, conversions could be done by creating a new store of type mask.
// Then uses as mask could be converted to type mask and pointed to use the new
// definition. Tbe weighting would need updating to take this into account.
- // definition. Tbe weighting would need updating to take this into account.
+ // definition. The weighting would need updating to take this into account.
// Limitations:
//
// Local variables that are defined then immediately used just once may not be saved to a
// store. Here a convert to to vector will be used by a convert to mask. These instances will
- // store. Here a convert to to vector will be used by a convert to mask. These instances will
+ // store. Here a convert to vector will be used by a convert to mask. These instances will
// To optimize this, the pass searches every local variable definition (GT_STORE_LCL_VAR)
// and use (GT_LCL_VAR). A weighting is calculated and kept in a hash table - one entry
// for each lclvar number. The weighting contains two values. The first value is the count of
// of every convert node for the var, each instance multiplied by the number of instructions
- // of every convert node for the var, each instance multiplied by the number of instructions
+ // of every convert node for the var - each instance multiplied by the number of instructions
// for each lclvar number. The weighting contains two values. The first value is the count of
// of every convert node for the var, each instance multiplied by the number of instructions
// in the convert and the weighting of the block it exists in. The second value assumes the
// local var has been switched to store as a mask and performs the same count. The switch
- // local var has been switched to store as a mask and performs the same count. The switch
+ // local var has been switched to the mask during the store and performs the similar count calculation to see what the cost of loading these "converted mask" values is back as a vector.
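The two-value weighting described in the comment under review might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in the PR: `MaskConversionsWeight`, `currentCost`, `switchCost` and `invalid` are names that appear in this thread, while `AddConvertedAccess`, `AddUnconvertedAccess`, `ShouldSwitch` and the `std::unordered_map` table are hypothetical stand-ins.

```cpp
#include <cassert>
#include <unordered_map>

// Sketch of a per-local weighting. The field names follow the review
// discussion; the update helpers below are hypothetical.
struct MaskConversionsWeight
{
    double currentCost = 0.0; // cost of the converts present today
    double switchCost  = 0.0; // cost of the converts needed after switching to mask
    bool   invalid     = false;

    // An access that already has a convert attached costs something today
    // and is assumed to become free once the local is stored as a mask.
    void AddConvertedAccess(double blockWeight, double convertInstrs)
    {
        currentCost += blockWeight * convertInstrs;
    }

    // An access with no convert is free today but is assumed to need a new
    // convert once the local is stored as a mask.
    void AddUnconvertedAccess(double blockWeight, double convertInstrs)
    {
        switchCost += blockWeight * convertInstrs;
    }

    // Switch only when it is strictly cheaper and nothing invalidated the local.
    bool ShouldSwitch() const
    {
        return !invalid && (switchCost < currentCost);
    }
};

// One entry per local variable number (stand-in for the JIT's hash table).
using WeightsTable = std::unordered_map<unsigned, MaskConversionsWeight>;
```

Scaling each convert by the block weight means a conversion inside a hot loop counts for more than one in straight-line code, which matches the description of multiplying by "the weighting of the block it exists in".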
void InvalidateWeight()
{
    JITDUMP("Invalidating weight. ");
    invalid = true;
Should we also zero out currentCost and switchCost to make sure we don't accidentally use them for an invalidated weight?
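One way the suggestion above could look, as a minimal sketch: `WeightSketch` and `AddCosts` are hypothetical names used only for illustration, and the real structure in the PR may differ.

```cpp
#include <cassert>

// Hypothetical cut-down weighting structure, for illustration only.
struct WeightSketch
{
    double currentCost = 0.0;
    double switchCost  = 0.0;
    bool   invalid     = false;

    // Mark the weight unusable and clear both costs so a stale value
    // cannot be read by mistake later on.
    void InvalidateWeight()
    {
        invalid     = true;
        currentCost = 0.0;
        switchCost  = 0.0;
    }

    // Ignore further updates once invalidated, so the costs stay at zero.
    void AddCosts(double current, double switched)
    {
        if (invalid)
        {
            return;
        }
        currentCost += current;
        switchCost += switched;
    }
};
```

Zeroing is purely defensive here: any decision code should still check `invalid` first, since two zero costs on their own do not distinguish an invalidated local from one with no conversions.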
Comments are minor, so it's OK to do a follow-up PR for them.
These are now available at https://gitlab.arm.com/blogs/sveincsharp The full blog is https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/using-sve-in-csharp |
* ARM64-SVE: Allow LCLs to be of type MASK
* Trigger based on OptimizationDisabled
* Add compConvertMaskToVectorUsed check
* Initial version with hashtable
* Use double weighting method
* Move to lclmorph
* Better commenting
* Add TARGET_ARM64 check
* tidy
* Add DEBUG ifdefs
* Add mask check in lsrabuild
  Change-Id: Ic18f575e266d63db38f95601d374441cdbf28b44
* Add conversion for local var uses
* Add conversion for local stores
* Comment for simd types
* Use weight_t for weighting values
  Change-Id: I0d39d59a121682e8e583cccd710d13f2dd33bdc5
* Account for block weights and number of instructions in weighting.
* Fix asserts
* Split weighting into current and switch
* Add tests
* Allow tests to tier
* update comments
* additional comments
* abort walking once found
* add comment to LCLMasksCheckLCLVarVisitor
* LCLMasks -> LclMasks
* LCLStore -> LclStore
* LCLVar -> LclVar
* UpdateVarWeight -> UpdateUseWeight
* combine checks
* Catch conversions of both types
* better float printing
* dump the updated tree with conversion
* fix formatting
* Update explanation
* extra tests and remove asserts
* move pass to lclmasks.cpp
* Use vistors to iterate all nodes
* Rename visitors
* fix formatting
* Add checks for LCL_ADDR
* Add config option
* Single Fact for all the tests
* Only check statements where there is a local of type TYP_SIMD16/TYP_MASK
* Call fgSequenceLocals() once per statement
* Use JitSmallPrimitiveKeyFuncs
* allow for nullptr user
* Remove uses of gtBashToNOP()
* Use DISPTREE
* update asserts
* Remove searching of convertOp
* remove "method" in tests
* Use LookupPointerOrAdd()
* Remove Set() to table
* fix formatting
* Rename all functions
* Use TypeIs()
* invalidate if cached simdtype differs
* use constructor for weightsTable
  Change-Id: I884307955274dac90bf1b30c5dd44be1e2917d49
* check for address exposed variables
* Add allocator CMK_MaskConversionOpt
* Simplify ChangeMatchUse.csproj
* Hoist Sve check in testing
* Check for parameters and OSR locals
* rename tests
* Don't convert uses of masks as vectors
* fix formatting
Fixes #108241. Follow-on to the work started in #99608.
SVE performance is being heavily hampered due to unnecessary conversion between vector and mask.
Consider
Here the mask will be converted to a vector for storage in mask, then converted back into a mask for use in Compact. However, mask is a local variable, so there are no requirements on it outside local scope. In this case the conversions can simply be removed, and mask will be stored as a mask.

Benchmarking a simple loop which takes two arrays, multiplies each element, then sums across. With this PR, the performance of SVE improves a lot:
Output for test UseMaskAsMaskAndVector(): https://gist.github.com/a74nh/fc2111440c9fe17040508952d7ea5bd0