Improve x86 arrayset #7763
base: master
Conversation
Compared f1aabc1 to f2fb9c3
FYI @vijaysun-omr
Excellent results. I would request review from @0xdaryl and optionally from @hzongaro and @BradleyWood for this.
Can
TR::Register *xmmValueReg = cg->allocateRegister(TR_FPR);

generateRegRegInstruction(TR::InstOpCode::MOVDRegReg4, node, xmmValueReg, valueReg, cg);
generateRegRegInstruction(TR::InstOpCode::VPBROADCASTBRegReg, node, xmmValueReg, xmmValueReg, cg);
This instruction belongs to AVX-512.
Edit: there are two versions of this instruction; the one with GPR input requires AVX-512. On AVX-512 hardware you could use that version to eliminate the movd instruction.
Thanks, I'll pursue this.
PROPERTY0(IA32OpProp_ModifiesTarget | IA32OpProp_SourceRegisterInModRM),
PROPERTY1(IA32OpProp1_XMMSource | IA32OpProp1_XMMTarget | IA32OpProp1_SIMDSingleSource),
FEATURES(X86FeatureProp_VEX128Supported | X86FeatureProp_VEX128RequiresAVX2 | X86FeatureProp_VEX256Supported | X86FeatureProp_VEX256RequiresAVX2 |
X86FeatureProp_EVEX128Supported | X86FeatureProp_EVEX128RequiresAVX512F | X86FeatureProp_EVEX128RequiresAVX512VL |
VPBROADCASTB does not support VEX encoding. It is an AVX-512 instruction with no backward compatibility for AVX/AVX2. Edit: there are two versions of this instruction; one takes GPR input.
VPBROADCASTB EVEX requires these flags:
- X86FeatureProp_EVEX128RequiresAVX512BW
- X86FeatureProp_EVEX256RequiresAVX512BW
- X86FeatureProp_EVEX512RequiresAVX512BW
Addressed. I've also added the w, d, and q versions of the instruction and updated the props for those as well.
TR::Register *scratch1Reg = cg->allocateRegister(TR_GPR);
TR::Register *scratch2Reg = cg->allocateRegister(TR_GPR);
TR::Register *xmmValueReg = cg->allocateRegister(TR_FPR);
Needs to be TR_VRF
Fixed, thanks.
generateLabelInstruction(TR::InstOpCode::label, node, loopLabel, cg);
// 64-byte per iteration aligned store loop
generateMemRegInstruction(TR::InstOpCode::MOVAPSMemReg, node, generateX86MemoryReference(addressReg, 0, cg), xmmValueReg, cg);
generateMemRegInstruction(TR::InstOpCode::MOVAPSMemReg, node, generateX86MemoryReference(addressReg, 16, cg), xmmValueReg, cg);
movaps requires alignment (to the size of the vector length); is the memory reference always aligned?
Yes, there's a preceding movups that will set any unaligned bytes, and then the pointer is bumped to the next 16-byte aligned address. (The previous patch left the pointer at the same address if it was already aligned, so the same 16 bytes would be set by movups and then again by the first movaps in the loop. I've fixed that to always move the pointer to the next aligned address, whether or not it was already aligned.)
TR::Register *xmmValueReg = cg->allocateRegister(TR_FPR);

generateRegRegInstruction(TR::InstOpCode::MOVDRegReg4, node, xmmValueReg, valueReg, cg);
generateRegRegInstruction(TR::InstOpCode::VPBROADCASTBRegReg, node, xmmValueReg, xmmValueReg, cg);
I was working on a helper for broadcast operations, you can contribute it here if you'd like, and modify as needed to use the vpbroadcastb instruction when possible.
Thanks, I've cherry-picked it, added a patch on top, and made use of it here. Let me know if it works for you.
@@ -615,6 +615,15 @@ INSTRUCTION(VBROADCASTSDZmmXmm, vbroadcastsd,
PROPERTY0(IA32OpProp_ModifiesTarget | IA32OpProp_SourceRegisterInModRM),
PROPERTY1(IA32OpProp1_XMMSource | IA32OpProp1_XMMTarget | IA32OpProp1_SIMDSingleSource),
FEATURES(X86FeatureProp_EVEX512Supported | X86FeatureProp_EVEX512RequiresAVX512F)),
INSTRUCTION(VPBROADCASTBRegReg, vpbroadcastb,
BINARY(VEX_L128, VEX_vNONE, PREFIX_66, REX__, ESCAPE_0F38, 0x78, 0, ModRM_RM__, Immediate_0),
You should add some binary encoding tests in BinaryEncoder.cpp.
Done, thanks.
on power, codegen recognizes
@@ -615,6 +615,15 @@ INSTRUCTION(VBROADCASTSDZmmXmm, vbroadcastsd,
PROPERTY0(IA32OpProp_ModifiesTarget | IA32OpProp_SourceRegisterInModRM),
PROPERTY1(IA32OpProp1_XMMSource | IA32OpProp1_XMMTarget | IA32OpProp1_SIMDSingleSource),
FEATURES(X86FeatureProp_EVEX512Supported | X86FeatureProp_EVEX512RequiresAVX512F)),
INSTRUCTION(VPBROADCASTBRegReg, vpbroadcastb,
vpbroadcastb has this encoding:
EVEX.128.66.0F38.W0 7A /r
It should not have an opcode byte of 0x78.
Edit: there are two versions of this instruction; one takes GPR input.
Yes, it can be adapted to work with wider types, at the cost of additional path length, but I don't think it would be usable in a runtime like Java because it would introduce tearing. There's no way to do aligned stores on unaligned wider elements that wouldn't result in tearing; we'd have to give up aligned stores and see if it was still faster.
One key use case for arrayset is to zero-initialize stack-allocated objects. Obviously it can be used in other contexts, but I thought I would mention that one since we could mark such arraysets in some way, the key aspect being that stack-allocated objects won't be visible on other threads, and therefore the word-tearing concerns may not apply.
Even arraysets on newly allocated heap allocations may be possible to mark in some way, thus leading to the assumption that the memory is not visible to another thread. This brings up another question: are these techniques you have for arrayset optimization also possible to apply to the allocation code sequences we generate for
Maybe an "allow tearing" node flag for
Yes, I have a patch for that in the works, but it's still not ready.
The "allow tearing" node flag for arrayset (and arraycopy) would help with the problem.
Add support for the following AVX2 x86 instructions:
vpbroadcastb
vpbroadcastw
vpbroadcastd
vpbroadcastq
Broadcast a byte/word/dword/qword integer in the source operand to all lanes of the target xmm.
Signed-off-by: Younes Manton <[email protected]>
Signed-off-by: Younes Manton <[email protected]>
Signed-off-by: Bradley Wood <[email protected]>
If we have AVX2 we can directly broadcast 8- and 16-bit values instead of incrementally bumping them to 16- and then 32-bit values first. Signed-off-by: Younes Manton <[email protected]>
For cases where the length of an arrayset is not known at compile-time, or known to be >= 256 bytes, we generate a REP STOS sequence. This is sub-optimal for small lengths as REP STOS has a relatively high setup cost.
This patch implements a different sequence for a subset of the cases that are currently handled by REP STOSB. Specifically, if the arrayset element size is 1 byte, it generates sequences to handle the following lengths:
0 bytes
1-3 bytes, 2-3 stores, 1 branch
4-15 bytes, 4 stores, branch-free
16-31 bytes, 2 unaligned 16-byte stores, branch-free
32-63 bytes, 4 unaligned 16-byte stores, branch-free
>=64 bytes, per iteration: 1 unaligned 16-byte store, 4 aligned 16-byte stores in a loop, 3 aligned 16-byte stores in the residue, 1 unaligned 16-byte store in the residue, 1 compare + branch
If the length is known to be >=64 only the loop will be generated.
Signed-off-by: Younes Manton <[email protected]>
switch (et)
   {
   case TR::Int8:
      TR_ASSERT_FATAL(cg->comp()->target().cpu.supportsFeature(OMR_FEATURE_X86_AVX2), "8-bit to 128-bit vsplats requires AVX2");
So I expect Int8 and Int16 would crash if AVX2 is not supported. The original implementation was a little biased towards older microarchitectures, using the punpcklbw/pshuflw sequence to expand byte and word values to 32 bits before using pshufd to broadcast to all lanes.
I think it's best to write this in two sections: one for AVX2+ hardware, keeping the older logic for SSE/AVX1. Some rough pseudocode below.
if (avx2) {
   switch (et) {
      case Int8:   opcode = VPBROADCASTB; break;
      case Int16:  opcode = VPBROADCASTW; break;
      case Int32:
      case Float:  opcode = VPBROADCASTD; break;
      case Int64:
      case Double: opcode = VPBROADCASTQ; break;
      default:
         // assert (unexpected type)
   }
   encoding = opcode.getSIMDEncoding(&cg->comp()->target().cpu, vl);
   TR_ASSERT_FATAL(encoding != OMR::X86::Bad, "Broadcast instruction is not supported");
   generateRegRegInstruction(opcode, node, vectorReg, vectorReg, cg, encoding);
} else {
   // Expand byte and word to 32 bits.
   // assert if VL > 128; operation not supported
   // PSHUFD
}
Try to run the omr tests locally with TR_Options=disableAVX2 or TR_Options=disableAVX.
So I expect Int8 and Int16 would crash if AVX2 is not supported.
No, it works because Int8 and Int16 don't reach there. As written now, if AVX2 isn't supported we'll do the 8->16->32 expansion first and et would be updated to Int32. Anyhow, I can try it as suggested; I'm not particularly fond of the nested switches anyway.
OK, I see you set et = TR::Int32.
I'm not particularly fond of the nested switches anyway.
Either way is OK. I will let you decide.
Improve x86 arrayset
For cases where the length of an arrayset is not known at compile-time,
or known to be >= 256 bytes, we generate a REP STOS sequence. This
is sub-optimal for small lengths as REP STOS has a relatively high
setup cost.
This patch implements a different sequence for a subset of the cases
that are currently handled by REP STOS. Specifically, if the arrayset
element size is 1 byte, it generates sequences to handle
the following lengths:
0 bytes
1-3 bytes, 2-3 stores, 1 branch
4-15 bytes, 4 stores, branch-free
16-31 bytes, 2 unaligned 16-byte stores, branch-free
32-63 bytes, 4 unaligned 16-byte stores, branch-free
>=64 bytes, per iteration: 1 unaligned 16-byte store, 4 aligned 16-byte stores in a loop, 3 aligned 16-byte stores in the residue, 1 unaligned 16-byte store in the residue, 1 compare + branch
If the length is known to be >=64 only the loop will be generated.
This code is loosely based on the approach discussed in
"Building Faster AMD64 Memset Routines" by Joe Bialek (January 11, 2021).
https://msrc.microsoft.com/blog/2021/01/building-faster-amd64-memset-routines/
We reduce path length and handle unaligned bytes more efficiently
by setting some bytes multiple times, on the assumption
that stores to overlapping memory ranges are cheaper than executing
extra comparisons and branches to set each byte exactly once.
Add x86 vpbroadcastb instruction
Add support for the following AVX2 x86 instruction:
VEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1, xmm2/m8
Broadcast a byte integer in the source
operand to sixteen locations in xmm1.
This PR should allow #7704 to be merged.