Improve x86 arrayset #7763
base: master
Conversation
Compared f1aabc1 to f2fb9c3
FYI @vijaysun-omr
Excellent results. I would request review from @0xdaryl and optionally from @hzongaro and @BradleyWood for this.
Can
TR::Register *xmmValueReg = cg->allocateRegister(TR_FPR);

generateRegRegInstruction(TR::InstOpCode::MOVDRegReg4, node, xmmValueReg, valueReg, cg);
generateRegRegInstruction(TR::InstOpCode::VPBROADCASTBRegReg, node, xmmValueReg, xmmValueReg, cg);
This instruction belongs to AVX-512.
Edit: there are two versions of this instruction; the one with GPR input requires AVX-512. On AVX-512 hardware you could use that version to eliminate the movd instruction.
Thanks, I'll pursue this.
PROPERTY0(IA32OpProp_ModifiesTarget | IA32OpProp_SourceRegisterInModRM),
PROPERTY1(IA32OpProp1_XMMSource | IA32OpProp1_XMMTarget | IA32OpProp1_SIMDSingleSource),
FEATURES(X86FeatureProp_VEX128Supported | X86FeatureProp_VEX128RequiresAVX2 | X86FeatureProp_VEX256Supported | X86FeatureProp_VEX256RequiresAVX2 |
X86FeatureProp_EVEX128Supported | X86FeatureProp_EVEX128RequiresAVX512F | X86FeatureProp_EVEX128RequiresAVX512VL |
VPBROADCASTB does not support VEX encoding. It is an AVX-512 instruction with no backward compatibility for AVX/AVX2. Edit: there are two versions of this instruction; one takes GPR input.
VPBROADCASTB EVEX requires these flags:
- X86FeatureProp_EVEX128RequiresAVX512BW
- X86FeatureProp_EVEX256RequiresAVX512BW
- X86FeatureProp_EVEX512RequiresAVX512BW
Addressed. I've also added the w, d, and q versions of the instruction and updated the props for those as well.
TR::Register *scratch1Reg = cg->allocateRegister(TR_GPR);
TR::Register *scratch2Reg = cg->allocateRegister(TR_GPR);
TR::Register *xmmValueReg = cg->allocateRegister(TR_FPR);
Needs to be TR_VRF
Fixed, thanks.
generateLabelInstruction(TR::InstOpCode::label, node, loopLabel, cg);
// 64-byte per iteration aligned store loop
generateMemRegInstruction(TR::InstOpCode::MOVAPSMemReg, node, generateX86MemoryReference(addressReg, 0, cg), xmmValueReg, cg);
generateMemRegInstruction(TR::InstOpCode::MOVAPSMemReg, node, generateX86MemoryReference(addressReg, 16, cg), xmmValueReg, cg);
movaps requires alignment (to the size of the vector length); is the memory reference always aligned?
Yes, there's a preceding movups that will set any unaligned bytes, and then the pointer is bumped to the next 16-byte aligned address. (The previous patch left the pointer at the same address if it was already aligned, so the same 16 bytes would be set by movups and then again by the first movaps in the loop. I've fixed that to always move the pointer to the next aligned address, whether or not it was already aligned.)
TR::Register *xmmValueReg = cg->allocateRegister(TR_FPR);

generateRegRegInstruction(TR::InstOpCode::MOVDRegReg4, node, xmmValueReg, valueReg, cg);
generateRegRegInstruction(TR::InstOpCode::VPBROADCASTBRegReg, node, xmmValueReg, xmmValueReg, cg);
I was working on a helper for broadcast operations, you can contribute it here if you'd like, and modify as needed to use the vpbroadcastb instruction when possible.
Thanks, I've cherry-picked it, added a patch on top, and made use of it here. Let me know if it works for you.
@@ -615,6 +615,15 @@ INSTRUCTION(VBROADCASTSDZmmXmm, vbroadcastsd,
PROPERTY0(IA32OpProp_ModifiesTarget | IA32OpProp_SourceRegisterInModRM),
PROPERTY1(IA32OpProp1_XMMSource | IA32OpProp1_XMMTarget | IA32OpProp1_SIMDSingleSource),
FEATURES(X86FeatureProp_EVEX512Supported | X86FeatureProp_EVEX512RequiresAVX512F)),
INSTRUCTION(VPBROADCASTBRegReg, vpbroadcastb,
BINARY(VEX_L128, VEX_vNONE, PREFIX_66, REX__, ESCAPE_0F38, 0x78, 0, ModRM_RM__, Immediate_0),
You should add some binary encoding tests in BinaryEncoder.cpp.
Done, thanks.
on power, codegen recognizes
@@ -615,6 +615,15 @@ INSTRUCTION(VBROADCASTSDZmmXmm, vbroadcastsd,
PROPERTY0(IA32OpProp_ModifiesTarget | IA32OpProp_SourceRegisterInModRM),
PROPERTY1(IA32OpProp1_XMMSource | IA32OpProp1_XMMTarget | IA32OpProp1_SIMDSingleSource),
FEATURES(X86FeatureProp_EVEX512Supported | X86FeatureProp_EVEX512RequiresAVX512F)),
INSTRUCTION(VPBROADCASTBRegReg, vpbroadcastb,
vpbroadcastb has this encoding:
EVEX.128.66.0F38.W0 7A /r
It should not have an opcode byte of 0x78.
Edit: there are two versions of this instruction; one takes GPR input.
Yes, it can be adapted to work with wider types, at the cost of additional path length, but I don't think it would be usable in a runtime like Java because it would introduce tearing. There's no way to do aligned stores on unaligned wider elements that wouldn't result in tearing; we'd have to give up aligned stores and see if it was still faster.
One key use case for arrayset is to zero-initialize stack-allocated objects. Obviously it can be used in other contexts, but I thought I would mention that one since we could mark such arraysets in some way, the key aspect being that stack-allocated objects won't be visible on other threads, and therefore the word-tearing concerns may not apply.
Even arraysets on newly allocated heap allocations may be possible to mark in some way, thus leading to the assumption that the memory is not visible to another thread. This brings up another question: are these techniques you have for arrayset optimization also possible to apply to the allocation code sequences we generate for
Maybe an "allow tearing" node flag for
Yes, I have a patch for that in the works, but it's still not ready.
The "allow tearing" node flag for arrayset (and arraycopy) would help with the problem.
Add support for the following AVX2 x86 instructions:
vpbroadcastb
vpbroadcastw
vpbroadcastd
vpbroadcastq
Broadcast a byte/word/dword/qword integer in the source operand to all lanes of the target xmm.
Signed-off-by: Younes Manton <[email protected]>
Signed-off-by: Younes Manton <[email protected]>
Signed-off-by: Bradley Wood <[email protected]>
If we have AVX2 we can directly broadcast 8- and 16-bit values instead of incrementally bumping them to 16- and then 32-bit values first. Signed-off-by: Younes Manton <[email protected]>
For cases where the length of an arrayset is not known at compile-time, or known to be >= 256 bytes, we generate a REP STOS sequence. This is sub-optimal for small lengths as REP STOS has a relatively high setup cost.
This patch implements a different sequence for a subset of the cases that are currently handled by REP STOSB. Specifically, if the arrayset element size is 1 byte, it generates sequences to handle the following lengths:
0 bytes
1-3 bytes, 2-3 stores, 1 branch
4-15 bytes, 4 stores, branch-free
16-31 bytes, 2 unaligned 16-byte stores, branch-free
32-63 bytes, 4 unaligned 16-byte stores, branch-free
>=64 bytes, per iteration: 1 unaligned 16-byte store, 4 aligned 16-byte stores in a loop, 3 aligned 16-byte stores in the residue, 1 unaligned 16-byte store in the residue, 1 compare + branch
If the length is known to be >=64 only the loop will be generated.
Signed-off-by: Younes Manton <[email protected]>
switch (et)
   {
   case TR::Int8:
      TR_ASSERT_FATAL(cg->comp()->target().cpu.supportsFeature(OMR_FEATURE_X86_AVX2), "8-bit to 128-bit vsplats requires AVX2");
So I expect Int8 and Int16 would crash if AVX2 is not supported. The original implementation was a little biased towards older microarchitectures, using the punpcklbw/pshuflw sequence to expand byte and word values to 32 bits before using pshufd to broadcast to all lanes.
I think it's best to write this in two sections: one for AVX2+ hardware, keeping the older logic for SSE/AVX1. Some rough pseudocode below.
if (avx2) {
   switch (et) {
      case Int8:   opcode = VPBROADCASTB; break;
      case Int16:  opcode = VPBROADCASTW; break;
      case Int32:
      case Float:  opcode = VPBROADCASTD; break;
      case Int64:
      case Double: opcode = VPBROADCASTQ; break;
      default:
         // assert (unexpected type)
   }
   encoding = opcode.getSIMDEncoding(&cg->comp()->target().cpu, vl);
   TR_ASSERT_FATAL(encoding != OMR::X86::Bad, "Broadcast instruction is not supported");
   generateRegRegInstruction(opcode, node, vectorReg, vectorReg, cg, encoding);
} else {
   // Expand byte and word to 32 bits.
   // assert if VL > 128; operation not supported
   // PSHUFD
}
Try to run the omr tests locally with TR_Options=disableAVX2 or TR_Options=disableAVX.
So I expect Int8 and Int16 would crash if AVX2 is not supported.
No, it works because Int8 and Int16 don't reach there. As written now, if AVX2 isn't supported we'll do the 8->16->32 expansion first and et would be updated to Int32. Anyhow, I can try it as suggested; I'm not particularly fond of the nested switches anyway.
OK, I see you set et = TR::Int32.
I'm not particularly fond of the nested switches anyway.
Either way is OK. I will let you decide.
Improve x86 arrayset
For cases where the length of an arrayset is not known at compile-time,
or known to be >= 256 bytes, we generate a REP STOS sequence. This
is sub-optimal for small lengths as REP STOS has a relatively high
setup cost.
This patch implements a different sequence for a subset of the cases
that are currently handled by REP STOS. Specifically, if the arrayset
element size is 1 byte, it generates sequences to handle
the following lengths:
0 bytes
1-3 bytes, 2-3 stores, 1 branch
4-15 bytes, 4 stores, branch-free
16-31 bytes, 2 unaligned 16-byte stores, branch-free
32-63 bytes, 4 unaligned 16-byte stores, branch-free
>=64 bytes, per iteration: 1 unaligned 16-byte store, 4 aligned 16-byte stores in a loop, 3 aligned 16-byte stores in the residue, 1 unaligned 16-byte store in the residue, 1 compare + branch
If the length is known to be >=64 only the loop will be generated.
This code is loosely based on the approach discussed in
"Building Faster AMD64 Memset Routines" by Joe Bialek (January 11, 2021).
https://msrc.microsoft.com/blog/2021/01/building-faster-amd64-memset-routines/
We reduce path length and handle unaligned bytes more efficiently
by setting some bytes multiple times, on the assumption
that stores to overlapping memory ranges are cheaper than executing
extra comparisons and branches to set each byte exactly once.
Add x86 vpbroadcastb instruction
Add support for the following AVX2 x86 instruction:
VEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1, xmm2/m8
Broadcast a byte integer in the source
operand to sixteen locations in xmm1.
This PR should allow #7704 to be merged.