Investigate the impact of slower PAUSE instruction on Skylake+ #10773

Open
@mikezhang1234567890

Background

In Skylake and later, the PAUSE instruction takes an order of magnitude more cycles than it did in previous architectures.
https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/ provides good background information on this.

From Agner Fog's instruction tables, here are the latencies of the PAUSE instruction (in clock cycles) on Skylake and previous generations:

Architecture    Latency (cycles)
Sandy Bridge    11
Ivy Bridge      10
Haswell          9
Broadwell        9
SkylakeX       141
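
To put the table in perspective, the slowdown factors can be computed directly from the listed latencies. The sketch below is plain arithmetic on those values; the dictionary and function names are illustrative only and do not come from OpenJ9/OMR.

```python
# PAUSE latencies in cycles, copied from the table above.
PAUSE_LATENCY = {
    "Sandy Bridge": 11,
    "Ivy Bridge": 10,
    "Haswell": 9,
    "Broadwell": 9,
    "SkylakeX": 141,
}


def slowdown_vs_skylakex(latencies):
    """Return how many times slower PAUSE is on SkylakeX than on each older architecture."""
    skx = latencies["SkylakeX"]
    return {
        arch: skx / cycles
        for arch, cycles in latencies.items()
        if arch != "SkylakeX"
    }


if __name__ == "__main__":
    for arch, factor in slowdown_vs_skylakex(PAUSE_LATENCY).items():
        print(f"SkylakeX vs {arch}: {factor:.1f}x slower")
```

The ratios range from about 12.8x (Sandy Bridge) to about 15.7x (Haswell, Broadwell), which is consistent with the 10-15x factor used in the experiments below.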

Potential Impact On Locking Code

Conceptually, PAUSE is a hint that the thread is in a spin-wait loop; it should let other threads (for example, the sibling hyperthread) make progress rather than block them. However, if PAUSE takes 10x longer, each unit step of the spin loop takes longer, roughly in proportion to the PAUSE slowdown, so a spinning thread checks the lock less often. This increases the delay between a lock being released and a spinner acquiring it, which decreases lock-acquisition throughput. In low-contention cases, retrying the CAS may then be cheaper than executing PAUSE instructions.
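
A rough way to reason about this: if one spin step costs the PAUSE latency plus some other work (the CAS and the inner nop loop), the step slows down by the full PAUSE ratio only when that other work is negligible. The sketch below is a hypothetical back-of-the-envelope model, not a measurement; `other_cycles` is an assumed input that would have to be measured on real hardware.

```python
def spin_step_slowdown(pause_old, pause_new, other_cycles):
    """Ratio of the new to the old cost of one spin step.

    pause_old / pause_new: PAUSE latency in cycles before / after the change.
    other_cycles: assumed cost of everything else in the step (CAS, nop loop);
                  a hypothetical input, not a measured value.
    """
    if other_cycles < 0:
        raise ValueError("other_cycles must be non-negative")
    return (other_cycles + pause_new) / (other_cycles + pause_old)


if __name__ == "__main__":
    # Haswell (9 cycles) vs SkylakeX (141 cycles), for a few assumed step costs.
    for other in (0, 50, 500):
        ratio = spin_step_slowdown(9, 141, other)
        print(f"other_cycles={other}: step is {ratio:.2f}x slower")
```

The more work a step does besides PAUSE, the smaller the per-step slowdown, so the actual impact depends on the spin-loop parameters and has to be confirmed experimentally.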

Uses of PAUSE insn in OpenJ9/OMR

In OpenJ9/OMR, the PAUSE instruction is used in AtomicOperations::yieldCPU().
yieldCPU is called from the following places:

OpenJ9: https://github.com/eclipse/openj9/search?q=yieldCPU&unscoped_q=yieldCPU
     BytecodeInterpreter.hpp::inlThreadOnSpinWait
     FastJNI_java_lang_Thread.cpp::Fast_java_lang_Thread_onSpinWait
     ObjectMonitor.cpp::spinOnFlatLock [Controllable by cmdline option: spin1/2/yield]
     ObjectMonitor.cpp::spinOnTryEnter [Controllable by cmdline option: tryenterspin1/2/yield]

OMR: https://github.com/eclipse/omr/search?q=yieldCPU&unscoped_q=yieldCPU
     LightweightNonReentrantReaderWriterLock.cpp::enterRead 
     LightweightNonReentrantReaderWriterLock.cpp::enterWrite
     gcspinlock.cpp::omrgc_spinlock_acquire [Disabled]
     threadhelpers.cpp::omrthread_mcs_lock
     threadhelpers.cpp::omrthread_spinlock_acquire [Controllable by cmdline option: threetierspin1/2/3]

Nested vs Non-Nested Mode

In the nested mode, the number of PAUSE insns = (yieldCount * spinCount2):

    for (yieldCount) // value = 45
        for (spinCount2) // value = 32
            CAS
            PAUSE
            for (spinCount1) // value = 256
                nop
        sched_yield or usleep

In the non-nested mode, the number of PAUSE insns = spinCount2:

    for (spinCount2) // value = 32
        CAS
        PAUSE
        for (spinCount1) // value = 256
            nop
    for (yieldCount - 1) // value = 45 - 1
        CAS
        sched_yield or usleep

Note: Non-nested mode is only available for spinOnFlatLock and spinOnTryEnter. 
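
The two loop shapes above can be compared by walking them and counting each operation. The sketch below assumes the worst case, where the lock is never acquired and every loop runs to completion; the function names are illustrative and are not OpenJ9/OMR APIs.

```python
def count_nested(yield_count=45, spin_count2=32, spin_count1=256):
    """Count operations in the nested loop when the lock is never acquired."""
    counts = {"cas": 0, "pause": 0, "nop": 0, "yield": 0}
    for _ in range(yield_count):
        for _ in range(spin_count2):
            counts["cas"] += 1
            counts["pause"] += 1
            counts["nop"] += spin_count1  # inner nop loop
        counts["yield"] += 1  # sched_yield or usleep
    return counts


def count_non_nested(yield_count=45, spin_count2=32, spin_count1=256):
    """Count operations in the non-nested loops when the lock is never acquired."""
    counts = {"cas": 0, "pause": 0, "nop": 0, "yield": 0}
    for _ in range(spin_count2):
        counts["cas"] += 1
        counts["pause"] += 1
        counts["nop"] += spin_count1  # inner nop loop
    for _ in range(yield_count - 1):
        counts["cas"] += 1
        counts["yield"] += 1  # sched_yield or usleep
    return counts


if __name__ == "__main__":
    print("nested:    ", count_nested())
    print("non-nested:", count_non_nested())
```

With the default values, the worst case is 1440 PAUSE instructions in nested mode versus 32 in non-nested mode.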

Experiments to verify the impact of the PAUSE instruction change

Currently, we have no evidence that the PAUSE instruction change has caused a performance regression, so we will run experiments that vary the number of PAUSE instructions executed and measure the performance impact. The number of PAUSE instructions is controlled through the command line options mentioned in the two sections above.

Default spin counts:
-Xthr:spin1=256,spin2=32,yield=45,tryEnterSpin1=256,tryEnterSpin2=32,tryEnterYield=45,threeTierSpinCount1=256,threeTierSpinCount2=32,threeTierSpinCount3=45

The experiments below will show whether changing the number of PAUSE insns improves performance:
[Experiment 1] No spinning (spin-loops may execute ~1 pause per acquire):
-Xthr:minimizeUserCPU

[Experiment 2] Enable non-nested modes:
-Xthr:noNestedSpinning,noTryEnterNestedSpinning

[Experiment 3] Reduce spinCount2 by a factor of 10-15 to account for the slowdown in the PAUSE insn:
-Xthr:spin1=256,spin2=3,yield=45,tryEnterSpin1=256,tryEnterSpin2=3,tryEnterYield=45,threeTierSpinCount1=256,threeTierSpinCount2=3,threeTierSpinCount3=45

[Experiment 4] Reduce yieldCount by a factor of 10-15 to account for the slowdown in the PAUSE insn:
-Xthr:spin1=256,spin2=32,yield=4,tryEnterSpin1=256,tryEnterSpin2=32,tryEnterYield=4,threeTierSpinCount1=256,threeTierSpinCount2=32,threeTierSpinCount3=4

[Experiment 5] Reduce spinCount2 (by a factor of ~3) and yieldCount (by a factor of 5) to account for a 15x slowdown in PAUSE:
-Xthr:spin1=256,spin2=11,yield=9,tryEnterSpin1=256,tryEnterSpin2=11,tryEnterYield=9,threeTierSpinCount1=256,threeTierSpinCount2=11,threeTierSpinCount3=9

[Experiment 6] Double the number of PAUSE insns in nested mode (48 * 60 = 2880 vs. the default 32 * 45 = 1440) to verify the regression:
-Xthr:spin1=256,spin2=48,yield=60,tryEnterSpin1=256,tryEnterSpin2=48,tryEnterYield=60,threeTierSpinCount1=256,threeTierSpinCount2=48,threeTierSpinCount3=60

[Experiment 7] Default parameters
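
The option strings for the experiments that change spin counts can be generated rather than typed by hand, which also makes it easy to check the nested-mode PAUSE count (spinCount2 * yieldCount) of each configuration. This is a minimal sketch: the option names are the ones listed above, while the helper names and the `EXPERIMENTS` table are hypothetical scaffolding.

```python
def xthr_options(spin1, spin2, yield_count):
    """Build an -Xthr string that applies the same values to all three option groups."""
    return (
        f"-Xthr:spin1={spin1},spin2={spin2},yield={yield_count},"
        f"tryEnterSpin1={spin1},tryEnterSpin2={spin2},tryEnterYield={yield_count},"
        f"threeTierSpinCount1={spin1},threeTierSpinCount2={spin2},"
        f"threeTierSpinCount3={yield_count}"
    )


def nested_pause_count(spin2, yield_count):
    """Worst-case number of PAUSE insns in nested mode."""
    return spin2 * yield_count


# (spin2, yield) per experiment, copied from the list above; spin1 stays at 256.
EXPERIMENTS = {
    "Experiment 3": (3, 45),
    "Experiment 4": (32, 4),
    "Experiment 5": (11, 9),
    "Experiment 6": (48, 60),
    "Experiment 7": (32, 45),
}

if __name__ == "__main__":
    baseline = nested_pause_count(*EXPERIMENTS["Experiment 7"])
    for name, (spin2, yield_count) in EXPERIMENTS.items():
        pauses = nested_pause_count(spin2, yield_count)
        print(f"{name}: {pauses} PAUSE insns ({pauses / baseline:.2f}x default)")
        print(f"  {xthr_options(256, spin2, yield_count)}")
```

In nested mode, Experiments 3, 4 and 5 execute 135, 128 and 99 PAUSE insns in the worst case, a reduction of roughly 10.7x, 11.3x and 14.5x from the default.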

Benchmarks to be used

Bumblebench will be used as the framework for these experiments: https://github.com/AdoptOpenJDK/bumblebench/tree/master/net/adoptopenjdk/bumblebench
Specific benchmark TBD

Machines to be used

TBD, but will involve one Skylake machine and one pre-Skylake machine.
