More x86 IR JIT #17975


Merged
merged 8 commits into hrydgard:master on Aug 25, 2023

Conversation

unknownbrackets
Collaborator

Didn't do divide yet, but otherwise this mostly covers the ALU stuff. Still some float and vec stuff that gets hit.

This also handles multiplies the same way as the "modern" 64-bit backends, keeping HI/LO in a single register. That makes madd/etc. simpler and decreases the average bloat further.

-[Unknown]
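For illustration, the single-register HI/LO scheme can be sketched in plain C (hypothetical helper names, not PPSSPP's actual code): a 32×32→64 widening multiply produces both halves in one 64-bit value, and madd becomes a plain 64-bit add into the packed accumulator.

```c
#include <stdint.h>

/* Hypothetical sketch of keeping MIPS HI:LO packed in one 64-bit
 * register: a single signed widening multiply yields HI in the upper
 * 32 bits and LO in the lower 32 bits. */
uint64_t mult_hilo(int32_t rs, int32_t rt) {
    return (uint64_t)((int64_t)rs * (int64_t)rt);
}

/* madd then just adds the new 64-bit product into the packed
 * accumulator -- one IMUL plus one ADD, no cross-register fixup. */
uint64_t madd_hilo(uint64_t hilo, int32_t rs, int32_t rt) {
    return hilo + (uint64_t)((int64_t)rs * (int64_t)rt);
}
```

With HI and LO split across two registers, madd would instead need the product split apart and a widening add-with-carry sequence, which is the bloat the single-register scheme avoids.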

@unknownbrackets unknownbrackets added the x86jit x86/x64 JIT bugs label Aug 25, 2023
@unknownbrackets unknownbrackets added this to the v1.16.0 milestone Aug 25, 2023
@@ -177,12 +177,22 @@ void X64JitBackend::CompIR_Bits(IRInst inst) {
void X64JitBackend::CompIR_Compare(IRInst inst) {
CONDITIONAL_DISABLE;

auto setCC = [&](const OpArg &arg, CCFlags cc) {
Owner

Ah, you did it anyway :) Also nice to centralize the logic in this lambda.

Comment on lines +314 to +315
MOV(64, R(SCRATCH1), Imm64(0xFFFFFFFF00000000ULL));
AND(64, regs_.R(IRREG_LO), R(SCRATCH1));
Owner

I wonder if this is better than shifting into a wall and back? The latter is certainly smaller in bytes.

Collaborator Author

Yeah, I'm not really sure. I imagine this is pretty fast, but yes the shifts would be about 2 bytes smaller I think?

-[Unknown]

Owner

@hrydgard hrydgard Aug 25, 2023

That 64-bit MOV alone is 10 bytes I think, so 12 or 13 with the AND? And the shifts would be 6 bytes in total, or I suppose 8 since REX bytes are needed...

So yeah, not a huge difference in the big picture: 6-7 bytes saved.

Collaborator Author

Oh right, I forgot the AND, which needs the REX as well. I don't think MtLo is very common, though; maybe I should just switch to shifts... it probably won't impact much either way.

-[Unknown]
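The two approaches being compared can be sketched in C (illustrative helper names, not the jit's code); the size comments reflect the rough byte counts discussed above.

```c
#include <stdint.h>

/* Clearing the low 32 bits via MOV r64, imm64 (10 bytes) followed by
 * AND r64, r64 (3 bytes with the REX.W prefix): ~13 bytes total. */
uint64_t clear_low32_and(uint64_t lo) {
    return lo & 0xFFFFFFFF00000000ULL;
}

/* The same result by shifting "into a wall and back":
 * SHR r64, 32 and SHL r64, 32, each ~4 bytes with REX.W: ~8 bytes. */
uint64_t clear_low32_shift(uint64_t lo) {
    return (lo >> 32) << 32;
}
```

Both produce identical results for every input; the trade-off is purely code size versus the (likely negligible) latency of the dependent shift pair.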

Comment on lines +274 to +275
if (cpu_info.bSSE4_1 && inst.dest == inst.src1) {
DPPS(regs_.FX(inst.dest), regs_.F(inst.src2), 0xF1);
Owner

Hm, didn't we talk a while ago about DPPS being slow or bad or something? Still, likely fine.

Collaborator Author

Well, I've found it to be slow, but I think fp64's tests showed it's basically the same speed as doing it manually, with less code. So I guess that's fine.

-[Unknown]

Contributor

Actually, in #17571 I only posted tests for Dot33 (I'm not sure how good they were).
I was mostly going by https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026 .
The claim that "dpps is neither faster nor slower" was something I had heard but never verified.
The tests I ran just now seem to disagree:

CPU: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
dot_sse1            : 1.259 ns/dot
dot_sse3            : 1.077 ns/dot
dot_hadd            : 1.338 ns/dot
dot_dpps            : 1.949 ns/dot

CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
dot_sse1            : 0.883 ns/dot
dot_sse3            : 0.872 ns/dot
dot_hadd            : 1.218 ns/dot
dot_dpps            : 1.754 ns/dot

CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
dot_sse1            : 2.059 ns/dot
dot_sse3            : 1.910 ns/dot
dot_hadd            : 2.211 ns/dot
dot_dpps            : 2.928 ns/dot

CPU: AMD EPYC 7R13 Processor
dot_sse1            : 1.119 ns/dot
dot_sse3            : 1.263 ns/dot
dot_hadd            : 1.526 ns/dot
dot_dpps            : 1.445 ns/dot

CPU: AMD EPYC 7571
dot_sse1            : 2.307 ns/dot
dot_sse3            : 1.975 ns/dot
dot_hadd            : 2.585 ns/dot
dot_dpps            : 2.726 ns/dot

Hope I didn't mess up the tests - feel free to take a look at the code.

Owner

Yeah, indeed it looks like dot_sse1/sse3 is the way to go. So we probably should switch, although it's not exactly urgent in this particular case.

Collaborator Author

Alright, I'll revert my memory back to "dpps is terrible".

On my CPU, SSE1 (in the context of the actual jit) is noticeably faster than SSE3, and both are much faster than DPPS. I think I'll just go with that for now, and we can look at tuning for SSE3 in certain cases someday.

-[Unknown]
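For reference, the SSE1 shuffle/add horizontal sum from the linked Stack Overflow answer can be sketched with intrinsics (illustrative helper name, not the jit's emitted code):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Four-element dot product using only SSE1: multiply, then reduce with
 * two shuffle+add steps instead of a single (slower) DPPS. */
float dot4_sse1(__m128 a, __m128 b) {
    __m128 m = _mm_mul_ps(a, b);
    /* Swap element pairs and add: lanes become (m0+m1, m0+m1, m2+m3, m2+m3). */
    __m128 shuf = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(m, shuf);
    /* Move the high pair down and add it into lane 0. */
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

DPPS (SSE4.1's `_mm_dp_ps`) collapses this into one instruction, but per the benchmarks above it is consistently slower than the shuffle/add sequence on both the Intel and AMD parts tested.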

Collaborator Author

By the way, @fp64, if you're interested I'd be curious to know how the new IR-based jit performs for you. It would be best with all the recent PRs merged. There are still some unimplemented ops, but most of the common ones are there, and it may be faster.

-[Unknown]

Contributor

By the way, @fp64, if you're interested I'd be curious to know how the new IR based jit performs for you.

Sorry, just now noticed, @unknownbrackets. Well, as of 0db3266 all games tested ("Valkyria Chronicles II", "Valkyria Chronicles 3", "Super Robot Taisen A", "Hexyz Force", "Yggdra Union") have fairly significant missing and/or garbled graphics with "JIT Using IR" (unlike the other 3 modes), but speed-wise they seem to be on par with JIT. Nothing crashes, either.
As an example of the graphical issues, Valkyria Chronicles II just shows a black screen at game start (you can advance, though), and during missions draws just its white vignette thing over an otherwise black screen. VC3 is the same way.
Some glitches are fixed by switching to JIT mid-game, some aren't.
Mostly 3D stuff seems to be affected.
5b9bdc1 has the same issues; I haven't tested earlier.

Collaborator Author

Sorry, I must not have tested 32-bit when I adjusted the temp vec4 reg clobber. This should fix that, dumb typo:
#18025

Good if speed is on par as of now. I have a change that helps some games be a bit faster with SIMD, but I haven't had much time since the weekend.

-[Unknown]

@hrydgard hrydgard merged commit 308e983 into hrydgard:master Aug 25, 2023
@unknownbrackets unknownbrackets deleted the x86-jit-ir branch August 25, 2023 07:52