More x86 IR JIT #17975


Merged
merged 8 commits into hrydgard:master on Aug 25, 2023

Conversation

unknownbrackets
Collaborator

Didn't do divide yet, but otherwise this mostly covers the ALU stuff. Still some float and vec stuff that gets hit.

This also handles multiplies the same way as the "modern" 64-bit backends, keeping HI/LO in a single register. That makes madd/etc. simpler and decreases the average bloat further.

-[Unknown]
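For illustration, the single-register HI/LO scheme can be sketched in plain C (hypothetical helper names, not PPSSPP's actual code): a 32×32→64 widening multiply produces both halves in one 64-bit value, and madd becomes a plain 64-bit add into the packed accumulator.

```c
#include <stdint.h>

/* Hypothetical sketch of keeping MIPS HI:LO packed in one 64-bit
 * register: a single signed widening multiply yields HI in the upper
 * 32 bits and LO in the lower 32 bits. */
uint64_t mult_hilo(int32_t rs, int32_t rt) {
    return (uint64_t)((int64_t)rs * (int64_t)rt);
}

/* madd then just adds the new 64-bit product into the packed
 * accumulator -- one IMUL plus one ADD, no cross-register fixup. */
uint64_t madd_hilo(uint64_t hilo, int32_t rs, int32_t rt) {
    return hilo + (uint64_t)((int64_t)rs * (int64_t)rt);
}
```

With HI and LO split across two registers, madd would instead need the product split apart and a widening add-with-carry sequence, which is the bloat the single-register scheme avoids.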

@unknownbrackets unknownbrackets added the x86jit x86/x64 JIT bugs label Aug 25, 2023
@unknownbrackets unknownbrackets added this to the v1.16.0 milestone Aug 25, 2023
@@ -177,12 +177,22 @@ void X64JitBackend::CompIR_Bits(IRInst inst) {
void X64JitBackend::CompIR_Compare(IRInst inst) {
CONDITIONAL_DISABLE;

auto setCC = [&](const OpArg &arg, CCFlags cc) {
Owner

Ah, you did it anyway :) Also nice to centralize the logic in this lambda.

Comment on lines +314 to +315
MOV(64, R(SCRATCH1), Imm64(0xFFFFFFFF00000000ULL));
AND(64, regs_.R(IRREG_LO), R(SCRATCH1));
Owner

I wonder if this is better than shifting into a wall and back? The latter is certainly smaller in bytes.

Collaborator Author

Yeah, I'm not really sure. I imagine this is pretty fast, but yes the shifts would be about 2 bytes smaller I think?

-[Unknown]

Owner

@hrydgard hrydgard Aug 25, 2023

That 64-bit MOV alone is 10 bytes I think, so 12 or 13 with the AND? And the shifts would be 6 bytes in total, or I suppose 8 since REX bytes are needed...

So yeah, not a huge difference in the big picture: 6-7 bytes saved.

Collaborator Author

Oh right, I forgot the AND, which needs the REX as well. I don't think MtLo is very common, though; maybe I should just switch to shifts... it probably won't impact much either way.

-[Unknown]
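The two approaches being compared can be sketched in C (illustrative helper names, not the jit's code); the size comments reflect the rough byte counts discussed above.

```c
#include <stdint.h>

/* Clearing the low 32 bits via MOV r64, imm64 (10 bytes) followed by
 * AND r64, r64 (3 bytes with the REX.W prefix): ~13 bytes total. */
uint64_t clear_low32_and(uint64_t lo) {
    return lo & 0xFFFFFFFF00000000ULL;
}

/* The same result by shifting "into a wall and back":
 * SHR r64, 32 and SHL r64, 32, each ~4 bytes with REX.W: ~8 bytes. */
uint64_t clear_low32_shift(uint64_t lo) {
    return (lo >> 32) << 32;
}
```

Both produce identical results for every input; the trade-off is purely code size versus the (likely negligible) latency of the dependent shift pair.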

Comment on lines +274 to +275
if (cpu_info.bSSE4_1 && inst.dest == inst.src1) {
DPPS(regs_.FX(inst.dest), regs_.F(inst.src2), 0xF1);
Owner

Hm, didn't we talk a while ago about DPPS being slow or bad or something? Still, likely fine.

Collaborator Author

Well, I've found it to be slow, but I think fp64's tests showed it's basically the same speed as doing it manually, with less code. So I guess that's fine.

-[Unknown]

Contributor

Actually, in #17571 I only posted tests for Dot33 (I'm not sure how good they were).
I was mostly going by https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026 .
The claim that "dpps is neither faster nor slower" was something I had heard but never verified.
The tests I ran just now seem to disagree:

CPU: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
dot_sse1            : 1.259 ns/dot
dot_sse3            : 1.077 ns/dot
dot_hadd            : 1.338 ns/dot
dot_dpps            : 1.949 ns/dot

CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
dot_sse1            : 0.883 ns/dot
dot_sse3            : 0.872 ns/dot
dot_hadd            : 1.218 ns/dot
dot_dpps            : 1.754 ns/dot

CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
dot_sse1            : 2.059 ns/dot
dot_sse3            : 1.910 ns/dot
dot_hadd            : 2.211 ns/dot
dot_dpps            : 2.928 ns/dot

CPU: AMD EPYC 7R13 Processor
dot_sse1            : 1.119 ns/dot
dot_sse3            : 1.263 ns/dot
dot_hadd            : 1.526 ns/dot
dot_dpps            : 1.445 ns/dot

CPU: AMD EPYC 7571
dot_sse1            : 2.307 ns/dot
dot_sse3            : 1.975 ns/dot
dot_hadd            : 2.585 ns/dot
dot_dpps            : 2.726 ns/dot

Hope I didn't mess up the tests - feel free to take a look at the code.

Owner

Yeah, indeed it looks like dot_sse1/sse3 is the way to go. So we probably should switch, although it's not exactly urgent in this particular case.

Collaborator Author

Alright, I'll revert my memory back to "dpps is terrible".

On my CPU, SSE1 (in the context of the actual jit) is noticeably faster than SSE3, and both are much faster than DPPS. I think I'll just go with that for now, and we can look at tuning for SSE3 in certain cases someday.

-[Unknown]
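For reference, the SSE1 shuffle/add horizontal sum from the linked Stack Overflow answer can be sketched with intrinsics (illustrative helper name, not the jit's emitted code):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Four-element dot product using only SSE1: multiply, then reduce with
 * two shuffle+add steps instead of a single (slower) DPPS. */
float dot4_sse1(__m128 a, __m128 b) {
    __m128 m = _mm_mul_ps(a, b);
    /* Swap element pairs and add: lanes become (m0+m1, m0+m1, m2+m3, m2+m3). */
    __m128 shuf = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(m, shuf);
    /* Move the high pair down and add it into lane 0. */
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

DPPS (SSE4.1's `_mm_dp_ps`) collapses this into one instruction, but per the benchmarks above it is consistently slower than the shuffle/add sequence on both the Intel and AMD parts tested.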

Collaborator Author

By the way, @fp64, if you're interested I'd be curious to know how the new IR-based jit performs for you. It would be best with all the recent PRs merged. There are still some unimplemented ops, but most of the common ones are there, and it may be faster.

-[Unknown]

Contributor

By the way, @fp64, if you're interested I'd be curious to know how the new IR based jit performs for you.

Sorry, just now noticed, @unknownbrackets. Well, as of 0db3266 all games tested ("Valkyria Chronicles II", "Valkyria Chronicles 3", "Super Robot Taisen A", "Hexyz Force", "Yggdra Union") have fairly significant missing and/or garbled graphics with "JIT Using IR" (unlike the other 3 modes), but speed-wise they seem to be on par with JIT. Nothing crashes, either.
As an example of the graphical issues, Valkyria Chronicles II just shows a black screen at game start (you can advance, though), and during missions draws just its white vignette thing over an otherwise black screen. VC3 is the same way.
Some glitches are fixed by switching to JIT mid-game, some aren't.
Mostly 3D stuff seems to be affected.
5b9bdc1 has the same issues; I haven't tested earlier.

Collaborator Author

Sorry, I must not have tested 32-bit when I adjusted the temp vec4 reg clobber. This should fix that, dumb typo:
#18025

Good if speed is on par as of now. I have a change that helps some games be a bit faster with SIMD, but I haven't had much time since the weekend.

-[Unknown]

@hrydgard hrydgard merged commit 308e983 into hrydgard:master Aug 25, 2023
@unknownbrackets unknownbrackets deleted the x86-jit-ir branch August 25, 2023 07:52