-
Notifications
You must be signed in to change notification settings - Fork 2.3k
More x86 IR JIT #17975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More x86 IR JIT #17975
Conversation
d3e9b5b
to
be4fe52
Compare
@@ -177,12 +177,22 @@ void X64JitBackend::CompIR_Bits(IRInst inst) { | |||
void X64JitBackend::CompIR_Compare(IRInst inst) { | |||
CONDITIONAL_DISABLE; | |||
|
|||
auto setCC = [&](const OpArg &arg, CCFlags cc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you did it anyway :) Also nice to centralize the logic in this lambda.
MOV(64, R(SCRATCH1), Imm64(0xFFFFFFFF00000000ULL)); | ||
AND(64, regs_.R(IRREG_LO), R(SCRATCH1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is better than shifting into a wall and back? The latter is certainly smaller in bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not really sure. I imagine this is pretty fast, but yes the shifts would be about 2 bytes smaller I think?
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That 64-bit MOV alone is 10 bytes I think, so 12 or 13 with the AND? And shifts would be 6 bytes in total. Or I suppose 8 since rex bytes are needed...
So yeah, not a huge difference in the big picture. 6-7 bytes saved..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, I forgot the AND, which needs the REX as well. I don't think MtLo is very common though, maybe I should just switch to shifts... probably won't impact much anyway.
-[Unknown]
if (cpu_info.bSSE4_1 && inst.dest == inst.src1) { | ||
DPPS(regs_.FX(inst.dest), regs_.F(inst.src2), 0xF1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, didn't we talk a while ago about DPPS being slow or bad or something? Still, likely fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I've found it to be slow, but I think fp64's tests were that it's basically the same as doing it manually but less code. So I guess that's fine.
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, in #17571 I only posted tests for Dot33 (not sure how good they were).
I was mostly going by https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026 .
The "dpps is neither faster nor slower" was something I heard but not verified.
The tests I ran just now seem to disagree:
CPU: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
dot_sse1 : 1.259 ns/dot
dot_sse3 : 1.077 ns/dot
dot_hadd : 1.338 ns/dot
dot_dpps : 1.949 ns/dot
CPU: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
dot_sse1 : 0.883 ns/dot
dot_sse3 : 0.872 ns/dot
dot_hadd : 1.218 ns/dot
dot_dpps : 1.754 ns/dot
CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
dot_sse1 : 2.059 ns/dot
dot_sse3 : 1.910 ns/dot
dot_hadd : 2.211 ns/dot
dot_dpps : 2.928 ns/dot
CPU: AMD EPYC 7R13 Processor
dot_sse1 : 1.119 ns/dot
dot_sse3 : 1.263 ns/dot
dot_hadd : 1.526 ns/dot
dot_dpps : 1.445 ns/dot
CPU: AMD EPYC 7571
dot_sse1 : 2.307 ns/dot
dot_sse3 : 1.975 ns/dot
dot_hadd : 2.585 ns/dot
dot_dpps : 2.726 ns/dot
Hope I didn't mess up the tests - feel free to take a look at the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah there indeed it looks like dot_sse1/sse3 is the way to go. So we probably should, although not exactly urgent in this case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I'll revert my memory back to "dpps is terrible".
On my CPU, SSE1 (in context of actual jit) is actually noticeably faster than SSE3 - both are much faster than DPPS. I think I'll just go with that for now and we can look at tuning for SSE3 in certain cases someday.
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, @fp64, if you're interested I'd be curious to know how the new IR based jit performs for you. Would be best with all the recent PRs merged. There's still some unimplemented ops, but most of the common ones are there and it may be faster.
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, @fp64, if you're interested I'd be curious to know how the new IR based jit performs for you.
Sorry, just now noticed, @unknownbrackets. Well, as of 0db3266 all games tested ("Valkyria Chronicles II", "Valkyria Chronicles 3", "Super Robot Taisen A", "Hexyz Force", "Yggdra Union") have fairly significant missing and/or garbled graphics with "JIT Using IR" (unlike other 3 modes), but speed-wise they seem to be on par with JIT. Nothing crashes, too.
As an example of graphical issues, Valkyria Chronicles II just shows black screen at game start (you can advance, though), and during mission draws just its white vignette thing over otherwise black screen. VC3 is the same way.
Some glitches are fixed by switching to JIT mid-game, some aren't.
Mostly 3D stuff seems to be affected.
5b9bdc1 has the same issues, haven't tested earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also happens on Win32 build from buildbot (tested on fb520e5, i.e. https://buildbot.orphis.net/ppsspp/index.php?m=dl&rev=v1.15.4-1325-gfb520e58d&platform=windows-x86).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I must not have tested 32-bit when I adjusted the temp vec4 reg clobber. This should fix that, dumb typo:
#18025
Good if speed is on par as of now. I have a change that helps some games be a bit faster with SIMD, but I haven't had much time since the weekend.
-[Unknown]
Didn't do divide yet, but otherwise this mostly covers the ALU stuff. Still some float and vec stuff that gets hit.
This also does multiplies the same way as "modern" 64-bit backends, with a single register. This does make madd/etc. simpler. This decreases the average bloat further.
-[Unknown]