Skip to content

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #14361

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 18, 2025

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Mar 17, 2025

This PR tries another way to implement the idea of #13521, taking advantage of auto-vectorized loop to decode ints like we did in for bpv24 in #14203.

One thing need to be pointed out is that the remainder loop does not get vectorized (again!) since 512 / 3 = 170 is not a multiple of 8, then you see thefloorToMultipleOf8 trick .

Mac M2

Benchmark                                            Mode  Cnt    Score   Error   Units
Decode21Benchmark.decode21Scalar                    thrpt    5   92.405 ± 0.521  ops/ms
Decode21Benchmark.decode21Vector                    thrpt    5  108.325 ± 1.517  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8  thrpt    5  141.691 ± 3.948  ops/ms

Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (AVX 512)

Benchmark                                                Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                        thrpt    5  29.134 ? 0.087  ops/ms
Decode21Benchmark.decode21Scalar:asm                    thrpt          NaN             ---
Decode21Benchmark.decode21Vector                        thrpt    5  45.180 ? 0.479  ops/ms
Decode21Benchmark.decode21Vector:asm                    thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8      thrpt    5  76.330 ? 2.858  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm  thrpt          NaN             ---

cc @expani who raised this neat idea.

@gf2121
Copy link
Contributor Author

gf2121 commented Mar 17, 2025

Results on wikimediumall:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntSet       94.25      (1.4%)       93.50      (1.1%)   -0.8% (  -3% -    1%) 0.069
             CountFilteredIntNRQ       43.06      (2.1%)       43.27      (2.2%)    0.5% (  -3% -    4%) 0.529
                  FilteredIntNRQ       79.65      (3.1%)       80.63      (3.4%)    1.2% (  -5% -    8%) 0.285
                          IntNRQ       81.27      (3.6%)       82.37      (3.8%)    1.3% (  -5% -    9%) 0.311
  baselline candidate diff
_32.kdd 58829701 58713421 -0.20%
_65.kdd 63532709 63171629 -0.57%
_98.kdd 64475136 63986556 -0.76%
_cb.kdd 64765401 64210351 -0.86%
_fe.kdd 63945666 63346246 -0.94%
_fp.kdd 6619883 6094047 -7.94%
_g0.kdd 6422847 5915855 -7.89%
_gb.kdd 6483860 5981968 -7.74%
_gm.kdd 6494743 6005943 -7.53%
_gx.kdd 6253202 5811974 -7.06%
_gy.kdd 522267 522267 0.00%
_gz.kdd 521444 521444 0.00%
_h0.kdd 522729 522729 0.00%
_h1.kdd 504805 504805 0.00%
_h2.kdd 463669 463669 0.00%

@jpountz
Copy link
Contributor

jpountz commented Mar 17, 2025

Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well? (By the way, which of your machine produced the above benchmark results?) Otherwise, the change makes sense to me.

@gf2121
Copy link
Contributor Author

gf2121 commented Mar 17, 2025

Thanks for feedback,

Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well?

That is what i thought initially. But my AVX-512 machine (hopefully it is) somehow only deals 256 bit once for its vpand and vpslld instructions so flooring to multiple of 8 works on it. Flooring to multiple of 16 makes benchmark a bit slower on my machine as more remaining longs need to be read but i'm fine to floor it to multiple of 16 there are machines requiring that.

Benchmark                                                 Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                         thrpt    5  28.114 ? 0.013  ops/ms
Decode21Benchmark.decode21Scalar:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21Vector                         thrpt    5  49.160 ? 1.661  ops/ms
Decode21Benchmark.decode21Vector:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf16      thrpt    5  74.828 ? 0.463  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf16:asm  thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8       thrpt    5  81.078 ? 0.397  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm   thrpt          NaN             ---

cpu flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni arch_capabilities

By the way, which of your machine produced the above benchmark results?

The luceneutil results get on the intel chip (Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (AVX 512)).

@gf2121
Copy link
Contributor Author

gf2121 commented Mar 17, 2025

OK i get expected results that multiple of 16 faster than multiple of 8 when i force -XX:UseAVX=3, it can be seen AVX3 is slower on this chip, that might be why java disabled it by default.

Benchmark                                                 Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                         thrpt    5  28.375 ? 0.064  ops/ms
Decode21Benchmark.decode21Scalar:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21Vector                         thrpt    5  41.844 ? 0.182  ops/ms
Decode21Benchmark.decode21Vector:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf16      thrpt    5  64.471 ? 0.218  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf16:asm  thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8       thrpt    5  39.665 ? 0.120  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm   thrpt          NaN             ---

@gf2121 gf2121 merged commit a2c6d3a into apache:main Mar 18, 2025
4 of 7 checks passed
@gf2121 gf2121 added this to the 10.2.0 milestone Mar 19, 2025
jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 24, 2025
jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants