Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #14361

gf2121 · 2025-03-17T07:05:55Z

This PR tries another way to implement the idea of #13521, taking advantage of auto-vectorized loop to decode ints like we did in for bpv24 in #14203.

One thing need to be pointed out is that the remainder loop does not get vectorized (again!) since 512 / 3 = 170 is not a multiple of 8, then you see thefloorToMultipleOf8 trick .

Mac M2

Benchmark                                            Mode  Cnt    Score   Error   Units
Decode21Benchmark.decode21Scalar                    thrpt    5   92.405 ± 0.521  ops/ms
Decode21Benchmark.decode21Vector                    thrpt    5  108.325 ± 1.517  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8  thrpt    5  141.691 ± 3.948  ops/ms

Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (AVX 512)

Benchmark                                                Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                        thrpt    5  29.134 ? 0.087  ops/ms
Decode21Benchmark.decode21Scalar:asm                    thrpt          NaN             ---
Decode21Benchmark.decode21Vector                        thrpt    5  45.180 ? 0.479  ops/ms
Decode21Benchmark.decode21Vector:asm                    thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8      thrpt    5  76.330 ? 2.858  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm  thrpt          NaN             ---

cc @expani who raised this neat idea.

Co-Authored-by: expani <[email protected]>

gf2121 · 2025-03-17T11:08:11Z

Results on wikimediumall:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntSet       94.25      (1.4%)       93.50      (1.1%)   -0.8% (  -3% -    1%) 0.069
             CountFilteredIntNRQ       43.06      (2.1%)       43.27      (2.2%)    0.5% (  -3% -    4%) 0.529
                  FilteredIntNRQ       79.65      (3.1%)       80.63      (3.4%)    1.2% (  -5% -    8%) 0.285
                          IntNRQ       81.27      (3.6%)       82.37      (3.8%)    1.3% (  -5% -    9%) 0.311

	baselline	candidate	diff
_32.kdd	58829701	58713421	-0.20%
_65.kdd	63532709	63171629	-0.57%
_98.kdd	64475136	63986556	-0.76%
_cb.kdd	64765401	64210351	-0.86%
_fe.kdd	63945666	63346246	-0.94%
_fp.kdd	6619883	6094047	-7.94%
_g0.kdd	6422847	5915855	-7.89%
_gb.kdd	6483860	5981968	-7.74%
_gm.kdd	6494743	6005943	-7.53%
_gx.kdd	6253202	5811974	-7.06%
_gy.kdd	522267	522267	0.00%
_gz.kdd	521444	521444	0.00%
_h0.kdd	522729	522729	0.00%
_h1.kdd	504805	504805	0.00%
_h2.kdd	463669	463669	0.00%

jpountz · 2025-03-17T11:30:15Z

Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well? (By the way, which of your machine produced the above benchmark results?) Otherwise, the change makes sense to me.

gf2121 · 2025-03-17T12:08:41Z

Thanks for feedback,

Should we floor to a multiple of 16 instead of 8 so that we have a perfect second loop with AVX-512 as well?

That is what i thought initially. But my AVX-512 machine (hopefully it is) somehow only deals 256 bit once for its vpand and vpslld instructions so flooring to multiple of 8 works on it. Flooring to multiple of 16 makes benchmark a bit slower on my machine as more remaining longs need to be read but i'm fine to floor it to multiple of 16 there are machines requiring that.

Benchmark                                                 Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                         thrpt    5  28.114 ? 0.013  ops/ms
Decode21Benchmark.decode21Scalar:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21Vector                         thrpt    5  49.160 ? 1.661  ops/ms
Decode21Benchmark.decode21Vector:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf16      thrpt    5  74.828 ? 0.463  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf16:asm  thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8       thrpt    5  81.078 ? 0.397  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm   thrpt          NaN             ---

cpu flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni arch_capabilities

By the way, which of your machine produced the above benchmark results?

The luceneutil results get on the intel chip (Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (AVX 512)).

gf2121 · 2025-03-17T12:18:46Z

OK i get expected results that multiple of 16 faster than multiple of 8 when i force -XX:UseAVX=3, it can be seen AVX3 is slower on this chip, that might be why java disabled it by default.

Benchmark                                                 Mode  Cnt   Score   Error   Units
Decode21Benchmark.decode21Scalar                         thrpt    5  28.375 ? 0.064  ops/ms
Decode21Benchmark.decode21Scalar:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21Vector                         thrpt    5  41.844 ? 0.182  ops/ms
Decode21Benchmark.decode21Vector:asm                     thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf16      thrpt    5  64.471 ? 0.218  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf16:asm  thrpt          NaN             ---
Decode21Benchmark.decode21VectorFloorToMultipleOf8       thrpt    5  39.665 ? 0.120  ops/ms
Decode21Benchmark.decode21VectorFloorToMultipleOf8:asm   thrpt          NaN             ---

…14361) Co-Authored-by: expani <[email protected]>

…pache#14361) Co-Authored-by: expani <[email protected]>

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree

2e9a202

Co-Authored-by: expani <[email protected]>

gf2121 requested a review from jpountz March 17, 2025 07:05

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Mar 17, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Mar 17, 2025

github-actions bot added the module:core/codecs label Mar 17, 2025

add license to make CI run

20ad43e

gf2121 added 2 commits March 17, 2025 20:26

floor to multiple of 16

85a9ed5

fix assert info

c0d3540

jpountz approved these changes Mar 17, 2025

View reviewed changes

gf2121 merged commit a2c6d3a into apache:main Mar 18, 2025
4 of 7 checks passed

github-project-automation bot moved this from Open to Merged in OpenSearch Lucene & Core Performance Tracking Mar 18, 2025

asfgit pushed a commit that referenced this pull request Mar 18, 2025

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree (#…

01fc85d

…14361) Co-Authored-by: expani <[email protected]>

gf2121 mentioned this pull request Mar 18, 2025

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #13521

Closed

gf2121 added this to the 10.2.0 milestone Mar 19, 2025

jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 24, 2025

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree (a…

6513bff

…pache#14361) Co-Authored-by: expani <[email protected]>

jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 24, 2025

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree (a…

6933976

…pache#14361) Co-Authored-by: expani <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #14361

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #14361

Uh oh!

gf2121 commented Mar 17, 2025

Uh oh!

gf2121 commented Mar 17, 2025

Uh oh!

jpountz commented Mar 17, 2025

Uh oh!

gf2121 commented Mar 17, 2025

Uh oh!

gf2121 commented Mar 17, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #14361

Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #14361

Uh oh!

Conversation

gf2121 commented Mar 17, 2025

Uh oh!

gf2121 commented Mar 17, 2025

Uh oh!

jpountz commented Mar 17, 2025

Uh oh!

gf2121 commented Mar 17, 2025

Uh oh!

gf2121 commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gf2121 commented Mar 17, 2025 •

edited

Loading