GH-46788: [C++] Enable SIMD for byte stream split with 2 streams #46789

AntoinePrv · 2025-06-12T08:27:45Z

Rationale for this change

Performance improvements for byte stream split encoding with two streams (2-byte values).
For instance, f16 values are often used in machine learning.

What changes are included in this PR?

Enabling ByteStreamSplitDecodeSimd128 for two streams was a straightforward, beneficial change.
ByteStreamSplitEncodeSimd128 was significantly refactored to make it generic over the number of streams. With the new implementation, we can investigate merging it with the AVX2 version.
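
For readers unfamiliar with the encoding, here is a minimal scalar sketch of what byte stream split does for a given number of streams. It is not the code from this PR: the names `ByteStreamSplitEncodeRef` and `ByteStreamSplitDecodeRef` are hypothetical and only illustrate the byte layout that the SIMD kernels have to produce. With two streams, byte 0 of every value goes to the first stream and byte 1 to the second, and the streams are stored back to back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference, for illustration only (not the PR's kernels).
// Input: num_values values of kNumStreams bytes each, stored contiguously.
// Output: kNumStreams streams stored back to back; stream s holds byte s of
// every value.
template <int kNumStreams>
std::vector<uint8_t> ByteStreamSplitEncodeRef(const std::vector<uint8_t>& raw) {
  assert(raw.size() % kNumStreams == 0);
  const std::size_t num_values = raw.size() / kNumStreams;
  std::vector<uint8_t> out(raw.size());
  for (std::size_t i = 0; i < num_values; ++i) {
    for (std::size_t s = 0; s < kNumStreams; ++s) {
      out[s * num_values + i] = raw[i * kNumStreams + s];
    }
  }
  return out;
}

// Inverse operation: gather byte s of value i back from stream s.
template <int kNumStreams>
std::vector<uint8_t> ByteStreamSplitDecodeRef(const std::vector<uint8_t>& encoded) {
  assert(encoded.size() % kNumStreams == 0);
  const std::size_t num_values = encoded.size() / kNumStreams;
  std::vector<uint8_t> out(encoded.size());
  for (std::size_t i = 0; i < num_values; ++i) {
    for (std::size_t s = 0; s < kNumStreams; ++s) {
      out[i * kNumStreams + s] = encoded[s * num_values + i];
    }
  }
  return out;
}
```

For two streams, decoding amounts to interleaving two byte arrays and encoding to de-interleaving them, which is the kind of pattern that 128-bit SIMD shuffles and unpack instructions handle well.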

Are these changes tested?

Yes, with existing tests.

Are there any user-facing changes?

No.

GitHub Issue: [C++] Enable SIMD for Byte Stream Split with 2 streams #46788

github-actions · 2025-06-12T08:28:10Z

⚠️ GitHub issue #46788 has been automatically assigned in GitHub to PR creator.

AntoinePrv · 2025-06-12T08:31:35Z

Benchmark results on my MacBook Pro M3:
archery benchmark diff --suite-filter=parquet-encoding --benchmark-filter='ByteStreamSplit'

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (44)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                      benchmark       baseline      contender  change %                                                                                                                                                                                       counters
 BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536  6.967 GiB/sec 96.028 GiB/sec  1278.358  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 40039}
  BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024  7.008 GiB/sec 80.835 GiB/sec  1053.458 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2561954}
 BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536  5.514 GiB/sec 15.146 GiB/sec   174.667  {'family_index': 7, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 31055}
  BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024  5.527 GiB/sec 14.929 GiB/sec   170.123 {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2072177}
   BM_ByteStreamSplitEncode_Double_Generic/1024 10.424 GiB/sec 17.349 GiB/sec    66.429   {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 961842}
      BM_ByteStreamSplitEncode_Double_Neon/4096 10.955 GiB/sec 17.807 GiB/sec    62.539     {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 252562}
      BM_ByteStreamSplitEncode_Double_Neon/1024 10.984 GiB/sec 17.760 GiB/sec    61.688     {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 996966}
      BM_ByteStreamSplitEncode_Float_Neon/32768  8.316 GiB/sec 12.530 GiB/sec    50.674      {'family_index': 16, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 48757}
       BM_ByteStreamSplitEncode_Float_Neon/1024  8.517 GiB/sec 12.788 GiB/sec    50.146     {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1559934}
       BM_ByteStreamSplitEncode_Float_Neon/4096  8.585 GiB/sec 12.839 GiB/sec    49.545      {'family_index': 16, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 393889}
      BM_ByteStreamSplitEncode_Float_Neon/65536  8.379 GiB/sec 12.405 GiB/sec    48.038      {'family_index': 16, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 24227}
   BM_ByteStreamSplitEncode_Float_Generic/65536  8.378 GiB/sec 12.370 GiB/sec    47.660    {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 24709}
    BM_ByteStreamSplitEncode_Float_Generic/1024  8.725 GiB/sec 12.785 GiB/sec    46.533   {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1601790}
  BM_ByteStreamSplitEncode_Double_Generic/65536  8.978 GiB/sec 12.445 GiB/sec    38.616   {'family_index': 6, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13637}
     BM_ByteStreamSplitEncode_Double_Neon/32768  9.748 GiB/sec 13.113 GiB/sec    34.513     {'family_index': 17, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 27479}
     BM_ByteStreamSplitEncode_Double_Neon/65536  9.869 GiB/sec 12.707 GiB/sec    28.754     {'family_index': 17, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13956}
       BM_ByteStreamSplitDecode_Float_Neon/1024 11.112 GiB/sec 11.741 GiB/sec     5.661     {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2107120}
      BM_ByteStreamSplitDecode_Float_Neon/65536 11.203 GiB/sec 11.668 GiB/sec     4.150      {'family_index': 14, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32359}
      BM_ByteStreamSplitDecode_Float_Neon/32768 11.282 GiB/sec 11.697 GiB/sec     3.683      {'family_index': 14, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 64967}
   BM_ByteStreamSplitDecode_Float_Generic/65536 11.235 GiB/sec 11.541 GiB/sec     2.724    {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32346}
       BM_ByteStreamSplitDecode_Float_Neon/4096 11.636 GiB/sec 11.951 GiB/sec     2.703      {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 529601}
   BM_ByteStreamSplitEncode_Double_Scalar/65536  5.630 GiB/sec  5.708 GiB/sec     1.396    {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7926}
  BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024  5.782 GiB/sec  5.855 GiB/sec     1.257  {'family_index': 8, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 610900}
    BM_ByteStreamSplitDecode_Float_Generic/1024 11.655 GiB/sec 11.797 GiB/sec     1.222   {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2142232}
    BM_ByteStreamSplitDecode_Float_Scalar/65536  6.898 GiB/sec  6.930 GiB/sec     0.466    {'family_index': 10, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19864}
      BM_ByteStreamSplitDecode_Double_Neon/4096  7.516 GiB/sec  7.543 GiB/sec     0.362     {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 171879}
 BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024  5.877 GiB/sec  5.889 GiB/sec     0.212 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 270124}
    BM_ByteStreamSplitEncode_Double_Scalar/1024  5.799 GiB/sec  5.808 GiB/sec     0.159   {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 532964}
   BM_ByteStreamSplitDecode_Double_Generic/1024  7.526 GiB/sec  7.521 GiB/sec    -0.056   {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 691256}
     BM_ByteStreamSplitDecode_Double_Neon/32768  7.300 GiB/sec  7.296 GiB/sec    -0.058     {'family_index': 15, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 20832}
     BM_ByteStreamSplitDecode_Float_Scalar/1024  6.897 GiB/sec  6.891 GiB/sec    -0.079   {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1272218}
 BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536  5.760 GiB/sec  5.755 GiB/sec    -0.091   {'family_index': 8, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9473}
BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536  3.345 GiB/sec  3.340 GiB/sec    -0.154  {'family_index': 9, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2353}
      BM_ByteStreamSplitDecode_Double_Neon/1024  7.551 GiB/sec  7.529 GiB/sec    -0.289     {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 671811}
 BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024  5.874 GiB/sec  5.855 GiB/sec    -0.322 {'family_index': 9, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 267179}
    BM_ByteStreamSplitEncode_Float_Scalar/65536  5.694 GiB/sec  5.673 GiB/sec    -0.357    {'family_index': 12, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16373}
   BM_ByteStreamSplitDecode_Double_Scalar/65536  6.361 GiB/sec  6.338 GiB/sec    -0.373    {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9129}
  BM_ByteStreamSplitDecode_Double_Generic/65536  7.312 GiB/sec  7.280 GiB/sec    -0.448   {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10472}
    BM_ByteStreamSplitDecode_Double_Scalar/1024  6.890 GiB/sec  6.856 GiB/sec    -0.484   {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 629468}
     BM_ByteStreamSplitEncode_Float_Scalar/1024  5.791 GiB/sec  5.757 GiB/sec    -0.581   {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1062070}
  BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024  6.877 GiB/sec  6.825 GiB/sec    -0.751  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 720046}
     BM_ByteStreamSplitDecode_Double_Neon/65536  7.319 GiB/sec  7.253 GiB/sec    -0.906     {'family_index': 15, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10472}
 BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536  6.619 GiB/sec  6.547 GiB/sec    -1.080  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10868}
BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536  5.438 GiB/sec  5.358 GiB/sec    -1.470  {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3899}

raulcd · 2025-06-12T08:54:40Z

@ursabot please benchmark

ursabot · 2025-06-12T08:54:46Z

Benchmark runs are scheduled for commit ca35643. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2025-06-12T10:03:43Z

Here are the benchmark numbers on my local machine (AMD Ryzen 9 3900X, gcc 14.3):

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (42)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                      benchmark       baseline      contender  change %                                                                                                                                                                                       counters
  BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024  3.828 GiB/sec 60.992 GiB/sec  1493.306 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1391248}
 BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536  3.833 GiB/sec 54.779 GiB/sec  1329.148  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22088}
  BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024  4.412 GiB/sec  7.365 GiB/sec    66.918 {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1693608}
 BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536  4.584 GiB/sec  7.482 GiB/sec    63.227  {'family_index': 7, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 26266}
   BM_ByteStreamSplitEncode_Double_Generic/1024  7.398 GiB/sec  8.593 GiB/sec    16.151   {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 703188}
      BM_ByteStreamSplitEncode_Double_Sse2/1024  7.449 GiB/sec  8.622 GiB/sec    15.743     {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 730024}
      BM_ByteStreamSplitEncode_Double_Avx2/1024  7.522 GiB/sec  8.611 GiB/sec    14.487     {'family_index': 21, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 683492}
  BM_ByteStreamSplitEncode_Double_Generic/65536  7.098 GiB/sec  7.898 GiB/sec    11.280   {'family_index': 6, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10152}
     BM_ByteStreamSplitEncode_Double_Avx2/65536  7.343 GiB/sec  7.867 GiB/sec     7.130     {'family_index': 21, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10406}
     BM_ByteStreamSplitEncode_Double_Sse2/65536  7.451 GiB/sec  7.867 GiB/sec     5.583     {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10614}
BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536  3.661 GiB/sec  3.858 GiB/sec     5.388  {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2655}
   BM_ByteStreamSplitEncode_Double_Scalar/65536  4.831 GiB/sec  5.087 GiB/sec     5.282    {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7018}
       BM_ByteStreamSplitDecode_Float_Sse2/1024  7.210 GiB/sec  7.584 GiB/sec     5.176     {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1313805}
    BM_ByteStreamSplitDecode_Double_Scalar/1024  3.866 GiB/sec  4.065 GiB/sec     5.160   {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 356025}
   BM_ByteStreamSplitDecode_Double_Scalar/65536  3.806 GiB/sec  3.995 GiB/sec     4.977    {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5438}
      BM_ByteStreamSplitDecode_Float_Avx2/65536 18.926 GiB/sec 19.850 GiB/sec     4.884      {'family_index': 18, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 54121}
    BM_ByteStreamSplitDecode_Float_Scalar/65536  3.882 GiB/sec  4.065 GiB/sec     4.724    {'family_index': 10, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11103}
    BM_ByteStreamSplitEncode_Double_Scalar/1024  4.989 GiB/sec  5.217 GiB/sec     4.565   {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 455047}
 BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024  3.847 GiB/sec  4.018 GiB/sec     4.443 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 175342}
     BM_ByteStreamSplitDecode_Float_Scalar/1024  3.888 GiB/sec  4.059 GiB/sec     4.406    {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 710875}
 BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536  4.733 GiB/sec  4.930 GiB/sec     4.161   {'family_index': 8, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7786}
  BM_ByteStreamSplitDecode_Double_Generic/65536 12.680 GiB/sec 13.207 GiB/sec     4.155   {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 18332}
     BM_ByteStreamSplitDecode_Double_Avx2/65536 12.736 GiB/sec 13.263 GiB/sec     4.134     {'family_index': 19, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 18314}
       BM_ByteStreamSplitDecode_Float_Avx2/1024 18.630 GiB/sec 19.393 GiB/sec     4.094     {'family_index': 18, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3463316}
      BM_ByteStreamSplitEncode_Float_Avx2/65536 12.804 GiB/sec 13.321 GiB/sec     4.036      {'family_index': 20, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36806}
    BM_ByteStreamSplitDecode_Float_Generic/1024 18.800 GiB/sec 19.559 GiB/sec     4.035   {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3427881}
  BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024  3.809 GiB/sec  3.958 GiB/sec     3.906  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 400054}
  BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024  4.822 GiB/sec  5.010 GiB/sec     3.900  {'family_index': 8, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 506548}
   BM_ByteStreamSplitDecode_Float_Generic/65536 19.346 GiB/sec 20.089 GiB/sec     3.838    {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 55111}
      BM_ByteStreamSplitDecode_Double_Avx2/1024 13.357 GiB/sec 13.862 GiB/sec     3.777    {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1204226}
     BM_ByteStreamSplitEncode_Float_Scalar/1024  4.955 GiB/sec  5.137 GiB/sec     3.678    {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 906261}
   BM_ByteStreamSplitEncode_Float_Generic/65536 12.745 GiB/sec 13.189 GiB/sec     3.480    {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36559}
 BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536  3.807 GiB/sec  3.938 GiB/sec     3.429   {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6135}
 BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024  4.889 GiB/sec  5.056 GiB/sec     3.411 {'family_index': 9, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 224832}
      BM_ByteStreamSplitDecode_Float_Sse2/65536  7.650 GiB/sec  7.907 GiB/sec     3.355      {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 21896}
       BM_ByteStreamSplitEncode_Float_Avx2/1024 12.565 GiB/sec 12.949 GiB/sec     3.052     {'family_index': 20, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2277598}
    BM_ByteStreamSplitEncode_Float_Scalar/65536  4.927 GiB/sec  5.073 GiB/sec     2.976    {'family_index': 12, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14124}
BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536  4.739 GiB/sec  4.866 GiB/sec     2.672  {'family_index': 9, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3369}
    BM_ByteStreamSplitEncode_Float_Generic/1024 12.523 GiB/sec 12.857 GiB/sec     2.667   {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2296580}
   BM_ByteStreamSplitDecode_Double_Generic/1024 13.364 GiB/sec 13.672 GiB/sec     2.301  {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1222754}
     BM_ByteStreamSplitDecode_Double_Sse2/65536  8.820 GiB/sec  8.815 GiB/sec    -0.055     {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12677}
      BM_ByteStreamSplitDecode_Double_Sse2/1024  8.979 GiB/sec  8.958 GiB/sec    -0.234     {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 814584}

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (2)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                benchmark       baseline     contender  change %                                                                                                                                                                                   counters
BM_ByteStreamSplitEncode_Float_Sse2/65536 11.041 GiB/sec 8.209 GiB/sec   -25.647  {'family_index': 16, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 30890}
 BM_ByteStreamSplitEncode_Float_Sse2/1024 11.055 GiB/sec 8.168 GiB/sec   -26.114 {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2024308}

cpp/src/arrow/util/byte_stream_split_internal.h

pitrou · 2025-06-12T10:38:59Z

Now that we have SIMD optimizations for this, can we make sure the benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, AVX2, Neon).

conbench-apache-arrow · 2025-06-12T17:22:42Z

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit ca35643.

There were 72 benchmark results with an error:

Pull Request Run on arm64-t4g-2xlarge-linux at 2025-06-12 13:10:19Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1
and 70 more (see the report linked below)

There were 18 benchmark results indicating a performance regression:

Pull Request Run on arm64-t4g-2xlarge-linux at 2025-06-12 13:10:19Z
- BM_ByteStreamSplitEncode_Float_Neon (C++) with params=32768, source=cpp-micro, suite=parquet-encoding-benchmark
- BM_ByteStreamSplitEncode_Float_Neon (C++) with params=1024, source=cpp-micro, suite=parquet-encoding-benchmark
and 16 more (see the report linked below)

The full Conbench report has more details.

AntoinePrv · 2025-06-16T10:01:15Z

> Now that we have SIMD optimizations for this, can we make sure the benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, AVX2, Neon).

I've done so with int16_t. Since it's gated behind SIMD macros, it should not be a problem.

AntoinePrv added 4 commits June 12, 2025 10:10

Enable ByteStreamSplitDecodeSimd128<2>

5816eaf

Refactor ByteStreamSplitEncodeSimd128 to be generic over num_streams

0b68f44

Remove disjunction in ByteStreamSplitEncodeSimd128

82ac6dc

Enable ByteStreamSplitEncodeSimd128<2>

2ebe8a5

github-actions bot added Component: C++ awaiting review Awaiting review labels Jun 12, 2025

Fix conversion warning

ca35643

Fmt

17fe80b

Shorten comments

dae5965

pitrou reviewed Jun 12, 2025

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 12, 2025

AntoinePrv added 4 commits June 12, 2025 14:00

Use kPascalCase for constants

df3cc7f

Use static enabling of ByteStreamSplitEncodeSimd128<2>

d4fb898

Safer computation of simd batch size

2f37deb

Fix fail compilation

a7ff16a

AntoinePrv added 2 commits June 13, 2025 11:11

Small Encode improvement

83ebd5c

Add int16_t benchamarks

59ddd69

AntoinePrv requested a review from wgtmac as a code owner June 16, 2025 09:58

github-actions bot added the Component: Parquet label Jun 16, 2025

AntoinePrv added 2 commits June 19, 2025 10:28

Fix int16_t byte stream split benchmarks

c8765de

Remove misleading benchmarks

71f0dc0

AntoinePrv force-pushed the byte-stream branch from 914409d to 71f0dc0 Compare June 19, 2025 12:13

Fix casing

a405233
