
GH-46788: [C++] Enable SIMD for byte stream split with 2 streams #46789


Open: wants to merge 16 commits into base: main

@AntoinePrv AntoinePrv commented Jun 12, 2025

Rationale for this change

Performance improvements for byte stream split encoding with two streams.
For instance, 16-bit floats (f16) are often used in machine learning.

What changes are included in this PR?

  • ByteStreamSplitDecodeSimd128 was a straightforward beneficial change.
  • ByteStreamSplitEncodeSimd128 was significantly refactored to make it more generic. With the new implementation, we can investigate merging it with the AVX2 version.
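
For readers unfamiliar with the encoding: byte stream split stores byte k of every value in stream k, so a 2-byte type such as f16 or int16_t yields two streams stored back to back. A minimal scalar sketch of that layout follows. The names `EncodeScalar` and `DecodeScalar` are hypothetical and this is not the PR's SIMD code, only a reference for what the SIMD kernels have to compute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference: scatter byte k of each value into stream k.
// `values` holds n values of kWidth bytes each, laid out contiguously.
template <int kWidth>
std::vector<std::uint8_t> EncodeScalar(const std::vector<std::uint8_t>& values) {
  assert(values.size() % kWidth == 0);
  const std::size_t n = values.size() / kWidth;
  std::vector<std::uint8_t> out(values.size());
  for (std::size_t i = 0; i < n; ++i) {
    for (int k = 0; k < kWidth; ++k) {
      out[k * n + i] = values[i * kWidth + k];
    }
  }
  return out;
}

// Inverse operation: gather byte k of value i back from stream k.
template <int kWidth>
std::vector<std::uint8_t> DecodeScalar(const std::vector<std::uint8_t>& encoded) {
  assert(encoded.size() % kWidth == 0);
  const std::size_t n = encoded.size() / kWidth;
  std::vector<std::uint8_t> out(encoded.size());
  for (std::size_t i = 0; i < n; ++i) {
    for (int k = 0; k < kWidth; ++k) {
      out[i * kWidth + k] = encoded[k * n + i];
    }
  }
  return out;
}
```

With `kWidth == 2`, decoding amounts to interleaving two byte arrays, which maps directly onto a single SIMD interleave instruction.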

Are these changes tested?

Yes, with existing tests.

Are there any user-facing changes?

No.


⚠️ GitHub issue #46788 has been automatically assigned in GitHub to PR creator.

@AntoinePrv
Author

AntoinePrv commented Jun 12, 2025

Benchmark results on my MacBook Pro M3:
archery benchmark diff --suite-filter=parquet-encoding --benchmark-filter='ByteStreamSplit'

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (44)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                      benchmark       baseline      contender  change %                                                                                                                                                                                       counters
 BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536  6.967 GiB/sec 96.028 GiB/sec  1278.358  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 40039}
  BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024  7.008 GiB/sec 80.835 GiB/sec  1053.458 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2561954}
 BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536  5.514 GiB/sec 15.146 GiB/sec   174.667  {'family_index': 7, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 31055}
  BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024  5.527 GiB/sec 14.929 GiB/sec   170.123 {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2072177}
   BM_ByteStreamSplitEncode_Double_Generic/1024 10.424 GiB/sec 17.349 GiB/sec    66.429   {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 961842}
      BM_ByteStreamSplitEncode_Double_Neon/4096 10.955 GiB/sec 17.807 GiB/sec    62.539     {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 252562}
      BM_ByteStreamSplitEncode_Double_Neon/1024 10.984 GiB/sec 17.760 GiB/sec    61.688     {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 996966}
      BM_ByteStreamSplitEncode_Float_Neon/32768  8.316 GiB/sec 12.530 GiB/sec    50.674      {'family_index': 16, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 48757}
       BM_ByteStreamSplitEncode_Float_Neon/1024  8.517 GiB/sec 12.788 GiB/sec    50.146     {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1559934}
       BM_ByteStreamSplitEncode_Float_Neon/4096  8.585 GiB/sec 12.839 GiB/sec    49.545      {'family_index': 16, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 393889}
      BM_ByteStreamSplitEncode_Float_Neon/65536  8.379 GiB/sec 12.405 GiB/sec    48.038      {'family_index': 16, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitEncode_Float_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 24227}
   BM_ByteStreamSplitEncode_Float_Generic/65536  8.378 GiB/sec 12.370 GiB/sec    47.660    {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 24709}
    BM_ByteStreamSplitEncode_Float_Generic/1024  8.725 GiB/sec 12.785 GiB/sec    46.533   {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1601790}
  BM_ByteStreamSplitEncode_Double_Generic/65536  8.978 GiB/sec 12.445 GiB/sec    38.616   {'family_index': 6, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13637}
     BM_ByteStreamSplitEncode_Double_Neon/32768  9.748 GiB/sec 13.113 GiB/sec    34.513     {'family_index': 17, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 27479}
     BM_ByteStreamSplitEncode_Double_Neon/65536  9.869 GiB/sec 12.707 GiB/sec    28.754     {'family_index': 17, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitEncode_Double_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13956}
       BM_ByteStreamSplitDecode_Float_Neon/1024 11.112 GiB/sec 11.741 GiB/sec     5.661     {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2107120}
      BM_ByteStreamSplitDecode_Float_Neon/65536 11.203 GiB/sec 11.668 GiB/sec     4.150      {'family_index': 14, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32359}
      BM_ByteStreamSplitDecode_Float_Neon/32768 11.282 GiB/sec 11.697 GiB/sec     3.683      {'family_index': 14, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 64967}
   BM_ByteStreamSplitDecode_Float_Generic/65536 11.235 GiB/sec 11.541 GiB/sec     2.724    {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32346}
       BM_ByteStreamSplitDecode_Float_Neon/4096 11.636 GiB/sec 11.951 GiB/sec     2.703      {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 529601}
   BM_ByteStreamSplitEncode_Double_Scalar/65536  5.630 GiB/sec  5.708 GiB/sec     1.396    {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7926}
  BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024  5.782 GiB/sec  5.855 GiB/sec     1.257  {'family_index': 8, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 610900}
    BM_ByteStreamSplitDecode_Float_Generic/1024 11.655 GiB/sec 11.797 GiB/sec     1.222   {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2142232}
    BM_ByteStreamSplitDecode_Float_Scalar/65536  6.898 GiB/sec  6.930 GiB/sec     0.466    {'family_index': 10, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19864}
      BM_ByteStreamSplitDecode_Double_Neon/4096  7.516 GiB/sec  7.543 GiB/sec     0.362     {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/4096', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 171879}
 BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024  5.877 GiB/sec  5.889 GiB/sec     0.212 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 270124}
    BM_ByteStreamSplitEncode_Double_Scalar/1024  5.799 GiB/sec  5.808 GiB/sec     0.159   {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 532964}
   BM_ByteStreamSplitDecode_Double_Generic/1024  7.526 GiB/sec  7.521 GiB/sec    -0.056   {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 691256}
     BM_ByteStreamSplitDecode_Double_Neon/32768  7.300 GiB/sec  7.296 GiB/sec    -0.058     {'family_index': 15, 'per_family_instance_index': 2, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/32768', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 20832}
     BM_ByteStreamSplitDecode_Float_Scalar/1024  6.897 GiB/sec  6.891 GiB/sec    -0.079   {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1272218}
 BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536  5.760 GiB/sec  5.755 GiB/sec    -0.091   {'family_index': 8, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9473}
BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536  3.345 GiB/sec  3.340 GiB/sec    -0.154  {'family_index': 9, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2353}
      BM_ByteStreamSplitDecode_Double_Neon/1024  7.551 GiB/sec  7.529 GiB/sec    -0.289     {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 671811}
 BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024  5.874 GiB/sec  5.855 GiB/sec    -0.322 {'family_index': 9, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 267179}
    BM_ByteStreamSplitEncode_Float_Scalar/65536  5.694 GiB/sec  5.673 GiB/sec    -0.357    {'family_index': 12, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16373}
   BM_ByteStreamSplitDecode_Double_Scalar/65536  6.361 GiB/sec  6.338 GiB/sec    -0.373    {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 9129}
  BM_ByteStreamSplitDecode_Double_Generic/65536  7.312 GiB/sec  7.280 GiB/sec    -0.448   {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10472}
    BM_ByteStreamSplitDecode_Double_Scalar/1024  6.890 GiB/sec  6.856 GiB/sec    -0.484   {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 629468}
     BM_ByteStreamSplitEncode_Float_Scalar/1024  5.791 GiB/sec  5.757 GiB/sec    -0.581   {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1062070}
  BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024  6.877 GiB/sec  6.825 GiB/sec    -0.751  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 720046}
     BM_ByteStreamSplitDecode_Double_Neon/65536  7.319 GiB/sec  7.253 GiB/sec    -0.906     {'family_index': 15, 'per_family_instance_index': 3, 'run_name': 'BM_ByteStreamSplitDecode_Double_Neon/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10472}
 BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536  6.619 GiB/sec  6.547 GiB/sec    -1.080  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10868}
BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536  5.438 GiB/sec  5.358 GiB/sec    -1.470  {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3899}
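
As a reading aid for the `change %` column: the figures are consistent with a plain relative change of the contender throughput over the baseline. The sketch below assumes that formula (it is inferred from the numbers above, not taken from the archery source), and `ChangePercent` and `MatchesReported` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Relative change in percent, assuming higher throughput is better.
// Assumed formula: (contender - baseline) / baseline * 100.
double ChangePercent(double baseline, double contender) {
  assert(baseline > 0.0);
  return (contender - baseline) / baseline * 100.0;
}

// True when the recomputed change is within `tolerance` percentage points
// of the value reported in the table.
bool MatchesReported(double baseline, double contender, double reported,
                     double tolerance) {
  return std::abs(ChangePercent(baseline, contender) - reported) <= tolerance;
}
```

With the rounded figures from the first row (6.967 and 96.028 GiB/sec), this gives about 1278.3, close to the reported 1278.358. The small gap presumably comes from the table rounding the throughputs.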

@raulcd
Member

raulcd commented Jun 12, 2025

@ursabot please benchmark

@ursabot

ursabot commented Jun 12, 2025

Benchmark runs are scheduled for commit ca35643. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou
Member

pitrou commented Jun 12, 2025

Here are the benchmark numbers on my local machine (AMD Ryzen 9 3900X, gcc 14.3):

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (42)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                      benchmark       baseline      contender  change %                                                                                                                                                                                       counters
  BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024  3.828 GiB/sec 60.992 GiB/sec  1493.306 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1391248}
 BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536  3.833 GiB/sec 54.779 GiB/sec  1329.148  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22088}
  BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024  4.412 GiB/sec  7.365 GiB/sec    66.918 {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1693608}
 BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536  4.584 GiB/sec  7.482 GiB/sec    63.227  {'family_index': 7, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<2>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 26266}
   BM_ByteStreamSplitEncode_Double_Generic/1024  7.398 GiB/sec  8.593 GiB/sec    16.151   {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 703188}
      BM_ByteStreamSplitEncode_Double_Sse2/1024  7.449 GiB/sec  8.622 GiB/sec    15.743     {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 730024}
      BM_ByteStreamSplitEncode_Double_Avx2/1024  7.522 GiB/sec  8.611 GiB/sec    14.487     {'family_index': 21, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 683492}
  BM_ByteStreamSplitEncode_Double_Generic/65536  7.098 GiB/sec  7.898 GiB/sec    11.280   {'family_index': 6, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10152}
     BM_ByteStreamSplitEncode_Double_Avx2/65536  7.343 GiB/sec  7.867 GiB/sec     7.130     {'family_index': 21, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10406}
     BM_ByteStreamSplitEncode_Double_Sse2/65536  7.451 GiB/sec  7.867 GiB/sec     5.583     {'family_index': 17, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10614}
BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536  3.661 GiB/sec  3.858 GiB/sec     5.388  {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2655}
   BM_ByteStreamSplitEncode_Double_Scalar/65536  4.831 GiB/sec  5.087 GiB/sec     5.282    {'family_index': 13, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7018}
       BM_ByteStreamSplitDecode_Float_Sse2/1024  7.210 GiB/sec  7.584 GiB/sec     5.176     {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1313805}
    BM_ByteStreamSplitDecode_Double_Scalar/1024  3.866 GiB/sec  4.065 GiB/sec     5.160   {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 356025}
   BM_ByteStreamSplitDecode_Double_Scalar/65536  3.806 GiB/sec  3.995 GiB/sec     4.977    {'family_index': 11, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5438}
      BM_ByteStreamSplitDecode_Float_Avx2/65536 18.926 GiB/sec 19.850 GiB/sec     4.884      {'family_index': 18, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 54121}
    BM_ByteStreamSplitDecode_Float_Scalar/65536  3.882 GiB/sec  4.065 GiB/sec     4.724    {'family_index': 10, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11103}
    BM_ByteStreamSplitEncode_Double_Scalar/1024  4.989 GiB/sec  5.217 GiB/sec     4.565   {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Double_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 455047}
 BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024  3.847 GiB/sec  4.018 GiB/sec     4.443 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 175342}
     BM_ByteStreamSplitDecode_Float_Scalar/1024  3.888 GiB/sec  4.059 GiB/sec     4.406    {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 710875}
 BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536  4.733 GiB/sec  4.930 GiB/sec     4.161   {'family_index': 8, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7786}
  BM_ByteStreamSplitDecode_Double_Generic/65536 12.680 GiB/sec 13.207 GiB/sec     4.155   {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 18332}
     BM_ByteStreamSplitDecode_Double_Avx2/65536 12.736 GiB/sec 13.263 GiB/sec     4.134     {'family_index': 19, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 18314}
       BM_ByteStreamSplitDecode_Float_Avx2/1024 18.630 GiB/sec 19.393 GiB/sec     4.094     {'family_index': 18, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3463316}
      BM_ByteStreamSplitEncode_Float_Avx2/65536 12.804 GiB/sec 13.321 GiB/sec     4.036      {'family_index': 20, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Avx2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36806}
    BM_ByteStreamSplitDecode_Float_Generic/1024 18.800 GiB/sec 19.559 GiB/sec     4.035   {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3427881}
  BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024  3.809 GiB/sec  3.958 GiB/sec     3.906  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 400054}
  BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024  4.822 GiB/sec  5.010 GiB/sec     3.900  {'family_index': 8, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<7>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 506548}
   BM_ByteStreamSplitDecode_Float_Generic/65536 19.346 GiB/sec 20.089 GiB/sec     3.838    {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 55111}
      BM_ByteStreamSplitDecode_Double_Avx2/1024 13.357 GiB/sec 13.862 GiB/sec     3.777    {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1204226}
     BM_ByteStreamSplitEncode_Float_Scalar/1024  4.955 GiB/sec  5.137 GiB/sec     3.678    {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 906261}
   BM_ByteStreamSplitEncode_Float_Generic/65536 12.745 GiB/sec 13.189 GiB/sec     3.480    {'family_index': 5, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 36559}
 BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536  3.807 GiB/sec  3.938 GiB/sec     3.429   {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_FLBA_Generic<7>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6135}
 BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024  4.889 GiB/sec  5.056 GiB/sec     3.411 {'family_index': 9, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 224832}
      BM_ByteStreamSplitDecode_Float_Sse2/65536  7.650 GiB/sec  7.907 GiB/sec     3.355      {'family_index': 14, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Float_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 21896}
       BM_ByteStreamSplitEncode_Float_Avx2/1024 12.565 GiB/sec 12.949 GiB/sec     3.052     {'family_index': 20, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Avx2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2277598}
    BM_ByteStreamSplitEncode_Float_Scalar/65536  4.927 GiB/sec  5.073 GiB/sec     2.976    {'family_index': 12, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Scalar/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14124}
BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536  4.739 GiB/sec  4.866 GiB/sec     2.672  {'family_index': 9, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_FLBA_Generic<16>/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3369}
    BM_ByteStreamSplitEncode_Float_Generic/1024 12.523 GiB/sec 12.857 GiB/sec     2.667   {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2296580}
   BM_ByteStreamSplitDecode_Double_Generic/1024 13.364 GiB/sec 13.672 GiB/sec     2.301  {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Generic/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1222754}
     BM_ByteStreamSplitDecode_Double_Sse2/65536  8.820 GiB/sec  8.815 GiB/sec    -0.055     {'family_index': 15, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12677}
      BM_ByteStreamSplitDecode_Double_Sse2/1024  8.979 GiB/sec  8.958 GiB/sec    -0.234     {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitDecode_Double_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 814584}

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (2)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                benchmark       baseline     contender  change %                                                                                                                                                                                   counters
BM_ByteStreamSplitEncode_Float_Sse2/65536 11.041 GiB/sec 8.209 GiB/sec   -25.647  {'family_index': 16, 'per_family_instance_index': 1, 'run_name': 'BM_ByteStreamSplitEncode_Float_Sse2/65536', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 30890}
 BM_ByteStreamSplitEncode_Float_Sse2/1024 11.055 GiB/sec 8.168 GiB/sec   -26.114 {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BM_ByteStreamSplitEncode_Float_Sse2/1024', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2024308}

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jun 12, 2025

@pitrou
Member

pitrou commented Jun 12, 2025

Now that we have SIMD optimizations for this, can we make sure the benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, AVX2, Neon).


Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit ca35643.

There were 72 benchmark results with an error:

There were 18 benchmark results indicating a performance regression:

The full Conbench report has more details.

@AntoinePrv
Author

> Now that we have SIMD optimizations for this, can we make sure the benchmarks cover the different cases? Scalar and the various SIMD kinds (SSE, AVX2, Neon).

I've done so with int16_t. Since it's gated behind SIMD macros, it should not be a problem.
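
As an illustration of what gating a kernel behind SIMD macros can look like, here is a hedged sketch of a two-stream decode for 2-byte values such as int16_t. It uses the compiler-predefined macros `__SSE2__` and `__ARM_NEON` as stand-ins (Arrow's own macros and kernels differ), and `DecodeTwoStreams` is a hypothetical name, not a function from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__SSE2__)
#include <emmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical sketch: interleave stream 0 and stream 1 back into 2-byte values.
std::vector<std::uint8_t> DecodeTwoStreams(const std::vector<std::uint8_t>& encoded) {
  assert(encoded.size() % 2 == 0);
  const std::size_t n = encoded.size() / 2;  // number of 2-byte values
  const std::uint8_t* s0 = encoded.data();   // byte 0 of every value
  const std::uint8_t* s1 = s0 + n;           // byte 1 of every value
  std::vector<std::uint8_t> out(encoded.size());
  std::uint8_t* dst = out.data();
  std::size_t i = 0;
#if defined(__SSE2__)
  // 16 values per iteration: unpacklo/unpackhi interleave bytes of a and b.
  for (; i + 16 <= n; i += 16) {
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s0 + i));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s1 + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i),
                     _mm_unpacklo_epi8(a, b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i + 16),
                     _mm_unpackhi_epi8(a, b));
  }
#elif defined(__ARM_NEON)
  // 16 values per iteration: vst2q_u8 performs an interleaving store.
  for (; i + 16 <= n; i += 16) {
    uint8x16x2_t pair;
    pair.val[0] = vld1q_u8(s0 + i);
    pair.val[1] = vld1q_u8(s1 + i);
    vst2q_u8(dst + 2 * i, pair);
  }
#endif
  // Scalar tail; also the whole loop when no SIMD macro is defined.
  for (; i < n; ++i) {
    dst[2 * i] = s0[i];
    dst[2 * i + 1] = s1[i];
  }
  return out;
}
```

Because the SIMD paths sit behind preprocessor checks, a build without those instruction sets compiles only the scalar loop, which is why adding SIMD-specific benchmarks under the same kind of guard should be harmless.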
