[SYSTEMDS-3874] Java17 Vectorized LibMM #2216

Closed
wants to merge 11 commits into from

Baunsgaard
Contributor

This PR contains code for vectorized matrix multiplication (MM). It also includes some vectorized instructions for CLA (compressed linear algebra).

@Baunsgaard changed the title from "[SYSTEMDS-???] Java17 Vectorized LibMM" to "[SYSTEMDS-3874] Java17 Vectorized LibMM" on May 15, 2025
This commit adds vectorized kernels for matrix multiplication.

fix mm error

Perf mm

bigger scale

remove compile log
@Baunsgaard
Contributor Author

The results show that the Vector API improves performance: for single-threaded dense MM, our AMD box improves by ~80% and our Intel box by ~60%. These numbers include the allocation overhead of the output, and they are measured in ideal cases where the inputs are properly cached and JIT compilation has already happened.
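For readers who want to reproduce these measurement conditions in isolation, here is a minimal sketch of a timing loop that runs untimed warm-up iterations first (so the JIT has compiled the kernel) and includes the output allocation in the timed region. It is not the PR's benchmark code: `MMBenchSketch`, `multiply`, `time`, and `meanStd` are hypothetical names, the kernel is a plain scalar loop rather than the vectorized one, and the population standard deviation is an assumption about how the `+-` column is computed.

```java
import java.util.Arrays;
import java.util.function.Supplier;

final class MMBenchSketch {
    /** Returns {mean, population standard deviation} of the given timings. */
    public static double[] meanStd(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        double mean = sum / xs.length;
        double sq = 0;
        for (double x : xs) sq += (x - mean) * (x - mean);
        return new double[] {mean, Math.sqrt(sq / xs.length)};
    }

    /** Scalar dense multiply of row-major a (m x k) and b (k x n). */
    public static double[] multiply(double[] a, double[] b, int m, int k, int n) {
        double[] c = new double[m * n]; // output allocation is part of the timed work
        for (int i = 0; i < m; i++)
            for (int p = 0; p < k; p++) {
                double av = a[i * k + p];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += av * b[p * n + j];
            }
        return c;
    }

    /** Runs warm-up iterations untimed, then returns per-repetition times in ms. */
    public static double[] time(Supplier<double[]> work, int warmup, int reps) {
        double sink = 0;
        for (int i = 0; i < warmup; i++) sink += work.get()[0];
        double[] ms = new double[reps];
        for (int i = 0; i < reps; i++) {
            long t0 = System.nanoTime();
            double[] out = work.get();
            ms[i] = (System.nanoTime() - t0) / 1e6;
            sink += out[0];
        }
        if (Double.isNaN(sink)) System.out.println(sink); // keeps the results live
        return ms;
    }

    public static void main(String[] args) {
        int m = 64, k = 64, n = 64; // small illustrative sizes, not the PR's
        double[] a = new double[m * k];
        double[] b = new double[k * n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);
        double[] s = meanStd(time(() -> multiply(a, b, m, k, n), 50, 200));
        System.out.printf("mm SingleThread, %8.3f+-%7.3f ms%n", s[0], s[1]);
    }
}
```

Reusing the same small inputs on every repetition keeps them cached, which matches the "ideal case" caveat; a cold-cache measurement would need fresh inputs per repetition.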

The failing tests are related to some sparse kernels that produce wrong results; these will be fixed shortly.

Script to evaluate mm performance:

./src/test/scripts/performance/matrixMultiplication.sh

After:

SU1
MM Perf : rep 5000 -- [1009, 100, 100, 1000, 1, 1]
                    mm SingleThread,    2.170+-  0.129 ms,           
                mm MultiThread: 128,    1.675+-  0.198 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 1000, 1, 1]
                    mm SingleThread,  215.661+- 44.191 ms,           
                mm MultiThread: 128,   23.836+-  1.566 ms,           
MM Perf : rep 100 -- [1009, 10000, 1000, 1000, 1, 1]
                    mm SingleThread, 2189.817+-407.982 ms,           
                mm MultiThread: 128,   62.026+-  2.399 ms,           
MM Perf : rep 100 -- [1009, 1000, 10000, 1000, 1, 1]
                    mm SingleThread, 2084.534+- 42.736 ms,           
                mm MultiThread: 128,   57.026+-  2.680 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 10000, 1, 1]
                    mm SingleThread, 2128.684+- 40.128 ms,           
                mm MultiThread: 128,   72.801+-  6.310 ms,           
MM Perf : rep 100 -- [1009, 10000, 10000, 10000, 1, 1]
                    mm SingleThread, 212909.303+-225.702 ms,           
                mm MultiThread: 128, 4725.499+- 51.779 ms, 
            
SO009
MM Perf : rep 5000 -- [1009, 100, 100, 1000, 1, 1]
                    mm SingleThread,    2.239+-  0.289 ms,           
                 mm MultiThread: 48,    0.360+-  0.052 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 1000, 1, 1]
                    mm SingleThread,  189.776+- 74.425 ms,           
                 mm MultiThread: 48,   12.043+-  0.635 ms,           
MM Perf : rep 100 -- [1009, 10000, 1000, 1000, 1, 1]
                    mm SingleThread, 1124.289+-273.464 ms,           
                 mm MultiThread: 48,   77.630+-  8.706 ms,           
MM Perf : rep 100 -- [1009, 1000, 10000, 1000, 1, 1]
                    mm SingleThread, 1152.994+-181.044 ms,           
                 mm MultiThread: 48,   74.188+- 15.157 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 10000, 1, 1]
                    mm SingleThread, 1171.379+- 48.588 ms,           
                 mm MultiThread: 48,   87.808+- 15.199 ms,           
MM Perf : rep 100 -- [1009, 10000, 10000, 10000, 1, 1]
                    mm SingleThread, 112851.281+-1033.173 ms,           
                 mm MultiThread: 48, 6611.407+- 77.775 ms, 

Before:

SU1
MM Perf : rep 5000 -- [1009, 100, 100, 1000, 1, 1]
                    mm SingleThread,    3.991+-  0.843 ms,           
                mm MultiThread: 128,    1.214+-  0.368 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 1000, 1, 1]
                    mm SingleThread,  331.139+- 13.338 ms,           
                mm MultiThread: 128,   32.462+-  3.078 ms,           
MM Perf : rep 100 -- [1009, 10000, 1000, 1000, 1, 1]
                    mm SingleThread, 3489.530+-549.035 ms,           
                mm MultiThread: 128,  115.628+-  3.087 ms,           
MM Perf : rep 100 -- [1009, 1000, 10000, 1000, 1, 1]
                    mm SingleThread, 3396.771+- 18.842 ms,           
                mm MultiThread: 128,  107.051+-  3.761 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 10000, 1, 1]
                    mm SingleThread, 3513.700+-370.610 ms,           
                mm MultiThread: 128,  117.059+-  3.606 ms,           
MM Perf : rep 100 -- [1009, 10000, 10000, 10000, 1, 1]
                mm MultiThread: 128, 8428.286+- 71.497 ms,

SO009
MM Perf : rep 5000 -- [1009, 100, 100, 1000, 1, 1]
                    mm SingleThread,    3.409+-  0.111 ms,           
                 mm MultiThread: 48,    0.548+-  0.043 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 1000, 1, 1]
                    mm SingleThread,  283.795+-108.979 ms,           
                 mm MultiThread: 48,   21.052+-  0.719 ms,           
MM Perf : rep 100 -- [1009, 10000, 1000, 1000, 1, 1]
                    mm SingleThread, 2016.267+-199.360 ms,           
                 mm MultiThread: 48,  125.341+- 15.542 ms,           
MM Perf : rep 100 -- [1009, 1000, 10000, 1000, 1, 1]
                    mm SingleThread, 2075.305+-190.909 ms,           
                 mm MultiThread: 48,  123.638+- 24.349 ms,           
MM Perf : rep 100 -- [1009, 1000, 1000, 10000, 1, 1]
                    mm SingleThread, 2100.731+-189.725 ms,           
                 mm MultiThread: 48,  132.040+- 24.115 ms,           
MM Perf : rep 100 -- [1009, 10000, 10000, 10000, 1, 1]
                 mm MultiThread: 48, 11441.370+-103.849 ms,  
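To compare the two runs, the mean times can be turned into speedup factors. The sketch below does this for a few single-threaded rows copied from the Before and After listings; the class and method names (`SpeedupSketch`, `speedup`, `reductionPct`) are hypothetical, and the configuration strings are reproduced verbatim without interpreting their fields.

```java
final class SpeedupSketch {
    /** How many times faster the new run is than the old one. */
    public static double speedup(double beforeMs, double afterMs) {
        return beforeMs / afterMs;
    }

    /** Relative reduction in runtime, in percent of the old runtime. */
    public static double reductionPct(double beforeMs, double afterMs) {
        return 100.0 * (beforeMs - afterMs) / beforeMs;
    }

    public static void main(String[] args) {
        // Mean single-threaded times in ms, copied from the listings above.
        String[] label = {
            "SU1   [1009, 1000, 1000, 1000, 1, 1]",
            "SU1   [1009, 10000, 1000, 1000, 1, 1]",
            "SO009 [1009, 1000, 1000, 1000, 1, 1]",
            "SO009 [1009, 10000, 1000, 1000, 1, 1]",
        };
        double[] before = {331.139, 3489.530, 283.795, 2016.267};
        double[] after = {215.661, 2189.817, 189.776, 1124.289};
        for (int i = 0; i < label.length; i++) {
            System.out.printf("%s: %.2fx faster, %.1f%% less time%n",
                label[i], speedup(before[i], after[i]),
                reductionPct(before[i], after[i]));
        }
    }
}
```

Note that "x% faster" and "x% less time" are different quantities: a run that takes half as long is 2.0x faster but uses 50% less time, so it helps to state which one a summary figure refers to.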

codecov bot commented May 22, 2025

Codecov Report

Attention: Patch coverage is 90.51095% with 13 lines in your changes missing coverage. Please review.

Project coverage is 72.95%. Comparing base (499b6e3) to head (1a819a3).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/sysds/runtime/compress/colgroup/ColGroupDDC.java | 41.66% | 7 Missing ⚠️ |
| ...pache/sysds/runtime/matrix/data/LibMatrixMult.java | 92.50% | 6 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2216   +/-   ##
=========================================
  Coverage     72.94%   72.95%           
- Complexity    46061    46076   +15     
=========================================
  Files          1479     1479           
  Lines        172576   172613   +37     
  Branches      33776    33783    +7     
=========================================
+ Hits         125893   125928   +35     
- Misses        37186    37194    +8     
+ Partials       9497     9491    -6     

@github-project-automation github-project-automation bot moved this from In Progress to Done in SystemDS PR Queue May 22, 2025