[SYSTEMDS-??] Upgrade GPU backend to latest NVIDIA software stack #2271

Open · wants to merge 14 commits into base: main

Conversation

ReneEnjilian
Contributor

Purpose

Upgrade the GPU backend to the latest NVIDIA software stack:

  • CUDA 12.6
  • cuSPARSE 12
  • cuBLAS 12
  • cuDNN 9

What Changed

1. Dependencies:

  • Removed every jcuda-apple entry from pom.xml (CUDA support for macOS ended with toolkit 10.2).
  • Updated JCuda from version 10.2.0 to 12.6.0.

2. Code:

  • Replaced or rewrote functions that used CUDA, cuSPARSE, or cuDNN APIs that are deprecated or removed in the 12.x / 9.x series (see the sketch below).
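
To give a flavor of these changes, here is a minimal sketch (illustrative, not code taken from this PR) of the kind of migration involved: legacy one-call routines such as cusparseDcsrmv were removed, and the CUDA 12 generic API requires explicit descriptors plus a caller-managed workspace, assuming JCuda 12 mirrors the native signatures (which it does by design).

import jcuda.Pointer;
import jcuda.jcusparse.*;
import jcuda.runtime.JCuda;
import static jcuda.jcusparse.JCusparse.*;
import static jcuda.cudaDataType.CUDA_R_64F;

// y = A * x for a double-precision m x n CSR matrix A with nnz non-zeros;
// all Pointer arguments are device pointers, alpha/beta live on the host.
static void spmv(cusparseHandle handle, int m, int n, int nnz,
        Pointer csrRowPtr, Pointer csrColInd, Pointer csrVal,
        Pointer x, Pointer y) {
    Pointer alpha = Pointer.to(new double[]{ 1.0 });
    Pointer beta  = Pointer.to(new double[]{ 0.0 });

    // CUDA 12 style: wrap raw device pointers in generic descriptors.
    cusparseSpMatDescr matA = new cusparseSpMatDescr();
    cusparseCreateCsr(matA, m, n, nnz, csrRowPtr, csrColInd, csrVal,
        cusparseIndexType.CUSPARSE_INDEX_32I, cusparseIndexType.CUSPARSE_INDEX_32I,
        cusparseIndexBase.CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
    cusparseDnVecDescr vecX = new cusparseDnVecDescr();
    cusparseCreateDnVec(vecX, n, x, CUDA_R_64F);
    cusparseDnVecDescr vecY = new cusparseDnVecDescr();
    cusparseCreateDnVec(vecY, m, y, CUDA_R_64F);

    // The generic API needs an explicit workspace: query its size, allocate, execute.
    long[] bufferSize = new long[1];
    cusparseSpMV_bufferSize(handle, cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE,
        alpha, matA, vecX, beta, vecY, CUDA_R_64F,
        cusparseSpMVAlg.CUSPARSE_SPMV_ALG_DEFAULT, bufferSize);
    Pointer buffer = new Pointer();
    JCuda.cudaMalloc(buffer, Math.max(bufferSize[0], 1));
    cusparseSpMV(handle, cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE,
        alpha, matA, vecX, beta, vecY, CUDA_R_64F,
        cusparseSpMVAlg.CUSPARSE_SPMV_ALG_DEFAULT, buffer);

    // The descriptor/workspace teardown is part of the "longer pipelines" noted below.
    JCuda.cudaFree(buffer);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
}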

Notes

  • No macOS GPU build is possible after this upgrade (the CPU backend is unaffected).
  • All changes were developed and tested against these exact versions: CUDA 12.6.85, cuSPARSE 12.5.4, cuBLAS 12.6.4, and cuDNN 9.10.1.
  • Deprecated calls were replaced strictly according to NVIDIA’s official porting guides; the additional setup code (e.g., new descriptors, longer call pipelines) is the expected overhead described in those guides and is therefore correct.

Next Steps

We still rely on JCuda as the Java–CUDA bridge. Now that the backend runs on the newest NVIDIA toolchain, the logical next phase is to replace JCuda with our own JNI bindings. Because JCuda itself is only a thin JNI wrapper, the migration is conceptually straightforward—essentially re-exposing the functions we actually use—but it will take time to:

  • Generate a minimal JNI layer for the required CUDA / cuSPARSE / cuBLAS / cuDNN calls.

  • Add build scripts to compile the native library on Linux and Windows.

  • Incrementally swap JCuda classes for the new binding with tests at every step.

The payoff is full control over versioning and fewer external dependencies once the work is complete.
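
As a rough illustration of the target (a sketch; the class, library, and method names below are hypothetical, not an existing SystemDS API), the Java side of such a binding can stay very small when device pointers travel as raw longs:

// Hypothetical minimal JNI binding; "systemds_cuda" is an assumed native library
// whose hand-written C stubs map each method 1:1 onto the CUDA/cuBLAS call of
// the same name.
public final class SystemDSCuda {
    static { System.loadLibrary("systemds_cuda"); } // libsystemds_cuda.so / systemds_cuda.dll

    // cudaMalloc: writes the new device address into devPtr[0], returns the CUDA status code.
    public static native int cudaMalloc(long[] devPtr, long size);
    public static native int cudaFree(long devPtr);
    public static native int cudaMemcpyHostToDevice(long dst, double[] src, long bytes);

    // cublasDgemm: device buffers a, b, c are passed as raw device addresses.
    public static native int cublasDgemm(long handle, int transa, int transb,
        int m, int n, int k, double alpha, long a, int lda,
        long b, int ldb, double beta, long c, int ldc);
}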

@github-project-automation github-project-automation bot moved this to In Progress in SystemDS PR Queue Jun 5, 2025
@j143 j143 closed this Jun 6, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in SystemDS PR Queue Jun 6, 2025
@j143
Contributor

j143 commented Jun 6, 2025

This one is cool. Thanks for working on this! 🙌

Mistakenly closed the PR while giving run permission to CI. Sorry!

@j143 j143 reopened this Jun 6, 2025

codecov bot commented Jun 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 514 lines in your changes missing coverage. Please review.

Project coverage is 72.77%. Comparing base (38b73ae) to head (4e76b77).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...trix/data/SinglePrecisionCudaSupportFunctions.java 0.00% 170 Missing ⚠️
...trix/data/DoublePrecisionCudaSupportFunctions.java 0.00% 163 Missing ⚠️
...ache/sysds/runtime/matrix/data/LibMatrixCuDNN.java 0.00% 64 Missing ⚠️
...s/runtime/instructions/gpu/context/CSRPointer.java 0.00% 54 Missing ⚠️
...atrix/data/LibMatrixCuDNNConvolutionAlgorithm.java 0.00% 37 Missing ⚠️
...untime/matrix/data/LibMatrixCuDNNRnnAlgorithm.java 0.00% 26 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2271      +/-   ##
============================================
- Coverage     72.94%   72.77%   -0.18%     
- Complexity    46070    46073       +3     
============================================
  Files          1479     1479              
  Lines        172616   173031     +415     
  Branches      33783    33855      +72     
============================================
+ Hits         125922   125925       +3     
- Misses        37202    37611     +409     
- Partials       9492     9495       +3     


@phaniarnab
Contributor

Thanks @ReneEnjilian.
This is good work, but we need to test well before merging this in.
Can you please use Mark's testing framework inside SystemDS, which essentially enables GPU so that all DML script execution uses GPU operators, and then execute the typical builtins? We need to confirm that all the API changes from 10.2 to 12.6 are implemented.

@ReneEnjilian
Contributor Author

Hi @phaniarnab,
by Mark's testing framework, do you mean the files CellwiseTmplTest.java and RowAggTmplTest.java under test/gpu/codegen? If so, I just ran all the tests specified there and they passed. Moreover, it is not possible that API changes were missed, because then the project would not build at all: JCuda 12 does not cover deprecated API calls from previous CUDA toolkits. My strategy for upgrading the backend was the following:

  1. I upgraded JCuda in the pom.xml and initiated a fresh build via Maven.
  2. The build failed with "cannot resolve symbol" errors for API calls that are no longer part of JCuda 12 (and thus CUDA 12).
  3. I fixed those deprecated API calls and initiated a fresh build again; Maven then surfaced the next set of removed API calls via the same error.
  4. I repeated this process until the build was successful and all symbols could be resolved.

I hope this clarifies why there cannot be any deprecated calls anymore.

@phaniarnab
Contributor

Thanks for clarifying, @ReneEnjilian. This is good. 👍🏽
We still need to run regression tests to catch any unnecessary side effects and to increase our confidence in this major upgrade before we merge it. I recommend the following:

  • Execute our NN scripts with GPU enabled and verify that the GPU is being used.
  • Execute one of the non-NN scripts, such as GMM, which generates many H2D and D2H copies.
  • Run some of our NN scripts (with large inputs) on the scale-up node. You can use synthetic data. I often faced an issue with long loading times of the cuDNN libraries into GPU memory (I cannot explain why); I don't expect that to happen with the updated CUDA versions.

All of these should be fairly easy to set up and run. You of course don't need to fix any unrelated bugs as part of this PR.

@ReneEnjilian
Contributor Author

I think this approach is problematic. Many of these scripts appeared to be buggy during testing, and I doubt that they were written and tested with GPUs in mind. When errors occur, it is unclear whether they are caused by my changes or were there in the first place. The entire backend (excluding the kernels implemented in SystemDS.cu) is completely untested; literally 0% test coverage. I will try to add test coverage to the backend (specifically the parts that rely on cuDNN, cuSPARSE, and cuBLAS). For now, could you please provide an NN script that definitely worked with the old setup (CUDA 10.2)? It would also be great if you could tell me exactly how you executed the program, for example whether you used "-gpu" or "-gpu force".

@phaniarnab
Contributor

Sure, I will provide some tests, @ReneEnjilian. The GPU backend is well tested: Mark added an infrastructure to run all tests with GPU, and I later kept using it for my GPU tests, which cover the cuDNN and cuSPARSE libraries as well as fused operators like conv2d-bias-add. Here are two examples from yesterday's run on my laptop:

Grid search hyperparameter tuning for LM (no cuDNN):

SystemDS Statistics:
Total elapsed time:		9.070 sec.
Total compilation time:		2.966 sec.
Total execution time:		6.103 sec.
Number of compiled Spark inst:	124.
Number of executed Spark inst:	0.
CUDA/CuLibraries init time:	1.123/1.621 sec.
Number of executed GPU inst:	11452.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):	0.061(0.061/0.000) / 0.058 / 0.154 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0):	3156(3156/0/17376) / 2527 / 20532.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):	0.136(0.000/0.000) / 0.244(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):	2605(0/0) / 2527(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp):	0.000 / 0.018 / 0.802 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp):	0 / 210 / 210.
Cache hits (Mem/Li/WB/FS/HDFS):	6482/0/0/0/0.
Cache writes (Li/WB/FS/HDFS):	1/384/0/1.
Cache times (ACQr/m, RLS, EXP):	0.290/0.010/0.051/0.162 sec.
HOP DAGs recompiled (PRED, SB):	525/6405.
HOP DAGs recompile time:	1.417 sec.
Functions recompiled:		2.
Functions recompile time:	0.008 sec.
Spark ctx create time (lazy):	0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col):	0.000/0.000/0.000 secs.
Async. OP count (pf,bc,op): 	260/0/0.
Total JIT compile time:		36.875 sec.
Total JVM GC count:		2.
Total JVM GC time:		0.014 sec.
Heavy hitter instructions:
  #  Instruction  Time(s)  Count
  1  m_lm           4.865    525
  2  m_lmDS         4.744    525
  3  gpu_solve      2.118    525
  4  gpu_+          0.955   1225
  5  l2norm         0.401    525
  6  leftIndex      0.247   1925
  7  write          0.163      1
  8  gpu_ba+*       0.126   1575
  9  gpu_*          0.125   1470
 10  gpu_append     0.093    700

ResNet18 (w/ cuDNN):

SystemDS Statistics:
Total elapsed time:		681.490 sec.
Total compilation time:		1.958 sec.
Total execution time:		679.532 sec.
CUDA/CuLibraries init time:	0.686/669.978 sec.
Number of executed GPU inst:	258.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):	0.016(0.016/0.000) / 0.000 / 0.021 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0):	99(99/0/635) / 4 / 734.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):	9.119(0.000/0.000) / 0.002(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):	88(0/0) / 4(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp):	0.000 / 1.102 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp):	0 / 74 / 0.
Cache hits (Mem/Li/WB/FS/HDFS):	170/0/0/0/0.
Cache writes (Li/WB/FS/HDFS):	18/0/0/0.
Cache times (ACQr/m, RLS, EXP):	0.004/0.002/0.005/0.000 sec.
HOP DAGs recompiled (PRED, SB):	0/241.
HOP DAGs recompile time:	0.265 sec.
Functions recompiled:		1.
Functions recompile time:	0.024 sec.
Total JIT compile time:		10.543 sec.
Total JVM GC count:		1.
Total JVM GC time:		0.015 sec.
Heavy hitter instructions:
  #  Instruction          Time(s)  Count
  1  gpu_conv2d_bias_add  669.544     61
  2  resnet18_forward       9.731      1
  3  bn2d_forward           9.188     60
  4  gpu_batch_norm2d       9.110     60
  5  basic_block            8.823     24
  6  rand                   0.283    162
  7  gpu_*                  0.061     26
  8  gpu_uak+               0.051      1
  9  gpu_rightIndex         0.026      1
 10  getWeights             0.023     54

You can simply add AutomatedTestBase.TEST_GPU = true; to any Java test to enable GPU (e.g., GPUFullReuseTest.java). For arbitrary DML scripts, you need to use the -gpu option, as you already found out. Notice the GPU-specific statistics, which are well implemented and helpful. I will find a way to send you my scripts.
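
In sketch form (the test class below is hypothetical; copy the exact scaffolding and runTest arguments from an existing test such as GPUFullReuseTest.java):

import org.junit.Test;
import org.apache.sysds.test.AutomatedTestBase;

public class MyGpuSmokeTest extends AutomatedTestBase {  // hypothetical example test
    @Override
    public void setUp() {
        // register and load the test configuration as in existing tests (omitted here)
    }

    @Test
    public void testWithGpu() {
        AutomatedTestBase.TEST_GPU = true; // DML execution in this test now uses GPU operators
        runTest(true, false, null, -1);    // run through the harness; result comparison follows as usual
    }
}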

I recommend starting with simple scripts, such as the individual layers under nn/layers. Once they run fine, you can run the scripts under nn/examples and nn/networks. Later we can try the scripts I implemented for MEMPHIS, which are more complex and exercise GPU memory management and copy performance. Except for some of the recent scripts, I regularly executed the nn scripts on GPU. Outside of nn, LM and other simpler builtins should run without error.

For the ResNet example above, notice the large cu-libraries init time, which I mentioned before. I am curious whether this issue is gone with the new CUDA:
CUDA/CuLibraries init time: 0.686/669.978 sec.

I agree with you that we need to make the testing framework better for GPU in the near future.

Let me know if these help. I can run some more examples later today.

@ReneEnjilian
Contributor Author

Thanks for some of the clarifications.
Maybe I am missing something, but from what I see on Codecov, almost every GPU/CUDA-related file has 0% coverage. What do you mean by Mark's infrastructure? Is it again the two files under test/gpu/codegen/? Could you please specify which files you used in your example (location and file name)? Also, I could not tell how you executed these scripts. For me, something like systemds Example-ResNet.dml -gpu -stats does not work. What worked for me was

  • java --add-modules=jdk.incubator.vector -Xmx4g -Xms4g -Xmn400m -cp target/systemds-3.4.0-SNAPSHOT.jar:target/lib/* org.apache.sysds.api.DMLScript -f scripts/nn/examples/Example-ResNet.dml -exec singlenode -gpu

From your previous answer, I could not tell whether you executed the scripts via JUnit or the terminal, and which scripts were used. I am just trying to establish a reproducible baseline to ensure that my code indeed produces correct results.

@phaniarnab
Contributor

Thanks for the questions @ReneEnjilian.

The GitHub CI does not run GPU tests automatically, and the code coverage for the GPU backend needs to be generated manually, which we never did.

By Mark's framework, I simply mean the way to enable GPU for any JUnit test. Just add AutomatedTestBase.TEST_GPU = true; to the test class before runTest. That internally enables GPU for that test; it is then up to the SystemDS compiler to leverage the GPU. The advantage of running the JUnit tests on GPU is that they automatically assert correctness against R. Mark used to enable this config before running all JUnit tests, allowing every test to use the GPU.

My grid search example was from GPUFullReuseTest.java under test/functions/lineage. To let the tests use the GPU, simply comment out the checkGPU function along with the @BeforeClass tag. I had to keep the checkGPU function to skip these tests if no GPU is available.
In contrast, the ResNet example was an independent DML script which I ran from the command line. I will share this script with you via other means.

The systemds run script might not work for GPU. What worked for you seems correct; I use a similar call.

Feel free to reach out if you need further help.

@ReneEnjilian
Contributor Author

Okay, I resolved the existing bugs by using the test cases described in GPUFullReuseTest.java. This required some extensive rewriting for some of the kernels. Before these changes, the second and fourth tests in GPUFullReuseTest.java would fail, mainly due to the removed cusparse<t>csrgemm() and cusparse<t>csrgeam() routines.
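
For context, the cusparseSpGEMM that replaces the old csrgemm is a multi-phase pipeline rather than a single call. A condensed sketch of the phase ordering (assuming handle is a cusparseHandle, matA/matB/matC are cusparseSpMatDescr handles created via cusparseCreateCsr, and alpha/beta are host-side Pointers; imports as in the earlier SpMV sketch plus jcuda.Sizeof; error checks omitted, and the actual PR code may differ):

int op  = cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE;
int alg = cusparseSpGEMMAlg.CUSPARSE_SPGEMM_DEFAULT;
cusparseSpGEMMDescr spgemm = new cusparseSpGEMMDescr();
cusparseSpGEMM_createDescr(spgemm);

// Phase 1: work estimation (query the buffer size with a null buffer, allocate, re-run).
long[] size1 = new long[1];
cusparseSpGEMM_workEstimation(handle, op, op, alpha, matA, matB, beta, matC,
    CUDA_R_64F, alg, spgemm, size1, new Pointer());
Pointer buf1 = new Pointer();
JCuda.cudaMalloc(buf1, Math.max(size1[0], 1));
cusparseSpGEMM_workEstimation(handle, op, op, alpha, matA, matB, beta, matC,
    CUDA_R_64F, alg, spgemm, size1, buf1);

// Phase 2: numeric computation, same query/allocate/re-run pattern.
long[] size2 = new long[1];
cusparseSpGEMM_compute(handle, op, op, alpha, matA, matB, beta, matC,
    CUDA_R_64F, alg, spgemm, size2, new Pointer());
Pointer buf2 = new Pointer();
JCuda.cudaMalloc(buf2, Math.max(size2[0], 1));
cusparseSpGEMM_compute(handle, op, op, alpha, matA, matB, beta, matC,
    CUDA_R_64F, alg, spgemm, size2, buf2);

// Phase 3: read the output size, allocate C's CSR arrays, bind them, copy the result.
long[] rowsC = new long[1], colsC = new long[1], nnzC = new long[1];
cusparseSpMatGetSize(matC, rowsC, colsC, nnzC);
Pointer rowPtrC = new Pointer(), colIndC = new Pointer(), valC = new Pointer();
JCuda.cudaMalloc(rowPtrC, (rowsC[0] + 1) * Sizeof.INT);
JCuda.cudaMalloc(colIndC, Math.max(nnzC[0], 1) * Sizeof.INT);
JCuda.cudaMalloc(valC, Math.max(nnzC[0], 1) * Sizeof.DOUBLE);
cusparseCsrSetPointers(matC, rowPtrC, colIndC, valC);
cusparseSpGEMM_copy(handle, op, op, alpha, matA, matB, beta, matC,
    CUDA_R_64F, alg, spgemm);
cusparseSpGEMM_destroyDescr(spgemm);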

@phaniarnab
Contributor

Thanks, @ReneEnjilian, good start. This is precisely why we need testing.
Why were the rewrites necessary? Did the old kernels become obsolete in the updated cuSPARSE version?

This means we require more testing to find and fix the broken kernels. Since you have figured out how to enforce GPU for the JUnit tests, you can try various NN scripts/builtins to trigger all kernels. Finally, we will need to conduct some larger experiments to verify that these extensive kernel rewrites do not introduce performance regressions.

Thanks for your continuous effort.

@ReneEnjilian
Contributor Author

Hi @phaniarnab, I think there is a misunderstanding. By rewriting I don't mean CUDA kernels (defined in SystemDS.cu) but the Java methods in DoublePrecisionCudaSupportFunctions.java, the same ones I changed in the first place; I only had to touch the files with deprecated methods. A good example of what I mean by rewriting: cusparseSpGEMM cannot handle transpositions and needs sorted rows, so I had to find workarounds such as writing my own transpose method for CSR matrices and methods that ensure sorted rows on the GPU (see the sketch below). There are no broken kernels; if you look at my changed files, you will see what I mean. Further, I ran the unit tests that failed here. They both passed on my machine, and they address Spark and CP, not GPU, so I would assume this is an issue with the CI environment?
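
For reference, one standard device-side way to get both (a sketch of the general technique, not necessarily the exact code in this PR): converting CSR to CSC with cusparseCsr2cscEx2 yields the transpose when the CSC arrays are reinterpreted as CSR, and the conversion emits sorted indices as a side effect.

// Transpose an m x n CSR matrix on the GPU: its CSC form, read back as CSR,
// is the n x m transpose. (Imports as in the earlier sketch; error checks omitted.)
static void transposeCsr(cusparseHandle handle, int m, int n, int nnz,
        Pointer csrVal, Pointer csrRowPtr, Pointer csrColInd,
        Pointer cscVal, Pointer cscColPtr, Pointer cscRowInd) {
    long[] bufferSize = new long[1];
    cusparseCsr2cscEx2_bufferSize(handle, m, n, nnz,
        csrVal, csrRowPtr, csrColInd, cscVal, cscColPtr, cscRowInd,
        CUDA_R_64F, cusparseAction.CUSPARSE_ACTION_NUMERIC,
        cusparseIndexBase.CUSPARSE_INDEX_BASE_ZERO,
        cusparseCsr2CscAlg.CUSPARSE_CSR2CSC_ALG1, bufferSize);
    Pointer buffer = new Pointer();
    JCuda.cudaMalloc(buffer, Math.max(bufferSize[0], 1));
    cusparseCsr2cscEx2(handle, m, n, nnz,
        csrVal, csrRowPtr, csrColInd, cscVal, cscColPtr, cscRowInd,
        CUDA_R_64F, cusparseAction.CUSPARSE_ACTION_NUMERIC,
        cusparseIndexBase.CUSPARSE_INDEX_BASE_ZERO,
        cusparseCsr2CscAlg.CUSPARSE_CSR2CSC_ALG1, buffer);
    JCuda.cudaFree(buffer);
}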

@phaniarnab
Contributor

Thanks, @ReneEnjilian. I indeed misunderstood.
Regarding the missing transposition and sorting support, we may need to detect that in the compiler and place transpose and sort operations before cusparseSpGEMM, but that change is for future PRs. The workaround is fine for now.
I hope more testing will uncover similar issues if present.

@ReneEnjilian
Contributor Author

I also ran the tests in LineageTraceGPUTest.java and GPULineageCacheEvictionTest.java, and they passed as well. I like these tests because they (and GPUFullReuseTest.java) have definitely been exercised on the old backend, so they give me a ground truth: if they fail, I know it is due to my changes. The same cannot be said about the other tests. Of course, I also ran other existing tests by adding AutomatedTestBase.TEST_GPU = true; to enforce execution on the GPU. Examples (among others):

  • Conv1DTest
  • Conv2DBackwardDataTest
  • Conv2DBackwardTest
  • Conv2DTest
  • PoolBackwardTest
  • PoolTest
  • ReluBackwardTest

However, there are cases where merely forcing execution onto the GPU results in an error. That does not mean the error is caused by the changes in this pull request; it would have failed before as well. For example, LSTMTest fails because LSTM is a dedicated operator in CP: if you force it onto the GPU, SystemDS does not find gpu_lstm and crashes during the creation of the instruction string. This is one example of many. The GPU backend lags behind other parts of the system, so certain operations will not work, but that was also the case before. To the best of my knowledge, there are no dedicated GPU tests other than:

  • GPUFullReuseTest
  • GPULineageCacheEvictionTest
  • LineageTraceGPUTest
  • CellwiseTmplTest
  • RowAggTmplTest
  • BuiltinUnaryGPUInstructionTest

All those tests pass. Further, I manually varied the sparsity of the matrices in these tests to ensure that every method I rewrote is covered (verified via debugger).
@mboehm7 Could you please guide us on which requirements/expectations need to be fulfilled for this patch to be merged?

@phaniarnab
Contributor

Thanks, @ReneEnjilian, for running those tests.
Can you please run ResNet18 (I sent it via email) and paste the stats output here?

Finally, as Matthias suggested offline, we need some perf tests. It is important to check for perf regressions after a major upgrade, as we did for the Spark upgrade. We can use the scripts from MEMPHIS, for which we have prior results; I will help you with the scripts. It should not take more than a day for all the experiments to run once you have the node set up.

@ReneEnjilian
Contributor Author

Thanks, @phaniarnab, for providing the ResNet script. It executed correctly, and the CuLibraries init time is now much lower than before. Here are the SystemDS statistics:

SystemDS Statistics:
Total elapsed time:		5.130 sec.
Total compilation time:		1.452 sec.
Total execution time:		3.678 sec.
CUDA/CuLibraries init time:	0.394/0.121 sec.
Number of executed GPU inst:	258.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):	0.012(0.012/0.000) / 0.000 / 0.010 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0):	99(99/0/635) / 4 / 734.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):	2.140(0.000/0.000) / 0.001(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):	88(0/0) / 4(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp):	0.000 / 0.096 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp):	0 / 74 / 0.
Cache hits (Mem/Li/WB/FS/HDFS):	170/0/0/0/0.
Cache writes (Li/WB/FS/HDFS):	23/0/0/0.
Cache times (ACQr/m, RLS, EXP):	0.004/0.002/0.004/0.000 sec.
HOP DAGs recompiled (PRED, SB):	0/241.
HOP DAGs recompile time:	0.274 sec.
Functions recompiled:		1.
Functions recompile time:	0.073 sec.
Total JIT compile time:		10.022 sec.
Total JVM GC count:		2.
Total JVM GC time:		0.02 sec.
Heavy hitter instructions:
  #  Instruction          Time(s)  Count
  1  resnet18_forward       3.220      1
  2  basic_block            2.347     24
  3  bn2d_forward           2.241     60
  4  gpu_batch_norm2d       2.158     60
  5  rand                   0.587    162
  6  gpu_conv2d_bias_add    0.214     61
  7  gpu_ba+*               0.068      7
  8  gpu_max                0.059     51
  9  gpu_softmax            0.040      3
 10  gpu_*                  0.028     26

The init times are now given by: CUDA/CuLibraries init time: 0.394/0.121 sec. With your consent, since this is your script, I would like to add it as a unit test for the GPU backend for future debugging purposes. I will also add other tests to ensure good coverage. In the next phase, we can run the perf tests to check whether the changes introduce any performance regressions.

@phaniarnab
Contributor

This looks very good, @ReneEnjilian. Thanks. Feel free to add this script as a test. I will give you some more similar scripts later.

Once you have the scale-up node ready with the intended CUDA versions, I will point you to the scripts and help with running them. We can start with this resnet18 script: increase the number of images to, say, 10k with batch size 32/64 and observe the nvidia-smi output just to get a feeling. Then we will execute the scripts from MEMPHIS with real data.

Good job.
