[SYSTEMDS-??] Upgrade GPU backend to latest NVIDIA software stack #2271
Conversation
This one is cool. Thanks for working on this! 🙌 Mistakenly closed the PR while giving run permission to CI. Sorry!
Codecov Report. Attention: patch coverage for the new lines is low; see the diff below.
Additional details and impacted files: @@ Coverage Diff @@
## main #2271 +/- ##
============================================
- Coverage 72.94% 72.77% -0.18%
- Complexity 46070 46073 +3
============================================
Files 1479 1479
Lines 172616 173031 +415
Branches 33783 33855 +72
============================================
+ Hits 125922 125925 +3
- Misses 37202 37611 +409
- Partials 9492 9495 +3
Thanks, @ReneEnjilian.
Hi @phaniarnab,
I hope this clarifies why there cannot be any deprecated calls anymore.
Thanks for clarifying, @ReneEnjilian. This is good. 👍🏽
All of these should be fairly easy to set up and run. You of course don't need to fix any unrelated bugs as part of this PR.
I think this approach is problematic. Many of these scripts appeared to be buggy when testing, and I doubt that they were written and tested with GPUs in mind. When errors occur, it is unclear whether they stem from my changes or were there in the first place. The entire backend (excluding the kernels implemented in SystemDS.cu) is completely untested; literally 0% test coverage. I will try to add test coverage to the backend (specifically the parts that rely on cuDNN, cuSPARSE, cuBLAS). For now, could you please provide an nn script that definitely worked with the old setting (CUDA 10.2)? It would also be great if you could tell me exactly how you executed the program, for example whether you used "-gpu" or "-gpu force".
Sure, I will provide some tests, @ReneEnjilian. The GPU backend is well tested. Mark added an infrastructure to run all tests with GPU, and I later kept using that for my GPU tests, which cover the cuDNN and cuSPARSE libraries and fused operators like conv2d-bias-add. Find two examples from yesterday's run on my laptop. Grid search hyperparameter tuning for LM (no cuDNN):
ResNet18 (w/ cuDNN):
You can simply add the GPU flag. I recommend starting with simple scripts, such as the individual layers under nn/layers. Once they run fine, you can run the scripts under nn/examples and nn/networks. Later we can try the scripts I implemented for MEMPHIS, which are more complex and test GPU memory management and copy performance. Except for some of the recent scripts, I have regularly executed the nn scripts on GPU. Outside of nn, LM and other simpler builtins should run without error. For the ResNet example above, notice the large cu libraries init time, which I mentioned before. I am curious whether this issue is gone with the new CUDA. I agree with you that we need to make the testing framework better for GPU in the near future. Let me know if these help. I can run some more examples later today.
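As a point of reference, here is a minimal, hypothetical sketch of driving such a run from Java instead of the terminal. The script path is an assumption; the flags ("-f", "-gpu", optionally "force", and "-stats") are the ones discussed in this thread.

```java
// Hypothetical sketch: launching a DML script on GPU through the SystemDS
// command-line entry point. The script path is an assumption; "-gpu" enables
// GPU execution ("-gpu force" forces it) and "-stats" prints runtime statistics.
import org.apache.sysds.api.DMLScript;

public class RunNnScriptOnGpu {
  public static void main(String[] args) throws Exception {
    DMLScript.main(new String[] {
      "-f", "scripts/nn/examples/mnist_lenet.dml", // assumed example script
      "-gpu",                                      // or "-gpu", "force"
      "-stats"
    });
  }
}
```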
Thanks for some of the clarifications.
From your previous answer, I could not understand how you executed the scripts (JUnit or terminal) and which scripts were used. I am just trying to establish a reproducible baseline to ensure that my code is indeed producing correct results.
Thanks for the questions, @ReneEnjilian. The GitHub tests do not run GPU tests automatically, and the code coverage for the GPU backend needs to be generated manually, which we never did. By Mark's framework, I simply mean the way to use GPU for any JUnit test: just add the GPU flag there as well. My grid search example came from one of the existing scripts. Feel free to reach out if you need further help.
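For illustration only, here is a bare-bones JUnit sketch of the idea of enforcing GPU for a test. It is not the project's own test-framework mechanism referred to above (Mark's infrastructure), just a stand-in that invokes the SystemDS entry point directly; the script path and class names are hypothetical.

```java
// Stand-in sketch: run the same small script once on CPU and once with
// "-gpu force". Script path and class names are hypothetical.
import org.apache.sysds.api.DMLScript;
import org.junit.Test;

public class GpuSmokeTest {
  private static void run(String script, String... gpuArgs) throws Exception {
    String[] base = {"-f", script, "-stats"};
    String[] all = new String[base.length + gpuArgs.length];
    System.arraycopy(base, 0, all, 0, base.length);
    System.arraycopy(gpuArgs, 0, all, base.length, gpuArgs.length);
    DMLScript.main(all);
  }

  @Test
  public void runOnCpu() throws Exception {
    run("src/test/scripts/relu_example.dml");                  // hypothetical script
  }

  @Test
  public void runOnGpuForced() throws Exception {
    run("src/test/scripts/relu_example.dml", "-gpu", "force"); // same script, GPU enforced
  }
}
```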
Okay, I resolved existing bugs by using the test cases described above.
Thanks, @ReneEnjilian, good start. This is precisely why we need testing, and it means we require more testing to find and fix the broken kernels. As you have figured out how to enforce GPU in the JUnit tests, you can try various nn scripts and builtins to trigger all kernels. Finally, we will need to conduct some larger experiments to verify that these extensive kernel rewrites are not creating performance regressions. Thanks for your continuous effort.
Hi @phaniarnab, I think there is a misunderstanding. By rewriting I don't mean the CUDA kernels (defined in SystemDS.cu) but rather the Java methods of the GPU backend.
Thanks, @ReneEnjilian. I indeed misunderstood. |
I also ran the tests you suggested.
However, there will be cases where simply enforcing execution on the GPU results in an error. This does not mean that the error is due to the changes in this pull request; it would also have failed before.
All those tests pass. Further, I manually changed the sparsity of the matrices in these tests to ensure that every method I rewrote is covered (verified via debugger).
Thanks, @ReneEnjilian, for running those tests. Finally, as Matthias suggested offline, we need some perf tests. It is important to check for perf regressions after a major upgrade, as we did for the Spark upgrade. We can use the scripts from MEMPHIS, for which we have the prior results. I will help you with the scripts. It should not take more than a day for all the experiments to run once you have the node set up.
Thanks, @phaniarnab, for providing the ResNet script. It executed correctly, and the CuLibraries init time is now also much lower than before. Here are the SystemDS statistics:
The init times are now given by:
This looks very good, @ReneEnjilian. Thanks. Feel free to add this script as a test. I will give you some more similar scripts later. Once you have the scale-up node ready with the intended CUDA versions, I will point you to the scripts and help with running them. We can start with this ResNet18 script: increase the number of images to, say, 10k with batch size 32/64 and observe the nvidia-smi output just to get a feeling. Then we will execute the scripts from MEMPHIS with real data. Good job.
Purpose
Upgrade the GPU backend to the latest NVIDIA software stack:
What Changed
1. Dependencies:
2. Code:
Notes
Next Steps
We still rely on JCuda as the Java–CUDA bridge. Now that the backend runs on the newest NVIDIA toolchain, the logical next phase is to replace JCuda with our own JNI bindings. Because JCuda itself is only a thin JNI wrapper, the migration is conceptually straightforward—essentially re-exposing the functions we actually use—but it will take time to:
Generate a minimal JNI layer for the required CUDA / cuSPARSE / cuBLAS / cuDNN calls (a rough sketch follows below).
Add build scripts to compile the native library on Linux and Windows.
Incrementally swap JCuda classes for the new binding with tests at every step.
The payoff is full control over versioning and fewer external dependencies once the work is complete.
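To make the scope concrete, here is a rough sketch of the Java side of such a hand-rolled binding. All names (the class, the native methods, the "systemds_cuda" library) are illustrative assumptions, not a committed design; each native method would be backed by a small C/C++ function that forwards to the corresponding CUDA, cuBLAS, or cuDNN routine.

```java
// Hypothetical sketch of a minimal hand-rolled JNI binding that could replace
// the corresponding JCuda calls. All names are illustrative assumptions.
public final class CudaNative {

  static {
    // Native library built separately (e.g., via CMake) for Linux and Windows.
    System.loadLibrary("systemds_cuda");
  }

  private CudaNative() {}

  // Device memory management; device pointers are passed around as longs.
  public static native long cudaMalloc(long sizeInBytes);
  public static native void cudaFree(long devicePtr);
  public static native void copyHostToDevice(long devicePtr, double[] src);

  // Example cuBLAS-style wrapper (column-major GEMM): C = alpha*A*B + beta*C.
  public static native void dgemm(long cublasHandle, int m, int n, int k,
      double alpha, long devA, long devB, double beta, long devC);
}
```

Passing device pointers as plain longs keeps the Java surface small and avoids re-creating JCuda's pointer wrapper classes; whether that trade-off is acceptable is part of the design work for this phase.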