[ISSUE-815] Generate Random Numbers Asynchronously on the GPU #859
base: SharedDevelopment
…g to #include <cuda_runtime.h>
Minor cleanup.
I have implemented all the changes you requested and renamed the stream used by all the synchronous kernels to simulationStream (simulationStream_ as a member variable). I have also added new documentation to the developer docs and linked it from index.md. The old MersenneTwister files are still there in case anyone wants to try them again, but I can remove them if you'd like.
Closes #815
Description
Replaced the custom Mersenne Twister GPU kernel with an AsyncPhilox_d class that asynchronously fills GPU buffers with random noise using cuRAND's Philox generator. The class supports double-buffering and is designed for concurrent execution.
GPUModel initializes Philox states and fills two initial buffers via loadPhilox() on a member AsyncPhilox_d instance. During each advance() call, requestSegment() retrieves a float* slice from the currently active buffer, sized appropriately for each vertex and ready to be used in advanceVertices().
Once a buffer is consumed, fillBuffer() is triggered on the other buffer while the current one continues to serve slices. This ensures continuous data availability through double-buffering.
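The buffer bookkeeping described above can be sketched on the host as follows. This is only an illustrative model, not the code in this PR: the class name `DoubleBufferedNoise`, its members, and the use of `std::mt19937` with a uniform distribution as a stand-in for the cuRAND Philox fill kernel are all assumptions, and the refill here runs synchronously, whereas the real class launches it asynchronously on its own CUDA stream.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Host-side model of the double-buffering logic (hypothetical names throughout).
// std::mt19937 stands in for the GPU Philox fill kernel.
class DoubleBufferedNoise {
public:
    DoubleBufferedNoise(std::size_t segmentSize, std::size_t segmentsPerBuffer, unsigned seed)
        : segmentSize_(segmentSize),
          segmentsPerBuffer_(segmentsPerBuffer),
          rng_(seed),
          dist_(0.0f, 1.0f) {
        for (auto& buf : buffers_) {
            buf.resize(segmentSize_ * segmentsPerBuffer_);
        }
        // Both buffers are filled up front, mirroring the two initial fills.
        fillBuffer(0);
        fillBuffer(1);
    }

    // Return the next slice of the active buffer. When the active buffer is
    // exhausted, switch to the other buffer and refill the one just consumed.
    const float* requestSegment() {
        if (nextSegment_ == segmentsPerBuffer_) {
            const std::size_t consumed = active_;
            active_ = 1 - active_;
            nextSegment_ = 0;
            fillBuffer(consumed);  // asynchronous in the GPU version
        }
        return buffers_[active_].data() + segmentSize_ * nextSegment_++;
    }

    std::size_t activeBuffer() const { return active_; }
    std::size_t fillCount() const { return fillCount_; }

private:
    // Overwrite one whole buffer with fresh random values.
    void fillBuffer(std::size_t index) {
        for (float& value : buffers_[index]) {
            value = dist_(rng_);
        }
        ++fillCount_;
    }

    std::size_t segmentSize_;
    std::size_t segmentsPerBuffer_;
    std::mt19937 rng_;
    std::uniform_real_distribution<float> dist_;
    std::array<std::vector<float>, 2> buffers_;
    std::size_t active_ = 0;
    std::size_t nextSegment_ = 0;
    std::size_t fillCount_ = 0;
};
```

In this model a slice handed out from a buffer stays valid only until that buffer is refilled. A GPU version would additionally need to make sure the consumer has finished reading the old slices before the refill kernel overwrites them, for example with events or stream synchronization.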
AsyncPhilox_d uses its own internal CUDA stream to launch fill kernels asynchronously. To enable true concurrency, all other compute kernels also had to move to non-default streams. This is necessary because the legacy default stream (stream 0) implicitly synchronizes with all other blocking streams, so any kernel launched on it is serialized against the fill kernels even when they could run in parallel.
Checklist (Mandatory for new features)
Testing (Mandatory for all changes)
test-medium-connected.xml: Passed
test-large-long.xml: Passed