
cuda.parallel: Check compiled code for LDL/STL instructions in tests #4472


Merged: 2 commits merged into NVIDIA:main on Apr 17, 2025


@shwina (Contributor) commented Apr 16, 2025

Description

This PR adds test infrastructure that checks compiled code for LDL/STL instructions (loads from and stores to per-thread local memory) while the pytest suite runs, similar to the existing check in c.parallel.
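For illustration, a minimal checker might disassemble a cubin and scan the resulting SASS for the local-memory opcodes. This is a sketch under stated assumptions, not the PR's or c.parallel's actual implementation: it assumes `nvdisasm` from the CUDA Toolkit is on PATH, and the helper names `sass_from_cubin` and `check_no_local_memory` are hypothetical. LDL/STL typically show up when registers spill or a per-thread array cannot be kept in registers.

```python
import os
import re
import subprocess
import tempfile

# Match the LDL/STL mnemonics as whole words, with optional dotted
# modifiers such as .64 (e.g. "LDL.64 R2, [R1+0x8] ;").
_LOCAL_MEMORY_OP = re.compile(r"\b(LDL|STL)(?:\.[A-Z0-9]+)*\b")


def sass_from_cubin(cubin: bytes) -> str:
    """Disassemble a cubin with nvdisasm (assumed to be on PATH)."""
    with tempfile.NamedTemporaryFile(suffix=".cubin", delete=False) as f:
        f.write(cubin)
        path = f.name
    try:
        result = subprocess.run(
            ["nvdisasm", path], capture_output=True, text=True, check=True
        )
        return result.stdout
    finally:
        os.unlink(path)


def check_no_local_memory(sass: str) -> None:
    """Raise AssertionError if the SASS contains a local-memory load or store."""
    match = _LOCAL_MEMORY_OP.search(sass)
    if match:
        raise AssertionError(f"{match.group(1)} instruction found in SASS")
```

In a test, `check_no_local_memory(sass_from_cubin(cubin))` would produce the same style of message as the failures listed below.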

With the check enabled, 31 tests fail:

FAILED tests/test_merge_sort.py::test_merge_sort_keys_complex - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_cache_modified_input_it[False-int16] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_constant_it[False-int16] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_constant_it[True-int16] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_counting_it[True-int16] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_complex_device_reduce - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_counting_it[False-int16] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_map_mul2_count_it[True-value_type_name_pair0] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_cache_modified_input_it[True-int16] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce.py::test_device_sum_map_mul2_count_it[False-value_type_name_pair0] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int8-u8] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int16-i4] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int16-u4] - AssertionError: LDL instruction found in SASS
FAILED tests/test_scan.py::test_scan_array_input[int16-True] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int16-i8] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int16-u8] - AssertionError: LDL instruction found in SASS
FAILED tests/test_scan.py::test_scan_array_input[int16-False] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int8-i4] - AssertionError: LDL instruction found in SASS
FAILED tests/test_reduce_api.py::test_reduce_struct_type_minmax - AssertionError: LDL instruction found in SASS
FAILED tests/test_scan.py::test_scan_array_input[complex64-False] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int8-u4] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[int8-i8] - AssertionError: LDL instruction found in SASS
FAILED tests/test_scan.py::test_scan_array_input[complex128-True] - AssertionError: LDL instruction found in SASS
FAILED tests/test_scan.py::test_scan_array_input[complex128-False] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[complex128-i4] - AssertionError: LDL instruction found in SASS
FAILED tests/test_transform.py::test_binary_transform[complex128] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[complex128-u4] - AssertionError: LDL instruction found in SASS
FAILED tests/test_transform.py::test_unary_transform[complex128] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[complex128-i8] - AssertionError: LDL instruction found in SASS
FAILED tests/test_segmented_reduce.py::test_segmented_reduce[complex128-u8] - AssertionError: LDL instruction found in SASS
FAILED tests/test_unique_by_key.py::test_unique_by_key_complex - AssertionError: LDL instruction found in SASS
======================================================================== 31 failed, 787 passed, 4 skipped in 56.06s ========================================================================

These failures are addressed in the follow-up PR #4249. Until that lands, automatic LDL/STL checking is turned off in this PR.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina requested reviews from a team (as code owner) and from kkraus14, April 16, 2025 15:47
@shwina self-assigned this Apr 16, 2025
Comment on lines +1075 to +1076
def _get_cubin(self):
    return self.build_data.cubin[:self.build_data.cubin_size]
Contributor Author


@oleksandr-pavlyk I'm curious whether you have any suggestions for making this a bit more DRY. Currently I define this method in every *BuildResult cdef class.

Contributor Author


Inheritance doesn't quite work, because the build_data member is a different C struct type in each class.

Contributor


I'm afraid I don't see any alternative to defining this method on every *BuildResult class. If build_data had a uniform layout, for example if `const char* cubin` and `size_t cubin_size` were the first two members of every build_result C struct, we could at least write a helper that takes a pointer to that common base struct and produces a memoryview, and call it from each _get_cubin definition. That would not save much, though.
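To make the shared-prefix idea concrete, here is a minimal sketch that uses Python's ctypes in place of Cython. Everything in it is hypothetical: the struct names (`CubinPrefix`, `ReduceBuildData`, `ScanBuildData`) and extra fields are invented, and it assumes every build result begins with `cubin` and `cubin_size`, which is exactly the uniformity the comment notes is currently missing. Because structs that begin with the same members lay those members out at the same offsets, one helper can read the cubin from any of them.

```python
import ctypes


# Hypothetical common prefix: const char* cubin; size_t cubin_size;
class CubinPrefix(ctypes.Structure):
    _fields_ = [("cubin", ctypes.c_void_p), ("cubin_size", ctypes.c_size_t)]


# Hypothetical per-algorithm structs that start with the same two members.
class ReduceBuildData(ctypes.Structure):
    _fields_ = CubinPrefix._fields_ + [("reduction_kernel", ctypes.c_void_p)]


class ScanBuildData(ctypes.Structure):
    _fields_ = CubinPrefix._fields_ + [
        ("init_kernel", ctypes.c_void_p),
        ("scan_kernel", ctypes.c_void_p),
    ]


def get_cubin(build_data: ctypes.Structure) -> bytes:
    """Shared helper: reinterpret any prefix-compatible struct as CubinPrefix."""
    ptr = ctypes.cast(ctypes.pointer(build_data), ctypes.POINTER(CubinPrefix))
    prefix = ptr.contents
    # string_at copies exactly cubin_size bytes, so embedded NULs are kept.
    return ctypes.string_at(prefix.cubin, prefix.cubin_size)


# Usage: both structs point at the same fake cubin bytes.
payload = b"\x7fELF\x00fake-cubin"
storage = (ctypes.c_char * len(payload)).from_buffer_copy(payload)
reduce_data = ReduceBuildData(ctypes.addressof(storage), len(payload), None)
scan_data = ScanBuildData(ctypes.addressof(storage), len(payload), None, None)
```

Each `_get_cubin` would then shrink to a one-line call into the helper, which, as noted above, saves little over the existing one-line slice.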

Contributor Author


Okay, thanks for double-checking. Let's leave it as is for now. Thanks!

monkeysession.setattr(
    cuda.parallel.experimental._cccl_interop,
    "_check_sass",
    False,  # todo: change to True
)
Contributor Author


Note to reviewers: changing this to True is what ensures that compile results are checked for LDL/STL instructions.
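The sketch below illustrates why patching a module attribute is enough to switch such a check on, assuming the build code reads `_check_sass` from the module at call time. It uses the standard library's `unittest.mock.patch.object`, which, like pytest's `monkeypatch.setattr`, restores the original value afterwards (the `monkeysession` fixture above appears to be a session-scoped variant of `monkeypatch`). The module object, the `build` function, and the simplified opcode match are hypothetical stand-ins, not the real `_cccl_interop` code.

```python
import types
from unittest import mock

# Hypothetical stand-in for cuda.parallel.experimental._cccl_interop.
interop = types.ModuleType("_cccl_interop")
interop._check_sass = False  # disabled by default, as in this PR


def build(sass: str) -> str:
    """Hypothetical build step that optionally rejects local-memory ops."""
    # Read the flag through the module at call time; a copy bound at import
    # time (e.g. `from _cccl_interop import _check_sass`) would ignore patches.
    if interop._check_sass:
        for token in sass.split():
            opcode = token.split(".")[0]  # "LDL.64" -> "LDL"
            if opcode in ("LDL", "STL"):
                raise AssertionError(f"{opcode} instruction found in SASS")
    return sass


spilling_sass = "/*0050*/ LDL.64 R2, [R1+0x8] ;"
```

With the flag left at `False`, `build(spilling_sass)` returns normally; inside `mock.patch.object(interop, "_check_sass", True)` the same call raises, and the flag reverts when the context exits.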

Contributor

🟩 CI finished in 26m 57s: Pass: 100%/3 | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
  • 🟩 python: Pass: 100%/3 | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
    🟩 ctk
      🟩 12.8               Pass: 100%/3   | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/3   | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/3   | Total: 30m 30s | Avg: 10m 10s | Max: 21m 06s
    🟩 jobs
      🟩 cuda.cccl          Pass: 100%/1   | Total:  3m 11s | Avg:  3m 11s | Max:  3m 11s
      🟩 cuda.cooperative   Pass: 100%/1   | Total: 21m 06s | Avg: 21m 06s | Max: 21m 06s
      🟩 cuda.parallel      Pass: 100%/1   | Total:  6m 13s | Avg:  6m 13s | Max:  6m 13s
    

👃 Inspect Changes

Modifications in project? python only (+/-); no changes to CCCL Infrastructure, libcu++, CUB, Thrust, CUDA Experimental, stdpar, CCCL C Parallel Library, or Catch2Helper.

Modifications in project or dependencies? python only (+/-).

🏃 Runner counts (total jobs: 3): 3 × linux-amd64-gpu-rtx2080-latest-1

Contributor

🟩 CI finished in 22m 47s: Pass: 100%/3 | Total: 25m 37s | Avg: 8m 32s | Max: 16m 39s
  • 🟩 python: Pass: 100%/3 | Total: 25m 37s | Avg: 8m 32s | Max: 16m 39s

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 25m 37s | Avg:  8m 32s | Max: 16m 39s
    🟩 ctk
      🟩 12.8               Pass: 100%/3   | Total: 25m 37s | Avg:  8m 32s | Max: 16m 39s
    🟩 cudacxx
      🟩 nvcc12.8           Pass: 100%/3   | Total: 25m 37s | Avg:  8m 32s | Max: 16m 39s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 25m 37s | Avg:  8m 32s | Max: 16m 39s
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 25m 37s | Avg:  8m 32s | Max: 16m 39s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 25m 37s | Avg:  8m 32s | Max: 16m 39s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/3   | Total: 25m 37s | Avg:  8m 32s | Max: 16m 39s
    🟩 jobs
      🟩 cuda.cccl          Pass: 100%/1   | Total:  2m 46s | Avg:  2m 46s | Max:  2m 46s
      🟩 cuda.cooperative   Pass: 100%/1   | Total: 16m 39s | Avg: 16m 39s | Max: 16m 39s
      🟩 cuda.parallel      Pass: 100%/1   | Total:  6m 12s | Avg:  6m 12s | Max:  6m 12s
    


@oleksandr-pavlyk (Contributor) left a comment:

LGTM. Thanks @shwina

@shwina merged commit 71097d1 into NVIDIA:main on Apr 17, 2025 (22 of 23 checks passed).

2 participants