Build and test with CUDA 12.9.0 #18721

Open: bdice wants to merge 8 commits into base branch-25.06

bdice
Contributor

@bdice bdice commented May 8, 2025

This PR uses CUDA 12.9.0 to build and test.

xref: rapidsai/build-planning#173

@bdice bdice requested review from a team as code owners May 8, 2025 15:06
@bdice bdice requested a review from msarahan May 8, 2025 15:06
@bdice bdice added labels May 8, 2025: non-breaking (Non-breaking change); improvement (Improvement / enhancement to an existing function)
@github-actions github-actions bot added labels May 8, 2025: Python (Affects Python cuDF API.); cudf.pandas (Issues specific to cudf.pandas)
@jakirkham
Member

jakirkham commented May 8, 2025

Thanks Bradley! 🙏

Do we want to update the Spark image to CUDA 12.9 as well (cc @sameerz)?

image: rapidsai/ci-spark-rapids-jni:rockylinux8-cuda12.8.0

Also it looks like pre-commit got stuck on git checkout. Is there an easy way to restart it?

Edit: We noted offline that there was a GitHub incident today. That likely caused the pre-commit issue, which has since cleared up.

@jakirkham
Member

Seeing the following build error in CI

 │ │ $BUILD_PREFIX/bin/../lib/gcc/x86_64-conda-linux-gnu/13.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lnvToolsExt: No such file or directory
 │ │ collect2: error: ld returned 1 exit status
 │ │ [3/8] Building CUDA object CMakeFiles/custom_p

The nvToolsExt library was dropped in CUDA 12.9.

That said, I think we are only using the NVTX headers (like nvtx3/nvToolsExt.h), so we don't need to link the library and can remove it from the link step.

I think lines like this...

target_link_libraries(parquet_io PRIVATE cudf::cudf nvToolsExt $<TARGET_OBJECTS:parquet_io_utils>)

...can be replaced with...

nvtx3::nvtx3-cpp

...as is done here

PRIVATE $<BUILD_LOCAL_INTERFACE:nvtx3::nvtx3-cpp> cuco::cuco ZLIB::ZLIB nvcomp::nvcomp
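
As a minimal sketch of that replacement, the parquet_io line quoted above might become the following. It assumes the nvtx3::nvtx3-cpp target is already defined in the CMake scope where parquet_io is configured (that setup is not shown in this thread), and it leaves open whether the BUILD_LOCAL_INTERFACE wrapper from the second quoted line is also wanted here.

```cmake
# Before (fails to link with CUDA 12.9, which no longer ships the nvToolsExt library):
#   target_link_libraries(parquet_io PRIVATE cudf::cudf nvToolsExt $<TARGET_OBJECTS:parquet_io_utils>)

# After: link the NVTX3 CMake target instead of the removed library.
# Assumption: nvtx3::nvtx3-cpp has already been made available to this project.
target_link_libraries(
  parquet_io PRIVATE cudf::cudf nvtx3::nvtx3-cpp $<TARGET_OBJECTS:parquet_io_utils>
)
```

If the NVTX headers already reach the target through its other dependencies, simply deleting nvToolsExt from the line may be enough; that has not been verified here.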

@jakirkham
Member

After we discussed this offline, the Spark team filed a CUDA 12.9 update issue: NVIDIA/spark-rapids#12679

@bdice bdice requested a review from a team as a code owner May 8, 2025 18:12
@github-actions github-actions bot added labels May 8, 2025: libcudf (Affects libcudf (C++/CUDA) code.); CMake (CMake build issue)
@jakirkham
Member

Seeing the following test failure in conda C++ tests:

[  FAILED  ] 3 tests, listed below:
[  FAILED  ] CollectTestFixedWidth/22.CollectSet, where TypeParam = numeric::fixed_point<__int128,(numeric::Radix)10>
[  FAILED  ] ReductionHistogramTest/8.Histogram, where TypeParam = numeric::fixed_point<__int128,(numeric::Radix)10>
[  FAILED  ] ReductionHistogramTest/8.MergeHistogram, where TypeParam = numeric::fixed_point<__int128,(numeric::Radix)10>

 3 FAILED TESTS
CMake Error at run_gpu_test.cmake:35 (execute_process):
  execute_process failed command indexes:

    1: "Child return code: 1"

@jakirkham
Member

Also seeing the following test failure in conda and wheel Python tests on CI:

FAILED tests/test_dataframe.py::test_decimal_quantile[Decimal128Dtype-higher-1] - AssertionError: DataFrame.iloc[:, 1] (column name="val") are different

DataFrame.iloc[:, 1] (column name="val") values are different (100.0 %)
[index]: [1.0]
[left]:  [98.14]
[right]: [453.23]
At positional index 0, first diff: 98.14 != 453.23
FAILED tests/test_dataframe.py::test_decimal_quantile[Decimal128Dtype-higher-q3] - AssertionError: DataFrame.iloc[:, 1] (column name="val") are different

@@ -7,7 +7,7 @@ jobs:
   spark-rapids-jni-build:
     runs-on: linux-amd64-cpu8
     container:
-      image: rapidsai/ci-spark-rapids-jni:rockylinux8-cuda12.8.0
+      image: rapidsai/ci-spark-rapids-jni:rockylinux8-cuda12.9.0
Member

@pxLi, I have bumped the Spark RAPIDS CI image to CUDA 12.9.0 above.

Is there anything else we need to do?

@wence-
Contributor

wence- commented May 9, 2025

These failures are all with 128-bit decimals, which smells like a miscompilation somewhere.

Labels
CMake (CMake build issue); cudf.pandas (Issues specific to cudf.pandas); improvement (Improvement / enhancement to an existing function); libcudf (Affects libcudf (C++/CUDA) code.); non-breaking (Non-breaking change); Python (Affects Python cuDF API.)
3 participants