[HIP][device] 4 __shfl_sync functions are missing #1491

Closed
Kaveh01 opened this issue Oct 1, 2019 · 19 comments

Kaveh01 commented Oct 1, 2019

CUDA 9's __shfl_sync function is missing. I can use the deprecated __shfl, but it would be better to have the new function.
Test code:

#include <cstdio>

__global__
static void shflTest(int lid){
    int tid = threadIdx.x;
    float value = tid + 0.1f;
    int* ivalue = reinterpret_cast<int*>(&value);

    //use the integer shfl
    int ix = __shfl(ivalue[0],5,32);
    int iy = __shfl_sync(0xFFFFFFFF, ivalue[0],5,32);

    float x = reinterpret_cast<float*>(&ix)[0];
    float y = reinterpret_cast<float*>(&iy)[0];

    if(tid == lid){
        printf("shfl tmp %d %d\n",ix,iy);
        printf("shfl final %f %f\n",x,y);
    }
}

int main()
{
    shflTest<<<1,32>>>(0);
    cudaDeviceSynchronize();
    return 0;
}
emankov added the hip label Oct 1, 2019

emankov (Contributor) commented Oct 1, 2019

__shfl_up_sync, __shfl_down_sync, and __shfl_xor_sync as well.

emankov changed the title from "__shfl_sync is missing." to "[HIP][device] 4 __shfl_sync functions are missing" Oct 1, 2019

b-sumner (Contributor) commented Oct 1, 2019

We have some work left in the device compiler to support certain CUDA 9 device-side features such as the sync APIs. Also note that most AMD devices have a "warp size" of 64, so any code using a 32-bit mask is already broken.
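
For example, a minimal probe kernel (a sketch, not part of this issue's repro; it assumes the HIP headers are available) shows why a hardcoded 0xFFFFFFFF mask cannot describe a full AMD wavefront:

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void printWarpSize() {
    if (threadIdx.x == 0) {
        // warpSize is 64 on most AMD GCN/CDNA GPUs (32 on NVIDIA, and on
        // some RDNA configurations), so a full-wave mask needs 64 bits there.
        printf("warpSize = %d\n", warpSize);
    }
}

int main() {
    printWarpSize<<<1, 64>>>();
    hipDeviceSynchronize();
    return 0;
}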

@gmarkomanolis

Hi, I was trying to hipify a code that has a few calls to __shfl_down_sync. __shfl_down is deprecated, so it cannot be used with CUDA 11. What would be the best approach?

acowley (Contributor) commented Feb 13, 2021

@gmarkomanolis What I do when using hipify-perl as part of a build process is include a construction like,

#ifdef __HIP_PLATFORM_HCC__
#define SHFL_DOWN(val, offset) __shfl_down(val, offset)
#else
#define SHFL_DOWN(val, offset) __shfl_down_sync(0xffffffff, val, offset)
#endif

The specific constant I'm using there (__HIP_PLATFORM_HCC__) is old, so a newer one would be better.
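
A hedged update of the same guard, assuming the newer __HIP_PLATFORM_AMD__ macro that superseded __HIP_PLATFORM_HCC__ in later ROCm releases:

#if defined(__HIP_PLATFORM_AMD__) || defined(__HIP_PLATFORM_HCC__)
// On AMD, the non-sync builtin already operates on the whole wavefront.
#define SHFL_DOWN(val, offset) __shfl_down(val, offset)
#else
#define SHFL_DOWN(val, offset) __shfl_down_sync(0xffffffff, val, offset)
#endif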

emankov (Contributor) commented Feb 13, 2021

> Hi, I was trying to hipify a code that has a few calls to __shfl_down_sync. __shfl_down is deprecated, so it cannot be used with CUDA 11. What would be the best approach?

__shfl_down is deprecated since CUDA 9.0, but it is not removed and still can be used even by CUDA 11.2.1.

emankov (Contributor) commented Feb 13, 2021

> The specific constant I'm using there (__HIP_PLATFORM_HCC__) is old, so a newer one would be better.

What do you mean by old?

acowley (Contributor) commented Feb 13, 2021

I think mentions of HCC are being removed over time.

@gmarkomanolis

> Hi, I was trying to hipify a code that has a few calls to __shfl_down_sync. __shfl_down is deprecated, so it cannot be used with CUDA 11. What would be the best approach?
>
> __shfl_down is deprecated since CUDA 9.0, but it is not removed and still can be used even by CUDA 11.2.1.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

> Deprecation Notice: __shfl, __shfl_up, __shfl_down, and __shfl_xor have been deprecated in CUDA 9.0 for all devices.
>
> Removal Notice: When targeting devices with compute capability 7.x or higher, __shfl, __shfl_up, __shfl_down, and __shfl_xor are no longer available and their sync variants should be used instead.

I will check though if it is on the code's side as it is not mine. Thanks for the answer.

@gmarkomanolis

> @gmarkomanolis What I do when using hipify-perl as part of a build process is include a construction like,
>
> #ifdef __HIP_PLATFORM_HCC__
> #define SHFL_DOWN(val, offset) __shfl_down(val, offset)
> #else
> #define SHFL_DOWN(val, offset) __shfl_down_sync(0xffffffff, val, offset)
> #endif
>
> The specific constant I'm using there (__HIP_PLATFORM_HCC__) is old, so a newer one would be better.

Thanks a lot.

jammm commented Jul 14, 2021

Hey @emankov, any update on __shfl_sync? It would be great to have this implemented, I think.

leachim commented Sep 23, 2022

Any update on this? I am specifically looking for a solution for __shfl_sync.

jammm commented Sep 23, 2022

If your code uses a mask of 0xffffffff, then you can just replace your _sync calls with the non-sync ones and it should work fine.
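
For example, a compatibility shim along these lines (illustrative only, not part of HIP; it assumes the full-mask case described above and targets HIP's non-sync builtins):

// Illustrative wrappers: forward full-mask _sync calls to HIP's non-sync
// builtins, which already operate on the whole wavefront.
template <typename T>
__device__ T shfl_sync_compat(unsigned long long /*mask*/, T var, int srcLane) {
    // Only valid when the mask covers every lane (e.g. 0xffffffff on a
    // 32-wide warp); partial masks have no direct non-sync equivalent.
    return __shfl(var, srcLane);
}

template <typename T>
__device__ T shfl_down_sync_compat(unsigned long long /*mask*/, T var, unsigned int delta) {
    return __shfl_down(var, delta);
}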

wanghan-iapcm pushed a commit to deepmodeling/deepmd-kit that referenced this issue Sep 22, 2023
Merge `source/lib/src/cuda` and `source/lib/src/rocm` into `source/lib/src/gpu`.

- Define macros `gpuGetLastError`, `gpuDeviceSynchronize`, `gpuMemcpy`, `gpuMemcpyDeviceToHost`, `gpuMemcpyHostToDevice`, and `gpuMemset` to make them available for both CUDA and ROCm.
- Use `<<< >>>` syntax for both CUDA and ROCm. Per ROCm/hip@cf78d85, it has been supported in HIP since 2018.
- Fix several int const numbers that should be double or float.
- For tabulate:
  - Fix `WARP_SIZE` for ROCm. Per pytorch/pytorch#64302, WARP_SIZE can be 32 or 64, so it should not be hardcoded to 64.
  - Add `GpuShuffleSync`. Per ROCm/hip#1491, `__shfl_sync` is not supported by HIP.
  - After merging the code, #1274 should also work for ROCm.
- Use the same `ii` for #830 and #2357. Although both of them work, `ii` has different meanings in these two PRs, but now it should be the same.
- However, `ii` in `tabulate_fusion_se_a_fifth_order_polynomial` (ROCm) added by #2532 is wrong. After merging the codes, it should be corrected.
  - Optimization in #830 was not applied to ROCm.
  - `__syncwarp` is not supported by ROCm.
- After merging the code, #2661 will be applied to ROCm. Although the TF ROCm stream is still blocking (https://github.com/tensorflow/tensorflow/blob/9d1262082e761cd85d6726bcbdfdef331d6d72c6/tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc#L566), we don't know whether it will change to non-blocking.
- There are several other differences between CUDA and ROCm.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
@ppanchad-amd

@Kaveh01 Apologies for the lack of response. Can you please test with the latest ROCm 6.1.0 (HIP 6.1)? If resolved, please close the ticket. Thanks!

@lahwaacz

@ppanchad-amd You could have just said that the _sync functions were added to the C++ kernel language in some ROCm/HIP version 🤷

@Vishal-S-P

I am using ROCm 6.1.3, yet I still keep getting this issue: "error: use of undeclared identifier '__shfl_down_sync'"

b-sumner (Contributor) commented Jul 2, 2024

The *_sync functions are not available in 6.1; see, e.g., https://github.com/ROCm/clr/tree/rocm-6.1.x/hipamd/include/hip/amd_detail. The develop branch has an implementation which may appear in a future release.
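
Until then, a hedged compile-time check (a sketch: HIP_VERSION_MAJOR/HIP_VERSION_MINOR come from HIP's hip/hip_version.h header; the 6.2 cutoff is an assumption based on the later comments in this thread):

#include <hip/hip_version.h>

// Sketch: assume the *_sync builtins exist only from ROCm/HIP 6.2 onward
// (per the discussion below, they may also need an explicit opt-in macro).
#if HIP_VERSION_MAJOR > 6 || (HIP_VERSION_MAJOR == 6 && HIP_VERSION_MINOR >= 2)
  #define HAVE_SHFL_SYNC 1
#else
  #define HAVE_SHFL_SYNC 0
#endif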

b-sumner (Contributor) commented Jul 2, 2024

The develop implementation mentioned above has restrictions on its use that match the restrictions stated for Pascal in the CUDA guide.

ppanchad-amd reopened this Jul 2, 2024

lahwaacz commented Jul 3, 2024

The C++ Language Extensions documentation for ROCm 6.1.2 / HIP 6.1.40092 describes this as if the __sync functions were already a thing.

> Note that the __sync variants are made available in ROCm 6.2

Note that this is the only reference to ROCm 6.2 in the entire document; the following sections simply list all the _sync variants without any reference to the future ROCm version. Why are future features documented in earlier releases? It seems like somebody just copy-pasted it from NVIDIA 🤷

@schung-amd

Apologies for the unclear documentation. These functions are available and disabled by default in 6.2 as stated, usable via a preprocessor macro. If there are issues with their functionality, feel free to comment and we can reopen this thread, or you can submit a new issue.
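
For anyone landing here, a hedged example of the opt-in (the gating macro is HIP_ENABLE_WARP_SYNC_BUILTINS per my reading of the 6.2 notes; treat the name as an assumption and check your release's documentation):

// Build with: hipcc -DHIP_ENABLE_WARP_SYNC_BUILTINS test.cpp
// or define the macro before including the HIP headers, as below.
#define HIP_ENABLE_WARP_SYNC_BUILTINS
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void shflSyncTest() {
    int tid = threadIdx.x;
    // On a 64-wide AMD wavefront the full mask is 64 bits wide.
    unsigned long long mask = 0xffffffffffffffffull;
    int v = __shfl_sync(mask, tid, 5);   // every lane reads lane 5's tid
    if (tid == 0) printf("lane 0 received %d\n", v);
}

int main() {
    shflSyncTest<<<1, 64>>>();
    hipDeviceSynchronize();
    return 0;
}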
