ffi_call inside jitted function gives incorrect results #27101

staticimport · 2025-03-12T19:41:13Z

staticimport
Mar 12, 2025

I’m writing a custom CUDA kernel that is executed via ffi_call() and returns 2 tensors. It is called from within another function (actually a method on an nnx Module, if that matters). It works perfectly if I don’t jit the method/module, but the results are wrong if I do. Any pro tips for what I might be doing wrong? I haven’t done it yet, but I can try to make and share a minimal reproducing example.

Thanks!

dfm · 2025-03-12T19:50:42Z

dfm
Mar 12, 2025
Collaborator

I don't have any suggestions off the top of my head. It'll be much easier to give tips if you can share a minimal reproducing example, so please do!

1 reply

staticimport Mar 12, 2025
Author

Will do! Good chance I figure out my own mistake just through the exercise. :)

staticimport · 2025-03-13T14:33:17Z

staticimport
Mar 13, 2025
Author

So I failed to reproduce it in a minimal way, but ultimately I seem to have resolved it. Probably just some misunderstanding on my part about how to jit things properly.

fwiw I have a class Foo(nnx.Module) that has a member instance of a class Bar(nnx.Module). A few things I observed:

if I did def f(x, args): return x(args); fn = nnx.jit(f); fn(foo, args) then I found the non-cuda version of Foo.__call__ worked but my cuda impl did not.
if I did def f(args): return foo(args); fn = nnx.jit(f); fn(args) then both non-cuda and cuda returned incorrect results
I observed that if I commented out some carry (nnx.Variable) updates inside Bar.__call__ (called within Foo.__call__), both non-cuda and cuda versions returned expected results
in some of these flavors, looking at foo.carry outside of the call scope gave a different value than having the call return whatever I wanted to see as output.
finally learned to do foo = nnx.jit(Foo)(constructor args) and this seems to work for both

So yeah I guess I jitted wrong (shame that I can do it multiple ways and it runs but gives different results), but surprising at least one variant had different results cuda vs non-cuda. Maybe just some UB shaking out differently. /shrug

1 reply

dfm Mar 13, 2025
Collaborator

Thanks for the update! I agree that this seems like buggy behavior, but I wonder if it's something flax-specific? I don't know much about how nnx.jit works compared to jax.jit. I would certainly consider this a bug if it was possible to reproduce in JAX directly!

staticimport · 2025-03-17T20:30:00Z

staticimport
Mar 17, 2025
Author

So even after this, I was still getting invalid results for some of my scripts/jobs. After a lot of head-on-table banging, I think I found the core issue: I was doing cudaMemcpy and cudaMemset operations in my .cu host function to initialize outputs before launching the kernel. While cudaStreamSynchronize() fails hard, these synchronizing calls do not, so I'm left with undefined behavior. This is my best guess/understanding, anyway. I moved the logic into the kernel and it seems to be smooth sailing now. Just FYI in case anyone else stumbles on this.
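
For anyone who wants to keep the initialization on the host side rather than moving it into the kernel, one alternative is to enqueue it on the same stream as the kernel launch with cudaMemsetAsync / cudaMemcpyAsync. The sketch below is a standalone program and not the code from this thread: the kernel add_into, the launcher launch, and the stream created in main are all hypothetical stand-ins. In a real FFI handler the stream would be the one XLA passes in (the JAX FFI docs bind it with ffi::PlatformStream<cudaStream_t>), and the handler would not synchronize it. It assumes the problem is that the un-streamed cudaMemset/cudaMemcpy calls are not ordered on that stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message on any CUDA error so failures are not silently ignored.
#define CUDA_CHECK(call)                                           \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      std::fprintf(stderr, "%s failed: %s\n", #call,               \
                   cudaGetErrorString(err_));                      \
      std::exit(EXIT_FAILURE);                                     \
    }                                                              \
  } while (0)

// Hypothetical kernel: adds x into out, so out must be zeroed beforehand.
__global__ void add_into(const float* x, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] += x[i];
}

// Hypothetical launcher, standing in for the body of an FFI handler.
// Both the memset and the kernel are enqueued on `stream`, so the
// initialization is ordered before the kernel without any host-side sync.
void launch(cudaStream_t stream, const float* x, float* out, int n) {
  CUDA_CHECK(cudaMemsetAsync(out, 0, n * sizeof(float), stream));
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  add_into<<<blocks, threads, 0, stream>>>(x, out, n);
  CUDA_CHECK(cudaGetLastError());  // catches launch-configuration errors
}

int main() {
  const int n = 1024;
  const size_t bytes = n * sizeof(float);

  // Stand-in for the stream a framework would hand to the handler.
  cudaStream_t stream;
  CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

  float h_x[n];
  float h_out[n];
  for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

  float* x = nullptr;
  float* out = nullptr;
  CUDA_CHECK(cudaMalloc(&x, bytes));
  CUDA_CHECK(cudaMalloc(&out, bytes));

  // Input upload, launcher, and result download all go through `stream`.
  CUDA_CHECK(cudaMemcpyAsync(x, h_x, bytes, cudaMemcpyHostToDevice, stream));
  launch(stream, x, out, n);
  CUDA_CHECK(cudaMemcpyAsync(h_out, out, bytes, cudaMemcpyDeviceToHost, stream));

  // Synchronizing here is fine because this is the test driver, not the handler.
  CUDA_CHECK(cudaStreamSynchronize(stream));
  std::printf("out[0] = %f, out[%d] = %f\n", h_out[0], n - 1, h_out[n - 1]);

  CUDA_CHECK(cudaFree(x));
  CUDA_CHECK(cudaFree(out));
  CUDA_CHECK(cudaStreamDestroy(stream));
  return 0;
}
```

Whether this is preferable to initializing inside the kernel depends on the kernel; the point of the sketch is only that every operation touching the outputs goes through the one stream.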

0 replies
