Efficiently transferring a large array from CPU to multiple GPUs #27595

puct9 · 2025-03-30T06:53:51Z

puct9
Mar 30, 2025

The context for this questions revolves around transferring a single unpinned array on host of shape and type f32[4096, 8192] to multiple devices gpu0..7. The difficulty is that the runtime required for this action seems higher than anticipated.

The broader scope is the perform an optimizer step across the sharded data on device.

@jax.jit
def train_step(params, data):  # `data` is sharded f32[4096, 8192]
    ...

Going off this tutorial, the approach is to shard the data and the pass it to the jitted function. However, the transfer takes time.

I tried to get an idea of a reasonable level of performance to expect. Unfortunately, I don't have access to the H1/200s I usually work on on weekends so this was conducted on Colab with a TPUv2x8, but the story told by the metrics remains the same (just a magnitude off).

# small, non-sharded example
host_x_small = jnp.zeros((512, 8192), device=jax.devices("cpu")[0])
%timeit jax.device_put(host_x_small, jax.devices()[0]).block_until_ready()

Output: 3.05 ms ± 46.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

mesh = jax.sharding.Mesh(jax.devices(), ("dev"))
sharding = jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec("dev"))

# Large minibatch: 4K samples, 8K features
host_x = jnp.zeros((4096, 8192), device=jax.devices("cpu")[0])
%timeit jax.device_put(host_x, sharding).block_until_ready()

Output: 91.5 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

What does not make sense here to me is why the sharded transfer would take 30 times long as the single transfer. Naively, I would imagine the "poor man's" version could not possibly be better:

# perform transfers manually and serially
host_x_smalls = [jnp.zeros((512, 8192), device=jax.devices("cpu")[0]) for _ in range(8)]

def transfer(host_arrays):
  return [jax.device_put(x, tpu).block_until_ready() for x, tpu in zip(host_arrays, jax.devices())]

%timeit transfer(host_x_smalls)

Output: 26.9 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now this is more in line with what I would expect! But I would also imagine that Jax should have some sort of advantage when using device_put with a sharding argument (e.g., it could pipeline/overlay the transfers to different devices).

One idea I have is to attempt the following recipe:

Have the CPU buffer like before
Create 8x GPU buffers (i.e., gpu_buffers = [jax.empty((512, 8192)) for _ in range(8)])
Write some C++ to copy 8 chunks of 512x8192 to each of the GPU buffers. I have full control here and can put the transfers on separate cuda streams, maybe I will copy to pinned first, etc., and then create a binding to Python. (I am under the assumption that nanobind will provide me the Jax array's device pointer).
Recover the 4096x8192 sharded array in Jax (i.e., jax.make_array_from_single_device_arrays((4096, 8192), sharding, gpu_buffers))

To be clear, my questions are now:

Do you think the above idea would work? (If unsure, just an opinion here would be appreciated)
Why is the sharded device_put so much slower than expected?
What alternative for the transfer do I have in native Jax?
Am I missing some part of a bigger picture? Keeping in mind that my goal is to perform a step over a minibatch, and that the host to device transfer is just one step in it, is there an approach that can somehow "fold" the two steps together, and yield superior runtime performance?
A little off topic, but should I replicate shard the params (the model is really small)? For now, it just lives on cuda:0.

puct9 · 2025-03-31T14:23:21Z

puct9
Mar 31, 2025
Author

I decided to profile Jax to investigate further.

I set up the following experiment involving a single GPU.

host_x_small = jnp.zeros((512, 8192), device=jax.devices("cpu")[0])

# Specify the device
%timeit jax.device_put(host_x_small, jax.devices()[0]).block_until_ready()

# Specify a named shard with just one device
mesh = jax.sharding.Mesh([jax.devices()[0]], ("dev",))
sharding = jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec("dev"))
%timeit jax.device_put(host_x_small, sharding).block_until_ready()

# Specify a single device sharding
single_sharding = jax.sharding.SingleDeviceSharding(jax.devices()[0])
%timeit jax.device_put(host_x_small, single_sharding).block_until_ready()

Yields times
Specify the device: 4.97 ms ± 85.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Named shard: 7.1 ms ± 150 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Single device shard: 7.2 ms ± 201 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We can observe that the implementations using any sharding is consistently slower than specifying the device directly.

Investigating using Jax perf traces, we can see behaviour unique to the cases when sharding objects are used to describe devices:

Device:

with jax.profiler.trace("traces", create_perfetto_link=True):
    jax.device_put(host_x_small, jax.devices()[0]).block_until_ready()

We can see a fairly uninteresting call into device_put which lasts around 1.2ms, before a H2D transfer shortly after.

Named sharding:

with jax.profiler.trace("traces", create_perfetto_link=True):
    jax.device_put(host_x_small, sharding).block_until_ready()

This time, we note that there is a slightly different call chain, leading into a fairly lengthy call (>2ms) into TfrtCpuBuffer::CopyToDevice, which coincides with a D2H Dispatch in a separate thread. My suspicion is therefore that Jax requires some additional machinery with possible ITC and more overhead when moving arrays using sharding descriptions.

with jax.profiler.trace("traces", create_perfetto_link=True):
    jax.device_put(host_x_small, single_sharding).block_until_ready()

This result was effectively identical to the named sharding case.

0 replies

YusukeSuzuki · 2025-05-21T05:55:40Z

YusukeSuzuki
May 21, 2025

I am also stuck with the same problem of poor performance with device_put(data, sharding).

I was gutsy and used time.time() and print() to profile. And I've found out that the execution time is being consumed by calls to shard_arg_handlers in shard_args() in jax/jax/_src/interpreters/pxla.py.

jax/jax/_src/interpreters/pxla.py

Line 160 in eef1f6c

outs = shard_arg_handlers[t](a, s, l, cs)

I don't think I can follow it from here as I am not at all proficient in XLA. But what is strange to me is that the data is internally passed around in a numpy ndarray.

In my case, I wrote a pre-process pipeline with td.data.Dataset, and then I wrote code to make it into a jnp.array without copying via dlpack and feed it to the GPU with device_put(). Why is jnp.array, which was created efficiently, turned into np.ndarray? I will look into it some more.

Anyway, there is something wrong with device_put(data, sharding) and I feel that it would make everyone happy to solve the root problem rather than looking for a workaround.

This post is translated from Japanese to English at DeepL.

Translated with DeepL.com (free version)

1 reply

YusukeSuzuki May 21, 2025

The statement that it is converted to ndarray may be due to my error. There is a part of the initialization that uses ndarray instead of jax's array, and I think I mistakenly thought that the whole thing was handled in ndarray when I saw the log of it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Efficiently transferring a large array from CPU to multiple GPUs #27595

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Efficiently transferring a large array from CPU to multiple GPUs #27595

Uh oh!

Uh oh!

puct9 Mar 30, 2025

Replies: 2 comments · 1 reply

Uh oh!

puct9 Mar 31, 2025 Author

Uh oh!

YusukeSuzuki May 21, 2025

Uh oh!

YusukeSuzuki May 21, 2025

puct9
Mar 30, 2025

Replies: 2 comments 1 reply

puct9
Mar 31, 2025
Author

YusukeSuzuki
May 21, 2025