[shard_map + jax.lax.scan + vjp] How to reduce the number of communications #29608
Unanswered
PhilipVinc asked this question in Q&A
Replies: 1 comment
-
I don't see why pvary would not be allowed? ...
# ——— shard_map + lax.scan version ———
def vjp_scan(W, xs_shard, v_shard):
    print(W.shape, xs_shard.shape, v_shard.shape)
    # reshape each device's slice into (num_batches, batch_size, ...)
    num_batches = xs_shard.shape[0] // batch_size
    xs_chunks = xs_shard.reshape((num_batches, batch_size, M))
    v_chunks = v_shard.reshape((num_batches, batch_size))
    W = jax.lax.pvary(W, 's')

    # scan body: compute local gradient chunk and add to accumulator
    def body(acc, inputs):
        xs_c, v_c = inputs
        _, vjp_fn = jax.vjp(lambda w: model(w, xs_c), W)
        gW = vjp_fn(v_c)[0]
        return acc + gW, None

    # run the scan over all chunks
    grad_shard, _ = lax.scan(body, jax.lax.pvary(jnp.zeros_like(W), 's'), (xs_chunks, v_chunks))

    # do one all-reduce (psum) across the "s" devices
    return jax.lax.psum(grad_shard, "s")

# wrap it up in a shard_map
vjp_sm = shard_map(
    vjp_scan,
    mesh=mesh,
    in_specs=(P(None), P("s"), P("s")),
    out_specs=P(None),
)
...
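For anyone who wants to try this suggestion, here is a self-contained sketch of the setup the snippet assumes (a toy linear model, illustrative sizes M, N and batch_size, and a one-axis mesh named "s"; none of this is from the original reply). Whether pvary is accepted here is exactly the open question of this thread, so treat it as a reproduction harness rather than a confirmed solution; it assumes a recent JAX release that provides jax.shard_map and jax.lax.pvary.

# Reproduction harness for the snippet above (illustrative names and sizes).
# Requires a recent JAX with jax.shard_map and jax.lax.pvary.
import jax
import jax.numpy as jnp
from jax import lax, shard_map
from jax.sharding import PartitionSpec as P

M, N, batch_size = 16, 1024, 32
mesh = jax.make_mesh((jax.device_count(),), ("s",))

def model(w, x):
    # Toy stand-in for the real model: f(W: f32[M], x: f32[..., M]) -> f32[...]
    return x @ w

def vjp_scan(W, xs_shard, v_shard):
    # Reshape each device's slice into (num_batches, batch_size, M).
    num_batches = xs_shard.shape[0] // batch_size
    xs_chunks = xs_shard.reshape((num_batches, batch_size, M))
    v_chunks = v_shard.reshape((num_batches, batch_size))
    W = lax.pvary(W, "s")  # mark the replicated W as device-varying

    def body(acc, inputs):
        xs_c, v_c = inputs
        _, vjp_fn = jax.vjp(lambda w: model(w, xs_c), W)
        return acc + vjp_fn(v_c)[0], None

    # Accumulate per-device partial gradients, then reduce once at the end.
    grad_shard, _ = lax.scan(body, lax.pvary(jnp.zeros_like(W), "s"), (xs_chunks, v_chunks))
    return lax.psum(grad_shard, "s")

vjp_sm = shard_map(vjp_scan, mesh=mesh, in_specs=(P(None), P("s"), P("s")), out_specs=P(None))

W = jnp.ones((M,))
xs = jnp.linspace(0.0, 1.0, N * M).reshape(N, M)
v = jnp.ones((N,))

# If this compiles, the result should match the plain full-batch vjp.
grad_sharded = jax.jit(vjp_sm)(W, xs, v)
grad_full = jax.vjp(lambda w: model(w, xs), W)[1](v)[0]
print(jnp.allclose(grad_sharded, grad_full, atol=1e-4))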
-
I have a case where I have a function f(W: f32[M], x: f32[N,M]) -> f32[N] and I want to compute the vjp, vjp(f, W, xs)(v: f32[N]) -> f32[M]. To lower the memory cost, I can perform the vjp in K batches of size N/K instead of N and reduce the output. The natural way to implement this is using jax.lax.scan.
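For illustration (a minimal sketch, not part of the original question), here is the unsharded version of that idea, assuming a toy linear model and that K divides N: the vjp is evaluated chunk by chunk inside lax.scan and the W-cotangents are summed in the carry.

# Minimal sketch of the unsharded batched vjp (illustrative model and sizes).
import jax
import jax.numpy as jnp
from jax import lax

def model(W, x):
    # Toy stand-in: f(W: f32[M], x: f32[..., M]) -> f32[...]
    return x @ W

def vjp_batched(W, xs, v, K):
    N, M = xs.shape
    xs_chunks = xs.reshape(K, N // K, M)  # assumes K divides N
    v_chunks = v.reshape(K, N // K)

    def body(acc, chunk):
        x_c, v_c = chunk
        # vjp of one chunk only: peak memory scales with N/K instead of N
        _, vjp_fn = jax.vjp(lambda w: model(w, x_c), W)
        return acc + vjp_fn(v_c)[0], None

    gW, _ = lax.scan(body, jnp.zeros_like(W), (xs_chunks, v_chunks))
    return gW

W = jnp.ones((8,))
xs = jnp.linspace(0.0, 1.0, 16 * 8).reshape(16, 8)
v = jnp.ones((16,))
# Matches the full-batch vjp.
print(jnp.allclose(vjp_batched(W, xs, v, K=4),
                   jax.vjp(lambda w: model(w, xs), W)[1](v)[0]))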
I now want to combine this with sharding the N dimension, so the notation would be f(W: f32[M], x: f32[N@s,M]) -> f32[N@s]. However, jax.lax.scan does not support sharding across the scanned (first) dimension, so I resort to shard_map. My issue is the following: the vjp of the replicated parameter W: f32[M] does an all-reduce (psum), correctly. In the MWE below there is therefore an all-reduce (psum) for every iteration of the scan.
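To make the problem concrete, here is a guess at the shape of such a version (a reconstruction, not the OP's actual MWE), reusing model, mesh, M and batch_size from the harness under the reply above: the same scan, but with W left replicated, so the vjp with respect to W introduces a collective in every scan iteration.

# Reconstruction of the pattern described in the question (not the original
# MWE); reuses model, mesh, M and batch_size from the harness above.
def vjp_scan_replicated(W, xs_shard, v_shard):
    num_batches = xs_shard.shape[0] // batch_size
    xs_chunks = xs_shard.reshape((num_batches, batch_size, M))
    v_chunks = v_shard.reshape((num_batches, batch_size))

    def body(acc, inputs):
        xs_c, v_c = inputs
        # W stays replicated here, so the vjp w.r.t. W performs an all-reduce
        # (the psum_invariant the OP mentions below) in every scan iteration.
        _, vjp_fn = jax.vjp(lambda w: model(w, xs_c), W)
        return acc + vjp_fn(v_c)[0], None

    grad, _ = lax.scan(body, jnp.zeros_like(W), (xs_chunks, v_chunks))
    return grad  # already reduced inside the scan, so no final psum

vjp_sm_replicated = shard_map(
    vjp_scan_replicated, mesh=mesh, in_specs=(P(None), P("s"), P("s")), out_specs=P(None)
)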
But this is doing more work than necessary: I don't need to all-reduce at every iteration; I could instead accumulate the vjp separately on every device and then all-reduce the per-device partial sums once. However, I don't know how to code this in JAX. I feel I should declare W as jax.lax.pvary, but this is not correct. Does anybody have some insight?
To maybe add some more context: if I look at the jaxpr of the vjp_sm function above, I see that there is a psum_invariant inside the scan loop. How can I change my code to remove it (and then add one outside the scan loop)?