Fixes for multi-node execution with torchrun + LocalExecutor in Slurm environment #251

pramodk · 2025-06-01T20:21:24Z

Summary

- Execute the job's "prepare" stage only from a single process (the first rank).
- For the --node-rank argument to torchrun, also look for SLURM_NODEID.

Issues Addressed

Multi-node execution with torchrun + LocalExecutor was mentioned in #130, but I don't think this feature has been tested thoroughly. This PR fixes two issues that I saw while testing multi-node execution with torchrun + LocalExecutor:

1. args to torchrun are not expanded properly

[NeMo W 2025-06-01 08:41:50 nemo_logging:361] /root/nemo-conda-home/nemo-conda-install-june1/conda_envs/nemo/lib/python3.10/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
      warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

──────────────────────────────────────────────── Entering Experiment nemotron3_4b_pretraining with id: nemotron3_4b_pretraining_1748767313 ─────────────────────────────────────────────────
[08:41:53] INFO     Log directory is: /root/.nemo_run/experiments/nemotron3_4b_pretraining/nemotron3_4b_pretraining_1748767313/nemotron3_4b_pretraining               local_scheduler.py:777
[08:41:53] Launching job nemotron3_4b_pretraining for experiment nemotron3_4b_pretraining                                                                                  experiment.py:771
           INFO     Log directory is: /root/.nemo_run/experiments/nemotron3_4b_pretraining/nemotron3_4b_pretraining_1748767313/nemotron3_4b_pretraining               local_scheduler.py:777
           INFO     Launched app: local_persistent://nemo_run/nemotron3_4b_pretraining-xhsmxvn2slqt7                                                                         launcher.py:111
─────────────────────────────────────────────────────────── Waiting for Experiment nemotron3_4b_pretraining_1748767313 to finish ───────────────────────────────────────────────────────────

Experiment Status for nemotron3_4b_pretraining_1748767313

Task 0: nemotron3_4b_pretraining
- Status: RUNNING
- Executor: LocalExecutor
- Job id: nemotron3_4b_pretraining-xhsmxvn2slqt7
- Local Directory: /root/.nemo_run/experiments/nemotron3_4b_pretraining/nemotron3_4b_pretraining_1748767313/nemotron3_4b_pretraining

           INFO     Waiting for job nemotron3_4b_pretraining-xhsmxvn2slqt7 to finish [log=True]...                                                                           launcher.py:131
retraining/0 usage: torchrun [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
retraining/0                 [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT]
retraining/0                 [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone]
retraining/0                 [--max-restarts MAX_RESTARTS]
retraining/0                 [--monitor-interval MONITOR_INTERVAL]
retraining/0                 [--start-method {spawn,fork,forkserver}] [--role ROLE] [-m]
retraining/0                 [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
retraining/0                 [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER]
retraining/0                 [--node-rank NODE_RANK] [--master-addr MASTER_ADDR]
retraining/0                 [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
retraining/0                 [--logs-specs LOGS_SPECS]
retraining/0                 training_script ...
retraining/0 torchrun: error: argument --node-rank/--node_rank: invalid int value: '$${node_rank_var}'
[08:41:55] INFO     Job nemotron3_4b_pretraining-xhsmxvn2slqt7 finished: FAILED                                                                                              launcher.py:161

In this case, the placeholder $${node_rank_var} is not expanded, so torchrun receives the literal string instead of an integer node rank.
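
For illustration, the fallback described in the summary (also consulting SLURM_NODEID for --node-rank) could look roughly like the sketch below. It is not the NeMo-Run implementation: the helper name `resolve_node_rank` and the generic `NODE_RANK` variable are hypothetical, and only SLURM_NODEID comes from this PR.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_node_rank(
    environ: Optional[Mapping[str, str]] = None,
    candidates: Sequence[str] = ("NODE_RANK", "SLURM_NODEID"),
) -> int:
    """Return the first integer node rank found in `candidates`, else 0.

    NODE_RANK is a hypothetical generic variable used only in this sketch;
    SLURM_NODEID is the relative node ID that Slurm exports to each task.
    """
    env = os.environ if environ is None else environ
    for name in candidates:
        value = env.get(name, "").strip()
        # Skip unset values and unexpanded placeholders such as '$${node_rank_var}'.
        if value.isdigit():
            return int(value)
    # No usable variable found: assume a single-node run.
    return 0
```

With this kind of fallback, an unexpanded placeholder is ignored and the value Slurm provides is used instead of being passed through to torchrun as a string.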

2. prepare stage is not "protected" against execution from multiple ranks / processes

We typically launch a multi-node job as:

srun -N ${SLURM_NNODES} --ntasks-per-node=1 -n ${SLURM_NNODES} python train.py

Because every srun task runs the script from the beginning, each process executes the prepare stage, and we see errors like:

Experiment Status for nemotron3_4b_pretraining_1748772668
Task 0: nemotron3_4b_pretraining
- Status: RUNNING
- Executor: LocalExecutor
- Job id: nemotron3_4b_pretraining-cpj990hrxhhwk
- Local Directory: /root/.nemo_run/experiments/nemotron3_4b_pretraining/nemotron3_4b_pretraining_1748772668/nemotron3_4b_pretraining
           INFO     Waiting for job                              launcher.py:131
                    nemotron3_4b_pretraining-cpj990hrxhhwk to
                    finish [log=True]...
─ Entering Experiment nemotron3_4b_pretraining with id: nemotron3_4b_pretrain… ─
Traceback (most recent call last):
  File "/root/nemo-conda-home/test/nemo-run/test.py", line 38, in <module>
    run_pretraining()
  File "/root/nemo-conda-home/test/nemo-run/test.py", line 35, in run_pretraining
    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")
  File "/root/nemo-conda-home/NeMo-Run/nemo_run/run/api.py", line 88, in run
    exp.run(detach=detach)
  File "/root/nemo-conda-home/NeMo-Run/nemo_run/run/experiment.py", line 655, in run
    self._prepare()
  File "/root/nemo-conda-home/NeMo-Run/nemo_run/run/experiment.py", line 421, in _prepare
    self._save_experiment(exist_ok=exist_ok)
  File "/root/nemo-conda-home/NeMo-Run/nemo_run/run/experiment.py", line 362, in _save_experiment
    os.makedirs(self._exp_dir, exist_ok=exist_ok)
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/root/.nemo_run/experiments/nemotron3_4b_pretraining/nemotron3_4b_pretraining_1748772668'
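
The failure above, and the guard that avoids it, can be shown in a small standalone sketch. This is not NeMo-Run code: `is_first_rank` and `prepare_experiment` are hypothetical names, and the sketch only assumes that Slurm exports SLURM_PROCID (0 to N-1) to every task started by srun.

```python
import os
import tempfile
from typing import Mapping, Optional


def is_first_rank(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True for rank 0, or when SLURM_PROCID is unset (a single local process)."""
    env = os.environ if environ is None else environ
    return int(env.get("SLURM_PROCID", "0")) == 0


def prepare_experiment(exp_dir: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Create exp_dir only on the first rank; return whether this rank prepared it."""
    if not is_first_rank(environ):
        return False
    # exist_ok=False reproduces the behavior in the traceback:
    # a second process creating the same directory would raise FileExistsError.
    os.makedirs(exp_dir, exist_ok=False)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        exp_dir = os.path.join(root, "experiment_0")
        # Simulate two srun tasks that share one filesystem.
        for procid in ("0", "1"):
            did_prepare = prepare_experiment(exp_dir, {"SLURM_PROCID": procid})
            print(f"SLURM_PROCID={procid}: prepared={did_prepare}")
```

The sketch does not synchronize ranks, so a real implementation may also need the other ranks to wait until rank 0 has finished preparing before they read the experiment directory.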

Testing

An example with torchrun+localexecutor:

import nemo_run as run
from nemo.collections import llm
import os

def configure_recipe(nodes, gpus_per_node):
    recipe = llm.llama32_1b.pretrain_recipe(
        dir="./checkpoints",
        name="llama_test",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )
    recipe.model.config.apply_rope_fusion = False
    recipe.trainer.val_check_interval = 20
    recipe.trainer.max_steps = 20
    recipe.trainer.log_every_n_steps = 10
    return recipe

def local_executor_torchrun(nodes, devices) -> run.LocalExecutor:
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
    }
    executor = run.LocalExecutor(ntasks_per_node=devices, nodes=nodes, launcher="torchrun", env_vars=env_vars)
    return executor

def run_pretraining():
    nodes = int(os.getenv("SLURM_NNODES", 1))
    gpus_per_node = int(os.getenv("SLURM_GPUS_ON_NODE", 1))
    recipe = configure_recipe(nodes, gpus_per_node)
    executor = local_executor_torchrun(nodes=nodes, devices=gpus_per_node)
    run.run(recipe, executor=executor, name="nemotron3_4b_pretraining")

if __name__ == "__main__":
    run_pretraining()

Job script:

#!/bin/bash
#SBATCH --job-name=torchrun_slurm
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=06:00:00
#SBATCH --partition=main
#SBATCH --account=gcore

# load environment with source installation
source /root/nemo-conda-home/nemo-run-conda-install/nemo_env.sh

# torchrun setup
nodes=( $(scontrol show hostnames $SLURM_JOB_NODELIST) )
export MASTER_ADDR=${nodes[0]}
export MASTER_PORT=9999

srun -N ${SLURM_NNODES} --ntasks-per-node=1 -n ${SLURM_NNODES} python train.py

Additional Notes

Note that further improvements might be needed, such as logging only from a single rank, because currently we see some log messages from all ranks:

[20:17:01] INFO     Log directory is:                     local_scheduler.py:777
                    /root/.nemo_run/experiments/llama32_1
                    b/llama32_1b_1748809021/llama32_1b
[20:17:01] Launching job llama32_1b for experiment llama32_1b  experiment.py:774
           INFO     Log directory is:                     local_scheduler.py:777
                    /root/.nemo_run/experiments/llama32_1
                    b/llama32_1b_1748809021/llama32_1b
           INFO     Launched app:                                launcher.py:111
                    local_persistent://nemo_run/llama32_1b-wct9c
                    6wzw2fnhd
──────────── Waiting for Experiment llama32_1b_1748809021 to finish ────────────
Experiment Status for llama32_1b_1748809021
Task 0: llama32_1b
- Status: RUNNING
- Executor: LocalExecutor
- Job id: llama32_1b-wct9c6wzw2fnhd
- Local Directory: /root/.nemo_run/experiments/llama32_1b/llama32_1b_1748809021/llama32_1b

           INFO     Waiting for job llama32_1b-wct9c6wzw2fnhd to launcher.py:131
                    finish [log=True]...
llama32_1b/0 I0601 20:17:03.062000 616467 site-packages/torch/distributed/run.py:649] Using nproc_per_node=8.
llama32_1b/0 W0601 20:17:03.062000 616467 site-packages/torch/distributed/run.py:766]
llama32_1b/0 W0601 20:17:03.062000 616467 site-packages/torch/distributed/run.py:766] *****************************************
llama32_1b/0 W0601 20:17:03.062000 616467 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
llama32_1b/0 W0601 20:17:03.062000 616467 site-packages/torch/distributed/run.py:766] *****************************************
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195] Starting elastic_operator with launch configs:
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   entrypoint       : nemo_run.core.runners.fdl_runner
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   min_nodes        : 2
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   max_nodes        : 2
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   nproc_per_node   : 8
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   run_id           : 6844
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   rdzv_backend     : c10d
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   rdzv_endpoint    : worker-0:9999
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   rdzv_configs     : {'timeout': 900}
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   max_restarts     : 0
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   monitor_interval : 0.1
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   log_dir          : /root/.nemo_run/experiments/llama32_1b/llama32_1b_1748809021/llama32_1b/nemo_run/llama32_1b-wct9c6wzw2fnhd/torchelastic/llama32_1b
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]   metrics_cfg      : {}
llama32_1b/0 I0601 20:17:03.066000 616467 site-packages/torch/distributed/launcher/api.py:195]
llama32_1b/0 I0601 20:17:03.072000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:860] [default] starting workers for entrypoint: python3.10
llama32_1b/0 I0601 20:17:03.073000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:677] [default] Rendezvous'ing worker group
──────── Entering Experiment llama32_1b with id: llama32_1b_1748809025 ─────────
[20:17:06] INFO     Log directory is:                     local_scheduler.py:777
                    /root/.nemo_run/experiments/llama32_1
                    b/llama32_1b_1748809025/llama32_1b
[20:17:06] Launching job llama32_1b for experiment llama32_1b  experiment.py:774
           INFO     Log directory is:                     local_scheduler.py:777
                    /root/.nemo_run/experiments/llama32_1
                    b/llama32_1b_1748809025/llama32_1b
           INFO     Launched app:                                launcher.py:111
                    local_persistent://nemo_run/llama32_1b-f33qx
                    zxr2ng7wc
──────────── Waiting for Experiment llama32_1b_1748809025 to finish ────────────

Experiment Status for llama32_1b_1748809025

Task 0: llama32_1b
- Status: RUNNING
- Executor: LocalExecutor
- Job id: llama32_1b-f33qxzxr2ng7wc
- Local Directory: /root/.nemo_run/experiments/llama32_1b/llama32_1b_1748809025/llama32_1b

           INFO     Waiting for job llama32_1b-f33qxzxr2ng7wc to launcher.py:131
                    finish [log=True]...
llama32_1b/0 I0601 20:17:07.561000 546423 site-packages/torch/distributed/run.py:649] Using nproc_per_node=8.
llama32_1b/0 W0601 20:17:07.561000 546423 site-packages/torch/distributed/run.py:766]
llama32_1b/0 W0601 20:17:07.561000 546423 site-packages/torch/distributed/run.py:766] *****************************************
llama32_1b/0 W0601 20:17:07.561000 546423 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
llama32_1b/0 W0601 20:17:07.561000 546423 site-packages/torch/distributed/run.py:766] *****************************************
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195] Starting elastic_operator with launch configs:
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   entrypoint       : nemo_run.core.runners.fdl_runner
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   min_nodes        : 2
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   max_nodes        : 2
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   nproc_per_node   : 8
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   run_id           : 6844
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   rdzv_backend     : c10d
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   rdzv_endpoint    : worker-0:9999
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   rdzv_configs     : {'timeout': 900}
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   max_restarts     : 0
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   monitor_interval : 0.1
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   log_dir          : /root/.nemo_run/experiments/llama32_1b/llama32_1b_1748809025/llama32_1b/nemo_run/llama32_1b-f33qxzxr2ng7wc/torchelastic/llama32_1b
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]   metrics_cfg      : {}
llama32_1b/0 I0601 20:17:07.564000 546423 site-packages/torch/distributed/launcher/api.py:195]
llama32_1b/0 I0601 20:17:07.596000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:860] [default] starting workers for entrypoint: python3.10
llama32_1b/0 I0601 20:17:07.596000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:677] [default] Rendezvous'ing worker group
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   restart_count=0
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   master_addr=worker-0.slurm-cluster-worker-svc.slurm-cluster.svc.cluster.local
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   master_port=40009
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   group_rank=0
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   group_world_size=2
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]   global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:525]
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/api.py:685] [default] Starting worker group
llama32_1b/0 I0601 20:17:08.127000 616467 site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py:298] use_agent_store: True
llama32_1b/0 I0601 20:17:08.128000 616467 site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
llama32_1b/0 I0601 20:17:08.128000 616467 site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py:236] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   restart_count=0
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   master_addr=worker-0.slurm-cluster-worker-svc.slurm-cluster.svc.cluster.local
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   master_port=40009
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   group_rank=1
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   group_world_size=2
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   role_ranks=[8, 9, 10, 11, 12, 13, 14, 15]
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   global_ranks=[8, 9, 10, 11, 12, 13, 14, 15]
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   role_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]   global_world_sizes=[16, 16, 16, 16, 16, 16, 16, 16]
llama32_1b/0 I0601 20:17:08.128000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:525]
llama32_1b/0 I0601 20:17:08.129000 546423 site-packages/torch/distributed/elastic/agent/server/api.py:685] [default] Starting worker group
llama32_1b/0 I0601 20:17:08.129000 546423 site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py:298] use_agent_store: True
llama32_1b/0 I0601 20:17:08.129000 546423 site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py:192] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
llama32_1b/0 I0601 20:17:08.129000 546423 site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py:236] Environment variable 'TORCHELASTIC_HEALTH_CHECK_PORT' not found. Do not start health check.
llama32_1b/0 [default0]:[NeMo W 2025-06-01 20:17:21 nemo_logging:361] /root/nemo-conda-home/nemo-run-conda-install/conda_envs/nemo/lib/python3.10/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work

But with this PR, I wanted to at least get a working example.
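
As a possible follow-up for the duplicated log output noted above, rank-restricted logging could be sketched with the standard library as follows. This is only an illustration and not how NeMo-Run configures its logging; `FirstRankFilter` and `make_logger` are hypothetical names, and the rank is again assumed to come from SLURM_PROCID.

```python
import logging
import os


class FirstRankFilter(logging.Filter):
    """Drop log records on every process except rank 0."""

    def __init__(self, rank: int) -> None:
        super().__init__()
        self.rank = rank

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False tells the logger to discard the record.
        return self.rank == 0


def make_logger(name: str, rank: int) -> logging.Logger:
    """Return a logger that only emits records when rank is 0."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addFilter(FirstRankFilter(rank))
    return logger


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    rank = int(os.getenv("SLURM_PROCID", "0"))
    log = make_logger("launcher_sketch", rank)
    log.info("Launching job (emitted only when SLURM_PROCID is 0 or unset)")
```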

- do prepare stage only from single process or rank
- for --node-rank, also look for SLURM_NODEID

Signed-off-by: Pramod Kumbhar <[email protected]>

hemildesai · 2025-06-03T23:58:59Z

nemo_run/run/experiment.py

-        self._prepare()
+
+        # in case of multi-node execution with LocalExecutor+torchrun+slurm, run only on first rank
+        if int(os.getenv("SLURM_PROCID", 0)) == 0:


Since experiment.py is not specific to Slurm, can you rename this env var to something like NEMORUN_SKIP_EXPERIMENT_PREPARE?

pramodk · 2025-06-04

@hemildesai: SLURM_PROCID is set automatically by Slurm for each rank (i.e. 0 to N-1). Since users will often run this under Slurm, the check works without any extra setup.

If we rename this to NEMORUN_SKIP_EXPERIMENT_PREPARE, users cannot simply set it globally, e.g. export NEMORUN_SKIP_EXPERIMENT_PREPARE=0, because every process would then see the same value and all of them would execute the prepare stage.

Typically we work around this as shown below, but that is more convoluted and launcher-dependent:

srun -n 4 bash ./run_mpi_task.sh

where run_mpi_task.sh would be

#!/bin/bash
OMPI_RANK=${OMPI_COMM_WORLD_RANK:-0}  # OpenMPI-specific
export NEMORUN_SKIP_EXPERIMENT_PREPARE=$OMPI_RANK
<actual exe>
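
One way to combine the two positions in this thread is sketched below: honor an explicit override if it is set, and otherwise fall back to the per-process rank variables exported by the launcher. This is only an illustration, not an agreed design. `should_run_prepare` is a hypothetical helper, and treating any non-zero value of NEMORUN_SKIP_EXPERIMENT_PREPARE as "skip" is an assumption that matches the wrapper script above.

```python
import os
from typing import Mapping, Optional

# Per-process rank variables set by the launcher (both are mentioned in this thread).
RANK_VARS = ("SLURM_PROCID", "OMPI_COMM_WORLD_RANK")


def should_run_prepare(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether this process should run the prepare stage."""
    env = os.environ if environ is None else environ

    # Assumed semantics: any non-zero value of the override means "skip prepare".
    override = env.get("NEMORUN_SKIP_EXPERIMENT_PREPARE")
    if override is not None:
        return int(override) == 0

    # Otherwise only the process with rank 0 prepares.
    for name in RANK_VARS:
        if name in env:
            return int(env[name]) == 0

    # No launcher information available: assume a single process.
    return True
```

With this shape, Slurm users need no extra setup, while other launchers can still control the behavior per process through the override.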

Fixes for multi-node execution with torchrun + LocalExecutor

b7d57fe

hemildesai reviewed Jun 3, 2025

bhaddow mentioned this pull request Jun 9, 2025

NCCL error when trying to run NeMo training multi node #266

Open
