feat(BA-1214): Add initial configs for distributed training in PyTorch and TensorFlow #4244

Open

rapsealk wants to merge 11 commits into base branch main

Conversation

@rapsealk rapsealk (Member) commented Apr 22, 2025

resolves #4243 (BA-1214)

This pull request introduces support for distributed training by adding environment variables for PyTorch and TensorFlow configurations. It also includes a minor import addition to support JSON serialization.

Distributed training support:

  • Added environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) to facilitate distributed training for PyTorch and TensorFlow. These variables are configured based on the cluster and kernel details in the get_image_conf function. (src/ai/backend/manager/registry.py, lines R1892-R1912)
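
For illustration only, here is a rough sketch of how the PyTorch-side variables could be derived from a predefined cluster topology; the function name, the cluster_hosts/kernel_host parameters, and the defaults are assumptions, not the actual get_image_conf() code.

```python
# Hypothetical sketch: derive the PyTorch distributed-training variables from a
# predefined cluster topology. All names here are illustrative only.
def build_pytorch_distributed_environ(
    cluster_hosts: list[str],    # hostnames of all kernels in the cluster
    kernel_host: str,            # hostname of the kernel being configured
    local_rank: int = 0,         # device index within the node
    master_port: str = "12345",  # default port used in this PR
) -> dict[str, str]:
    return {
        "WORLD_SIZE": str(len(cluster_hosts)),
        "WORLD_RANK": str(cluster_hosts.index(kernel_host)),
        "LOCAL_RANK": str(local_rank),
        "MASTER_ADDR": cluster_hosts[0],  # the first kernel acts as the rendezvous master
        "MASTER_PORT": master_port,
    }
```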

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention of the original issue
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--4244.org.readthedocs.build/en/4244/


📚 Documentation preview 📚: https://sorna-ko--4244.org.readthedocs.build/ko/4244/

@rapsealk rapsealk added this to the 25Q1 milestone Apr 22, 2025
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request adds support for distributed training by introducing environment variables for both PyTorch and TensorFlow, and includes a minor import update to facilitate JSON serialization.

  • Adds environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) for distributed training.
  • Introduces an import for dump_json_str to support JSON serialization in environment configuration.
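
As an illustration of the TF_CONFIG part: the variable is conventionally a JSON document describing the worker cluster and the current task. The helper below is hypothetical and uses json.dumps to stay self-contained; the PR itself relies on the repo's dump_json_str helper for the serialization.

```python
import json

# Hypothetical sketch of assembling TF_CONFIG for one worker; the PR serializes
# with the repo's dump_json_str helper rather than json.dumps.
def build_tf_config(cluster_hosts: list[str], worker_index: int, port: str = "12345") -> str:
    tf_config = {
        "cluster": {"worker": [f"{host}:{port}" for host in cluster_hosts]},
        "task": {"type": "worker", "index": worker_index},
    }
    return json.dumps(tf_config)

# e.g. environ["TF_CONFIG"] = build_tf_config(["node0", "node1"], worker_index=0)
```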

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • src/ai/backend/manager/registry.py: Adds new environment variable configurations and JSON serialization.
  • changes/.feature.md: Documents the feature addition for distributed training.
Comments suppressed due to low confidence (2)

src/ai/backend/manager/registry.py:1902

  • [nitpick] Consider renaming the variable in the list comprehension (e.g., to 'worker') to avoid shadowing the outer 'kernel' parameter for better clarity.
f"{kernel.cluster_hostname}:12345"

src/ai/backend/manager/registry.py:1897

  • [nitpick] Consider replacing the hardcoded port '12345' with a named constant or configurable value to improve maintainability.
"MASTER_PORT": "12345",

@github-actions github-actions bot added size:S 10~30 LoC comp:manager Related to Manager component labels Apr 22, 2025
.feature.md -> 4244.feature.md

Co-authored-by: octodog <[email protected]>
@rapsealk rapsealk changed the title feat(BA-1214): Add environment variables for distributed training feat(BA-1214): Add environment variables for PyTorch and TensorFlow distributed training Apr 22, 2025
@rapsealk rapsealk requested a review from leksikov April 22, 2025 05:18
@github-actions github-actions bot added size:M 30~100 LoC area:docs Documentations and removed size:S 10~30 LoC labels Apr 22, 2025
@leksikov leksikov (Contributor) left a comment

LGTM

.. list-table::
   :header-rows: 1

   * - Environment Variable
A Contributor left a comment

How about adding support for more variables to cover various distributed frameworks?

For PyTorch Distributed:
PYTORCH_CUDA_ALLOC_CONF - For memory allocation strategies
NCCL_DEBUG - For debugging NCCL communications
TORCH_DISTRIBUTED_DEBUG - For more verbose debugging information

For DeepSpeed:
DEEPSPEED_ZERO_STAGE - To control ZeRO optimization stages
DEEPSPEED_ALLGATHER_SIZE - For tuning communication efficiency
DEEPSPEED_CPU_OFFLOAD - To enable CPU offloading

For Horovod:
HOROVOD_FUSION_THRESHOLD - For operation fusion tuning
HOROVOD_CYCLE_TIME - For controlling cycle time
HOROVOD_CACHE_CAPACITY - For tensor fusion cache

General:
OMP_NUM_THREADS - For controlling OpenMP parallelism

@rapsealk rapsealk (Member, Author) left a comment

Thank you for the feedback.

The primary objective of this pull request is to automatically configure environment variables necessary for distributed training with major frameworks (e.g., PyTorch, TensorFlow), particularly those related to inter-worker communication. These variables are deterministic, as the container and cluster topology is predefined.

In contrast, other environment variables—such as PYTORCH_CUDA_ALLOC_CONF—may vary depending on user preferences or optimization strategies. Therefore, I believe it is better not to auto-configure these, in order to avoid unintended side effects.
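
To illustrate the distinction, tuning knobs like these can simply be read by the user's own training script with user-chosen defaults, rather than being injected by the platform; this is a generic sketch, not Backend.AI code.

```python
import os

# User-controlled tuning knobs: the training script reads them with its own
# defaults instead of expecting the platform to set them.
nccl_debug = os.environ.get("NCCL_DEBUG", "WARN")
omp_threads = int(os.environ.get("OMP_NUM_THREADS", "1"))
alloc_conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")  # e.g. "max_split_size_mb:128"
```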

Comment on lines 1898 to 1900
"environ": {
**_pytorch_distributed_environ,
**_tf_distributed_environ,
A Collaborator left a comment

Do you always apply the environment variables for PyTorch and TensorFlow?

@rapsealk rapsealk (Member, Author) left a comment

Applying them on a per-image basis would be better. Thanks!
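
A hypothetical sketch of what per-image application could look like, gating the injected variables on image metadata; the label key and helper name are assumptions, not Backend.AI's actual image-label schema.

```python
# Hypothetical: only inject a framework's environment when the image metadata
# indicates that the framework is present.
def select_distributed_environ(
    image_labels: dict[str, str],
    pytorch_environ: dict[str, str],
    tf_environ: dict[str, str],
) -> dict[str, str]:
    frameworks = image_labels.get("ai.framework", "").lower()  # label key is illustrative
    environ: dict[str, str] = {}
    if "pytorch" in frameworks:
        environ.update(pytorch_environ)
    if "tensorflow" in frameworks:
        environ.update(tf_environ)
    return environ
```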

@rapsealk rapsealk marked this pull request as draft April 24, 2025 01:40
@github-actions github-actions bot added the size:L 100~500 LoC label Apr 24, 2025
@rapsealk rapsealk requested a review from Copilot April 24, 2025 05:00
@github-actions github-actions bot added comp:common Related to Common component and removed size:M 30~100 LoC labels Apr 24, 2025
@rapsealk rapsealk marked this pull request as ready for review April 24, 2025 05:00
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for distributed training by introducing environment variables specific to PyTorch and TensorFlow configurations. Key changes include:

  • Adding two new Pydantic models, PyTorchDistributedEnviron and TensorFlowDistributedEnviron, for environment variable validation and serialization (a rough sketch is shown after this list).
  • Integrating distributed training configuration into the image configuration logic in the registry.
  • Expanding tests to cover the new distributed environment models and updating documentation accordingly.
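
For readers without the diff at hand, here is a rough sketch of what such models could look like; the field names and the model_dump() output are assumptions inferred from the environment variables listed in this PR, not the actual code in src/ai/backend/common/types.py.

```python
from typing import Any, override  # Python 3.12+; typing_extensions provides it for older versions

from pydantic import BaseModel


class PyTorchDistributedEnviron(BaseModel):
    world_size: int
    world_rank: int
    local_rank: int
    master_addr: str
    master_port: str

    @override
    def model_dump(self, **kwargs: Any) -> dict[str, Any]:
        # Emit the environment-variable names listed in this PR.
        return {
            "WORLD_SIZE": str(self.world_size),
            "WORLD_RANK": str(self.world_rank),
            "LOCAL_RANK": str(self.local_rank),
            "MASTER_ADDR": self.master_addr,
            "MASTER_PORT": self.master_port,
        }


class TensorFlowDistributedEnviron(BaseModel):
    tf_config: str  # JSON-serialized TF_CONFIG

    @override
    def model_dump(self, **kwargs: Any) -> dict[str, Any]:
        return {"TF_CONFIG": self.tf_config}
```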

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

  • tests/common/test_types.py: Added tests for new distributed training environment types
  • src/ai/backend/manager/registry.py: Integrated new environment models into the image config
  • src/ai/backend/common/types.py: Introduced PyTorchDistributedEnviron and TensorFlowDistributedEnviron
  • changes/4244.feature.md: Added feature summary for distributed training support
Files not reviewed (1)
  • docs/concepts/networking.rst: Language not supported
Comments suppressed due to low confidence (2)

src/ai/backend/manager/registry.py:1856

  • [nitpick] Consider defining a constant (e.g. DEFAULT_MASTER_PORT) for the hardcoded port value "12345" to improve maintainability and avoid potential issues if the default ever needs to change.
master_port="12345",

src/ai/backend/common/types.py:1655

  • Ensure that the 'override' decorator is imported or defined in the module to avoid a runtime error when overriding the model_dump method.
    @override
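
For reference, the decorator lives in the standard library from Python 3.12, with typing_extensions as a fallback on older interpreters; whether and how the module already imports it is not visible in this excerpt.

```python
import sys

# Make `override` importable across Python versions (typing_extensions is only
# needed on interpreters older than 3.12).
if sys.version_info >= (3, 12):
    from typing import override
else:
    from typing_extensions import override
```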

@rapsealk rapsealk changed the title feat(BA-1214): Add environment variables for PyTorch and TensorFlow distributed training feat(BA-1214): Add initial configs for distributed training in PyTorch and TensorFlow Apr 24, 2025
Labels
  • area:docs (Documentations)
  • comp:common (Related to Common component)
  • comp:manager (Related to Manager component)
  • size:L (100~500 LoC)

Projects
None yet

Development
Successfully merging this pull request may close these issues:
  • Add default environment variables for PyTorch/TensorFlow distributed training

3 participants