feat(BA-1214): Add initial configs for distributed training in PyTorch and TensorFlow #4244

Open

rapsealk wants to merge 11 commits into base branch main

Conversation

@rapsealk rapsealk (Member) commented Apr 22, 2025

resolves #4243 (BA-1214)

This pull request introduces support for distributed training by adding environment variables for PyTorch and TensorFlow configurations. It also includes a minor import addition to support JSON serialization.

Distributed training support:

  • Added environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) to facilitate distributed training for PyTorch and TensorFlow. These variables are configured based on the cluster and kernel details in the get_image_conf function. (src/ai/backend/manager/registry.py, lines R1892-R1912)
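
For illustration only, here is a rough sketch of how the PyTorch-side variables could be derived from a predefined cluster topology; the function name, the cluster_hosts/kernel_host parameters, and the defaults are assumptions, not the actual get_image_conf() code.

```python
# Hypothetical sketch: derive the PyTorch distributed-training variables from a
# predefined cluster topology. All names here are illustrative only.
def build_pytorch_distributed_environ(
    cluster_hosts: list[str],    # hostnames of all kernels in the cluster
    kernel_host: str,            # hostname of the kernel being configured
    local_rank: int = 0,         # device index within the node
    master_port: str = "12345",  # default port used in this PR
) -> dict[str, str]:
    return {
        "WORLD_SIZE": str(len(cluster_hosts)),
        "WORLD_RANK": str(cluster_hosts.index(kernel_host)),
        "LOCAL_RANK": str(local_rank),
        "MASTER_ADDR": cluster_hosts[0],  # the first kernel acts as the rendezvous master
        "MASTER_PORT": master_port,
    }
```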

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention of the original issue
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--4244.org.readthedocs.build/en/4244/


📚 Documentation preview 📚: https://sorna-ko--4244.org.readthedocs.build/ko/4244/

@rapsealk rapsealk added this to the 25Q1 milestone Apr 22, 2025
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request adds support for distributed training by introducing environment variables for both PyTorch and TensorFlow, and includes a minor import update to facilitate JSON serialization.

  • Adds environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) for distributed training.
  • Introduces an import for dump_json_str to support JSON serialization in environment configuration.
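
As an illustration of the TF_CONFIG part: the variable is conventionally a JSON document describing the worker cluster and the current task. The helper below is hypothetical and uses json.dumps to stay self-contained; the PR itself relies on the repo's dump_json_str helper for the serialization.

```python
import json

# Hypothetical sketch of assembling TF_CONFIG for one worker; the PR serializes
# with the repo's dump_json_str helper rather than json.dumps.
def build_tf_config(cluster_hosts: list[str], worker_index: int, port: str = "12345") -> str:
    tf_config = {
        "cluster": {"worker": [f"{host}:{port}" for host in cluster_hosts]},
        "task": {"type": "worker", "index": worker_index},
    }
    return json.dumps(tf_config)

# e.g. environ["TF_CONFIG"] = build_tf_config(["node0", "node1"], worker_index=0)
```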

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • src/ai/backend/manager/registry.py: Adds new environment variable configurations and JSON serialization.
  • changes/.feature.md: Documents the feature addition for distributed training.
Comments suppressed due to low confidence (2)

src/ai/backend/manager/registry.py:1902

  • [nitpick] Consider renaming the variable in the list comprehension (e.g., to 'worker') to avoid shadowing the outer 'kernel' parameter for better clarity.
f"{kernel.cluster_hostname}:12345"

src/ai/backend/manager/registry.py:1897

  • [nitpick] Consider replacing the hardcoded port '12345' with a named constant or configurable value to improve maintainability.
"MASTER_PORT": "12345",

@github-actions github-actions bot added size:S 10~30 LoC comp:manager Related to Manager component labels Apr 22, 2025
.feature.md -> 4244.feature.md

Co-authored-by: octodog <[email protected]>
@rapsealk rapsealk changed the title feat(BA-1214): Add environment variables for distributed training feat(BA-1214): Add environment variables for PyTorch and TensorFlow distributed training Apr 22, 2025
@rapsealk rapsealk requested a review from leksikov April 22, 2025 05:18
@github-actions github-actions bot added size:M 30~100 LoC area:docs Documentations and removed size:S 10~30 LoC labels Apr 22, 2025
@leksikov leksikov (Contributor) left a comment

LGTM

.. list-table::
   :header-rows: 1

   * - Environment Variable
A Contributor left a comment

How about adding support for more variables to cover various distributed frameworks?

For PyTorch Distributed:
PYTORCH_CUDA_ALLOC_CONF - For memory allocation strategies
NCCL_DEBUG - For debugging NCCL communications
TORCH_DISTRIBUTED_DEBUG - For more verbose debugging information

For DeepSpeed:
DEEPSPEED_ZERO_STAGE - To control ZeRO optimization stages
DEEPSPEED_ALLGATHER_SIZE - For tuning communication efficiency
DEEPSPEED_CPU_OFFLOAD - To enable CPU offloading

For Horovod:
HOROVOD_FUSION_THRESHOLD - For operation fusion tuning
HOROVOD_CYCLE_TIME - For controlling cycle time
HOROVOD_CACHE_CAPACITY - For tensor fusion cache

General:
OMP_NUM_THREADS - For controlling OpenMP parallelism

@rapsealk rapsealk (Member, Author) left a comment

Thank you for the feedback.

The primary objective of this pull request is to automatically configure environment variables necessary for distributed training with major frameworks (e.g., PyTorch, TensorFlow), particularly those related to inter-worker communication. These variables are deterministic, as the container and cluster topology is predefined.

In contrast, other environment variables—such as PYTORCH_CUDA_ALLOC_CONF—may vary depending on user preferences or optimization strategies. Therefore, I believe it is better not to auto-configure these, in order to avoid unintended side effects.
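
To illustrate the distinction, tuning knobs like these can simply be read by the user's own training script with user-chosen defaults, rather than being injected by the platform; this is a generic sketch, not Backend.AI code.

```python
import os

# User-controlled tuning knobs: the training script reads them with its own
# defaults instead of expecting the platform to set them.
nccl_debug = os.environ.get("NCCL_DEBUG", "WARN")
omp_threads = int(os.environ.get("OMP_NUM_THREADS", "1"))
alloc_conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")  # e.g. "max_split_size_mb:128"
```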

Comment on lines 1898 to 1900
"environ": {
**_pytorch_distributed_environ,
**_tf_distributed_environ,
A Collaborator left a comment

Do you always apply the environment variables for PyTorch and TensorFlow?

@rapsealk rapsealk (Member, Author) left a comment

Applying them on a per-image basis would be better. Thanks!
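
A hypothetical sketch of what per-image application could look like, gating the injected variables on image metadata; the label key and helper name are assumptions, not Backend.AI's actual image-label schema.

```python
# Hypothetical: only inject a framework's environment when the image metadata
# indicates that the framework is present.
def select_distributed_environ(
    image_labels: dict[str, str],
    pytorch_environ: dict[str, str],
    tf_environ: dict[str, str],
) -> dict[str, str]:
    frameworks = image_labels.get("ai.framework", "").lower()  # label key is illustrative
    environ: dict[str, str] = {}
    if "pytorch" in frameworks:
        environ.update(pytorch_environ)
    if "tensorflow" in frameworks:
        environ.update(tf_environ)
    return environ
```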

@rapsealk rapsealk marked this pull request as draft April 24, 2025 01:40
@github-actions github-actions bot added the size:L 100~500 LoC label Apr 24, 2025
@rapsealk rapsealk requested a review from Copilot April 24, 2025 05:00
@github-actions github-actions bot added comp:common Related to Common component and removed size:M 30~100 LoC labels Apr 24, 2025
@rapsealk rapsealk marked this pull request as ready for review April 24, 2025 05:00
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for distributed training by introducing environment variables specific to PyTorch and TensorFlow configurations. Key changes include:

  • Adding two new Pydantic models, PyTorchDistributedEnviron and TensorFlowDistributedEnviron, for environment variable validation and serialization (a rough sketch is shown after this list).
  • Integrating distributed training configuration into the image configuration logic in the registry.
  • Expanding tests to cover the new distributed environment models and updating documentation accordingly.
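
For readers without the diff at hand, here is a rough sketch of what such models could look like; the field names and the model_dump() output are assumptions inferred from the environment variables listed in this PR, not the actual code in src/ai/backend/common/types.py.

```python
from typing import Any, override  # Python 3.12+; typing_extensions provides it for older versions

from pydantic import BaseModel


class PyTorchDistributedEnviron(BaseModel):
    world_size: int
    world_rank: int
    local_rank: int
    master_addr: str
    master_port: str

    @override
    def model_dump(self, **kwargs: Any) -> dict[str, Any]:
        # Emit the environment-variable names listed in this PR.
        return {
            "WORLD_SIZE": str(self.world_size),
            "WORLD_RANK": str(self.world_rank),
            "LOCAL_RANK": str(self.local_rank),
            "MASTER_ADDR": self.master_addr,
            "MASTER_PORT": self.master_port,
        }


class TensorFlowDistributedEnviron(BaseModel):
    tf_config: str  # JSON-serialized TF_CONFIG

    @override
    def model_dump(self, **kwargs: Any) -> dict[str, Any]:
        return {"TF_CONFIG": self.tf_config}
```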

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

  • tests/common/test_types.py: Added tests for new distributed training environment types
  • src/ai/backend/manager/registry.py: Integrated new environment models into the image config
  • src/ai/backend/common/types.py: Introduced PyTorchDistributedEnviron and TensorFlowDistributedEnviron
  • changes/4244.feature.md: Added feature summary for distributed training support
Files not reviewed (1)
  • docs/concepts/networking.rst: Language not supported
Comments suppressed due to low confidence (2)

src/ai/backend/manager/registry.py:1856

  • [nitpick] Consider defining a constant (e.g. DEFAULT_MASTER_PORT) for the hardcoded port value "12345" to improve maintainability and avoid potential issues if the default ever needs to change.
master_port="12345",

src/ai/backend/common/types.py:1655

  • Ensure that the 'override' decorator is imported or defined in the module to avoid a runtime error when overriding the model_dump method.
    @override
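
For reference, the decorator lives in the standard library from Python 3.12, with typing_extensions as a fallback on older interpreters; whether and how the module already imports it is not visible in this excerpt.

```python
import sys

# Make `override` importable across Python versions (typing_extensions is only
# needed on interpreters older than 3.12).
if sys.version_info >= (3, 12):
    from typing import override
else:
    from typing_extensions import override
```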

@rapsealk rapsealk changed the title feat(BA-1214): Add environment variables for PyTorch and TensorFlow distributed training feat(BA-1214): Add initial configs for distributed training in PyTorch and TensorFlow Apr 24, 2025
Labels
  • area:docs (Documentations)
  • comp:common (Related to Common component)
  • comp:manager (Related to Manager component)
  • size:L (100~500 LoC)

Projects
None yet

Development
Successfully merging this pull request may close these issues:
  • Add default environment variables for PyTorch/TensorFlow distributed training

3 participants