feat(BA-1214): Add initial configs for distributed training in PyTorch and TensorFlow #4244
Conversation
Pull Request Overview
This pull request adds support for distributed training by introducing environment variables for both PyTorch and TensorFlow, and includes a minor import update to facilitate JSON serialization.
- Adds environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) for distributed training.
- Introduces an import for dump_json_str to support JSON serialization in environment configuration.
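As an illustration of the shape of this injection, here is a minimal, self-contained sketch; the function and field names are assumptions rather than the PR's actual code, and `json.dumps` stands in for the `dump_json_str` helper mentioned above:

```python
# Hypothetical sketch of the environment map described above; the real code in
# registry.py may differ. json.dumps stands in for dump_json_str.
import json

def build_distributed_environ(rank: int, hosts: list[str], port: int = 12345) -> dict[str, str]:
    """Derive per-worker distributed-training variables from the cluster topology."""
    return {
        # PyTorch (torch.distributed) rendezvous variables
        "WORLD_SIZE": str(len(hosts)),
        "WORLD_RANK": str(rank),
        "LOCAL_RANK": "0",               # assumes one worker process per container
        "MASTER_ADDR": hosts[0],         # rank-0 host acts as the rendezvous master
        "MASTER_PORT": str(port),
        # TensorFlow MultiWorkerMirroredStrategy reads a JSON-encoded TF_CONFIG
        "TF_CONFIG": json.dumps({
            "cluster": {"worker": [f"{h}:{port}" for h in hosts]},
            "task": {"type": "worker", "index": rank},
        }),
    }

# Example: a 2-node cluster, seen from rank 0
print(build_distributed_environ(0, ["main1", "sub1"]))
```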
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/ai/backend/manager/registry.py | Adds new environment variable configurations and JSON serialization. |
| changes/.feature.md | Documents the feature addition for distributed training. |
Comments suppressed due to low confidence (2)
src/ai/backend/manager/registry.py:1902
- [nitpick] Consider renaming the variable in the list comprehension (e.g., to 'worker') to avoid shadowing the outer 'kernel' parameter for better clarity.
f"{kernel.cluster_hostname}:12345"
src/ai/backend/manager/registry.py:1897
- [nitpick] Consider replacing the hardcoded port '12345' with a named constant or configurable value to improve maintainability.
"MASTER_PORT": "12345",
.feature.md -> 4244.feature.md Co-authored-by: octodog <[email protected]>
LGTM
.. list-table::
   :header-rows: 1

   * - Environment Variable
How about adding support for more variables to cover various distributed frameworks?
For PyTorch Distributed:
PYTORCH_CUDA_ALLOC_CONF - For memory allocation strategies
NCCL_DEBUG - For debugging NCCL communications
TORCH_DISTRIBUTED_DEBUG - For more verbose debugging information
For DeepSpeed:
DEEPSPEED_ZERO_STAGE - To control ZeRO optimization stages
DEEPSPEED_ALLGATHER_SIZE - For tuning communication efficiency
DEEPSPEED_CPU_OFFLOAD - To enable CPU offloading
For Horovod:
HOROVOD_FUSION_THRESHOLD - For operation fusion tuning
HOROVOD_CYCLE_TIME - For controlling cycle time
HOROVOD_CACHE_CAPACITY - For tensor fusion cache
General:
OMP_NUM_THREADS - For controlling OpenMP parallelism
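By way of illustration, these would be plain environment variables whose values depend on the workload and hardware; the values below are examples only, not recommendations:

```python
# Example (illustrative values only): tuning-oriented variables a user might set
# per session, as opposed to the topology-derived variables this PR injects.
user_tuning_environ = {
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128",
    "NCCL_DEBUG": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "HOROVOD_FUSION_THRESHOLD": "67108864",   # bytes; 64 MiB is Horovod's default
    "OMP_NUM_THREADS": "8",
}
```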
Thank you for the feedback.
The primary objective of this pull request is to automatically configure environment variables necessary for distributed training with major frameworks (e.g., PyTorch, TensorFlow), particularly those related to inter-worker communication. These variables are deterministic, as the container and cluster topology is predefined.
In contrast, other environment variables, such as PYTORCH_CUDA_ALLOC_CONF, may vary depending on user preferences or optimization strategies. Therefore, I believe it is better not to auto-configure these, in order to avoid unintended side effects.
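For instance, the TF_CONFIG value is fully determined once the cluster topology is known; for a two-container session it would look roughly like this (hostnames and port are illustrative):

```python
import json

# Deterministic: derived entirely from the predefined container/cluster topology.
tf_config = {
    "cluster": {"worker": ["main1:12345", "sub1:12345"]},
    "task": {"type": "worker", "index": 0},   # this container's own rank
}
environ = {"TF_CONFIG": json.dumps(tf_config)}
print(environ["TF_CONFIG"])
```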
src/ai/backend/manager/registry.py (Outdated)

    "environ": {
        **_pytorch_distributed_environ,
        **_tf_distributed_environ,
Do you always apply the environment variables for both PyTorch and TensorFlow?
Per-image basis would be better. Thanks!
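A rough sketch of the per-image idea (purely illustrative; the label key and check are assumptions, not the repository's actual mechanism):

```python
# Hypothetical per-image gating: only inject the variables relevant to the image.
def select_distributed_environ(
    image_labels: dict[str, str],
    pytorch_environ: dict[str, str],
    tf_environ: dict[str, str],
) -> dict[str, str]:
    frameworks = image_labels.get("ai.backend.frameworks", "")  # assumed label key
    environ: dict[str, str] = {}
    if "pytorch" in frameworks:
        environ.update(pytorch_environ)
    if "tensorflow" in frameworks:
        environ.update(tf_environ)
    return environ

labels = {"ai.backend.frameworks": "pytorch"}            # hypothetical image label
print(select_distributed_environ(labels, {"WORLD_SIZE": "2"}, {"TF_CONFIG": "{}"}))
```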
Pull Request Overview
This PR adds support for distributed training by introducing environment variables specific to PyTorch and TensorFlow configurations. Key changes include:
- Adding two new Pydantic models, PyTorchDistributedEnviron and TensorFlowDistributedEnviron, for environment variable validation and serialization.
- Integrating distributed training configuration into the image configuration logic in the registry.
- Expanding tests to cover the new distributed environment models and updating documentation accordingly.
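A sketch of what such models could look like; the field names are inferred from the variables listed earlier, and the actual definitions in src/ai/backend/common/types.py may differ:

```python
from pydantic import BaseModel

class PyTorchDistributedEnviron(BaseModel):
    # Field names inferred from the PR description; the real model may differ.
    world_size: int
    world_rank: int
    local_rank: int
    master_addr: str
    master_port: str

    def to_environ(self) -> dict[str, str]:
        # Upper-cased env-var mapping: WORLD_SIZE, WORLD_RANK, MASTER_ADDR, ...
        return {k.upper(): str(v) for k, v in self.model_dump().items()}

class TensorFlowDistributedEnviron(BaseModel):
    tf_config: str   # JSON-encoded TF_CONFIG payload

environ = PyTorchDistributedEnviron(
    world_size=2, world_rank=0, local_rank=0,
    master_addr="main1", master_port="12345",
).to_environ()
print(environ)
```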
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/common/test_types.py | Added tests for new distributed training environment types |
| src/ai/backend/manager/registry.py | Integrated new environment models into the image config |
| src/ai/backend/common/types.py | Introduced PyTorchDistributedEnviron and TensorFlowDistributedEnviron |
| changes/4244.feature.md | Added feature summary for distributed training support |
Files not reviewed (1)
- docs/concepts/networking.rst: Language not supported
Comments suppressed due to low confidence (2)
src/ai/backend/manager/registry.py:1856
- [nitpick] Consider defining a constant (e.g. DEFAULT_MASTER_PORT) for the hardcoded port value "12345" to improve maintainability and avoid potential issues if the default ever needs to change.
master_port="12345",
src/ai/backend/common/types.py:1655
- Ensure that the 'override' decorator is imported or defined in the module to avoid a runtime error when overriding the model_dump method.
@override
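For reference, a hedged sketch of the pattern this comment refers to; `override` lives in `typing` from Python 3.12 onward (or in `typing_extensions` for older versions), and the override body here is illustrative only:

```python
from typing import Any, override   # Python >= 3.12; otherwise: from typing_extensions import override

from pydantic import BaseModel

class TensorFlowDistributedEnviron(BaseModel):
    tf_config: str

    @override
    def model_dump(self, **kwargs: Any) -> dict[str, Any]:
        # Example override: emit the env-var name expected by TensorFlow.
        return {"TF_CONFIG": self.tf_config}

print(TensorFlowDistributedEnviron(tf_config="{}").model_dump())
```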
resolves #4243 (BA-1214)
This pull request introduces support for distributed training by adding environment variables for PyTorch and TensorFlow configurations. It also includes a minor import addition to support JSON serialization.
Distributed training support:

- Added new environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) to facilitate distributed training for PyTorch and TensorFlow. These variables are configured based on the cluster and kernel details in the get_image_conf function. (src/ai/backend/manager/registry.py, R1892-R1912)

Checklist: (if applicable)

- Documentation update in the docs directory

📚 Documentation preview 📚: https://sorna--4244.org.readthedocs.build/en/4244/
📚 Documentation preview 📚: https://sorna-ko--4244.org.readthedocs.build/ko/4244/
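For completeness, a sketch of how a training script inside the session would typically consume the injected variables with standard PyTorch distributed APIs; this is user-side usage, not part of the PR itself:

```python
import os
import torch.distributed as dist

# MASTER_ADDR / MASTER_PORT are read implicitly by the "env://" init method;
# rank and world size are passed explicitly since the PR uses WORLD_RANK
# rather than the RANK variable PyTorch would otherwise look for.
dist.init_process_group(
    backend="gloo",   # or "nccl" when CUDA is available
    init_method="env://",
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["WORLD_RANK"]),
)
print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
```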