(monarch_hyperactor) Create python binding for a RemoteAllocator that takes a list of remote channel addresses #170
Conversation
This pull request was exported from Phabricator. Differential Revision: D75928565

Force-pushed: c215531 → 2422e42 → 4e0e34a → 4be469b → e1d1038 → a3ab415 → b2af73d → 85b0961 → e90c354 → 53d82a8 → e5515fb → cc962c4 → e0a1354 → a456f48 → 7052fa5 → de51bc8

Final commit message:

… takes a list of remote channel addresses (#170)

Summary:
Pull Request resolved: #170

See P1833193535 for how the user-facing UX would look.

NOTE: It is recommended to start the review at `monarch/python/tests/test_allocator.py` to get a sense of what the API and usage look like.

NOTE: hyperactor's `ChannelAddr` can be represented in string form, e.g. `tcp!127.0.0.1:26600` or `metatls!devgpu001.pci.facebook.com:26600`, which includes all the information needed to create a `Channel`. Unfortunately, the current `RemoteAllocator`-related interfaces take a `transport` (`ChannelTransport`), a `port`, and a list of `hostnames`, and apply the same transport and port to every host. This is not ideal, especially for flexibility in deployment and testing, so the Python bindings take a list of channel address strings rather than a list of hostnames.

To support multi-node actor meshes in OSS without having to write a custom allocator for each scheduler (e.g. `SlurmAllocator`, `KubernetesAllocator`), we take advantage of the infrastructure we already have in TorchX and TorchElastic. This diff creates Python bindings for `RemoteAllocatorBase` that take a list of remote-process-allocator server addresses (in channel_addr format, e.g. `metatls!devgpu032.nha1.facebook.com:26600` or `tcp!devgpu032.nha1.facebook.com:26601`) and connect to them. The internals reuse the existing `RemoteProcessAlloc` with a custom `PyRemoteProcessAllocInitializer` that simply returns a `Vec<RemoteProcessAllocHost>` built from the user-provided list of server addresses.

## Next Steps:
1. [1/2] Add hostnames to the `monarch.tools.mesh_spec.MeshSpec` struct, plus the ability to query hostnames for a running job.
2. [2/2] Make it possible to run 2x remote process allocators (each on its own port) on MAST.

Reviewed By: technicianted
Differential Revision: D75928565
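As a rough usage sketch of the binding described above: the class names, module paths, and constructor signatures below are assumptions for illustration only, not the API introduced in this PR; `monarch/python/tests/test_allocator.py` shows the actual usage. The point is the shape of the call: a list of channel-address strings, one per remote-process-allocator server, turned into an allocation and then a `ProcMesh`.

```python
import asyncio

# Hypothetical imports -- the exact Python-facing classes and module paths are
# assumptions; the PR's real entry points are the RemoteAllocatorBase bindings
# exercised in monarch/python/tests/test_allocator.py.
from monarch.allocator import RemoteAllocator, StaticRemoteAllocInitializer  # assumed
from monarch.proc_mesh import ProcMesh  # assumed
from monarch.alloc import AllocSpec, AllocConstraints  # assumed


async def main() -> None:
    # One channel address per remote-process-allocator server. Because full
    # channel addresses are used (not bare hostnames), each server may use its
    # own transport and port.
    addresses = [
        "tcp!devgpu032.nha1.facebook.com:26601",
        "metatls!devgpu032.nha1.facebook.com:26600",
    ]
    allocator = RemoteAllocator(
        world_id="demo-world",
        initializer=StaticRemoteAllocInitializer(*addresses),
    )
    alloc = await allocator.allocate(
        AllocSpec(AllocConstraints(), hosts=len(addresses), gpus=8)
    )
    proc_mesh = await ProcMesh.from_alloc(alloc)
    # ... spawn actor meshes on proc_mesh as usual ...


if __name__ == "__main__":
    asyncio.run(main())
```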
This pull request has been merged in b844326.
Summary:
To support multi-node actor meshes in OSS without having to write a custom allocator for each scheduler (e.g. `SlurmAllocator`, `KubernetesAllocator`), we take advantage of the infrastructure we already have in TorchX and TorchElastic.

This diff creates Python bindings for `RemoteAllocatorBase` that take a list of remote-process-allocator server addresses (in channel_addr format, e.g. `metatls!devgpu032.nha1.facebook.com:26600` or `tcp!devgpu032.nha1.facebook.com:26601`) and connect to them.

The internals reuse the existing `RemoteProcessAlloc` with a custom `PyRemoteProcessAllocInitializer` that simply returns a `Vec<RemoteProcessAllocHost>` built from the user-provided list of server addresses.

It is recommended to start the review at `monarch/python/tests/test_allocator.py` to get a sense of what the API and usage look like.

The next diff will provide a function that, given a job id (more specifically, a monarch server handle of the form `{scheduler}://{namespace}/{job_id}`, e.g. `slurm://default/monarch-kiuk-123`), looks up the list of server addresses and returns an Allocator that can be used to create a `ProcMesh` as usual.

NOTE: WIP fixing type-checking failures, so ignore those for now.
Differential Revision: D75928565
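For illustration of the server-handle format mentioned above (the actual lookup helper lands in the next diff; the function below is a hypothetical stand-in), a `{scheduler}://{namespace}/{job_id}` handle splits cleanly with the standard library:

```python
from urllib.parse import urlparse


def parse_server_handle(handle: str) -> tuple[str, str, str]:
    """Split a monarch server handle such as 'slurm://default/monarch-kiuk-123'
    into (scheduler, namespace, job_id). Hypothetical helper for illustration;
    the real address-lookup function is introduced in the follow-up diff."""
    parsed = urlparse(handle)
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


print(parse_server_handle("slurm://default/monarch-kiuk-123"))
# -> ('slurm', 'default', 'monarch-kiuk-123')
```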