(monarch_hyperactor) Create python binding for a RemoteAllocator that takes a list of remote channel addresses #170

Closed
wants to merge 1 commit

Conversation

kiukchung (Contributor)

Summary:
To support multi-node actor meshes in OSS without having to write a custom allocator for each scheduler (e.g. `SlurmAllocator`, `KubernetesAllocator`), we take advantage of the infrastructure we already have in TorchX and TorchElastic.

This diff creates a Python binding for `RemoteAllocatorBase` that takes a list of remote-process-allocator server addresses (in channel_addr format, e.g. `metatls!devgpu032.nha1.facebook.com:26600` or `tcp!devgpu032.nha1.facebook.com:26601`) and connects to them.

The internals reuse the existing `RemoteProcessAlloc` with a custom `PyRemoteProcessAllocInitializer` that simply returns a `Vec<RemoteProcessAllocHost>` built from the user-provided list of server addresses.

It is recommended to start the review at `monarch/python/tests/test_allocator.py` to get a sense of what the API/usage looks like.

The next diff will provide a function that, given a job id (more specifically, a monarch server handle of the form `{scheduler}://{namespace}/{job_id}`, e.g. `slurm://default/monarch-kiuk-123`), looks up the list of server addresses and returns an `Allocator` that can be used to create a `ProcMesh` as usual.
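
For a rough sense of the intended usage shape, a minimal sketch follows; the module paths, class names, and call signatures in it are assumptions made for illustration only, not the verified bindings (the real usage is in `monarch/python/tests/test_allocator.py`):

```python
# Hypothetical usage sketch -- module paths, class names, and call signatures
# below are assumptions for illustration, not the verified monarch API.
import asyncio

from monarch._rust_bindings.hyperactor_extension.alloc import AllocSpec  # assumed path
from monarch.allocator import RemoteAllocator  # assumed path
from monarch.proc_mesh import ProcMesh  # assumed path


async def main() -> None:
    # One channel address ("<transport>!<host>:<port>") per remote-process-allocator server.
    server_addrs = [
        "tcp!node0.example.com:26600",
        "tcp!node1.example.com:26600",
    ]

    # Assumed signature: the allocator is handed the channel addresses directly.
    allocator = RemoteAllocator(world_id="demo", addresses=server_addrs)
    alloc = await allocator.allocate(AllocSpec(hosts=len(server_addrs), gpus=8))

    # From here on, a ProcMesh is created as usual.
    mesh = await ProcMesh.from_alloc(alloc)
    print(mesh)


if __name__ == "__main__":
    asyncio.run(main())
```

The only point of the sketch is the shape: a plain list of channel-address strings goes in, and the resulting alloc feeds `ProcMesh` creation as before.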

NOTE: Fixing the type-checking failures is still WIP, so please ignore those for now.

Differential Revision: D75928565

facebook-github-bot added the CLA Signed label on Jun 5, 2025
facebook-github-bot (Contributor)
This pull request was exported from Phabricator. Differential Revision: D75928565

kiukchung added a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

facebook-github-bot pushed a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

facebook-github-bot pushed a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

Summary:

See P1833193535 for what the user-facing UX would look like.

NOTE: It is recommended to start the review at `monarch/python/tests/test_allocator.py` to get a sense of what the API/usage looks like.

NOTE: hyperactor's `ChannelAddr` can be represented as a string such as `tcp!127.0.0.1:26600` or `metatls!devgpu001.pci.facebook.com:26600`, which includes all the information necessary to create a `Channel`. Unfortunately, the current `RemoteAllocator`-related interfaces take a `transport` (`ChannelTransport`), a `port`, and a list of `hostnames`, and apply the same transport and port to all hosts. This isn't ideal (especially for flexibility in deployment and testing), so the Python bindings take a list of channel address strings rather than a list of hostnames.
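
To make the address format concrete, here is a small self-contained helper (illustration only, not part of this diff; the function and type names are made up) that splits a channel-address string into its transport, host, and port, which is what allows each server to use its own transport and port:

```python
# Standalone illustration (not part of this diff): split a hyperactor channel
# address string of the form "<transport>!<host>:<port>" into its parts.
from typing import NamedTuple


class ParsedAddr(NamedTuple):
    transport: str  # e.g. "tcp" or "metatls"
    host: str
    port: int


def parse_channel_addr(addr: str) -> ParsedAddr:
    transport, _, hostport = addr.partition("!")
    host, _, port = hostport.rpartition(":")
    if not transport or not host or not port.isdigit():
        raise ValueError(f"not a channel address: {addr!r}")
    return ParsedAddr(transport, host, int(port))


assert parse_channel_addr("tcp!127.0.0.1:26600") == ParsedAddr("tcp", "127.0.0.1", 26600)
assert parse_channel_addr("metatls!devgpu001.pci.facebook.com:26600").transport == "metatls"
```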

To support multi-node actor meshes in OSS without having to write a custom allocator for each scheduler (e.g. `SlurmAllocator`, `KubernetesAllocator`), we take advantage of the infrastructure we already have in TorchX and TorchElastic.

This diff creates a Python binding for `RemoteAllocatorBase` that takes a list of remote-process-allocator server addresses (in channel_addr format, e.g. `metatls!devgpu032.nha1.facebook.com:26600` or `tcp!devgpu032.nha1.facebook.com:26601`) and connects to them.

The internals reuse the existing `RemoteProcessAlloc` with a custom `PyRemoteProcessAllocInitializer` that simply returns a `Vec<RemoteProcessAllocHost>` built from the user-provided list of server addresses.

## Next Steps
1. [1/2] Add hostnames to the `monarch.tools.mesh_spec.MeshSpec` struct, plus the ability to query hostnames for a running job.
2. [2/2] Make it possible to run two remote process allocators (each on its own port) on MAST.

Differential Revision: D75928565

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

facebook-github-bot pushed a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

Summary:
Pull Request resolved: #170

Reviewed By: technicianted

Differential Revision: D75928565

facebook-github-bot (Contributor)
This pull request has been merged in b844326.

hemildesai mentioned this pull request Jun 12, 2025
Labels: CLA Signed, fb-exported, Merged