(monarch_hyperactor) Create python binding for a RemoteAllocator that takes a list of remote channel addresses #170

Closed
wants to merge 1 commit

Conversation

kiukchung (Contributor)

Summary:
To support multi-node actor meshes in OSS without having to write a custom allocator for each scheduler (e.g. `SlurmAllocator`, `KubernetesAllocator`), we take advantage of the infrastructure we already have in TorchX and TorchElastic.

This diff creates a Python binding for `RemoteAllocatorBase` that takes a list of remote-process-allocator server addresses (in channel_addr format, e.g. `metatls!devgpu032.nha1.facebook.com:26600` or `tcp!devgpu032.nha1.facebook.com:26601`) and connects to them.

The internals reuse the existing `RemoteProcessAlloc` with a custom `PyRemoteProcessAllocInitializer` that simply returns a `Vec<RemoteProcessAllocHost>` built from the user-provided list of server addresses.

It is recommended to start the review at `monarch/python/tests/test_allocator.py` to get a sense of what the API/usage looks like.

The next diff will provide a function that, given a job id (more specifically, a monarch server handle of the form `{scheduler}://{namespace}/{job_id}`, e.g. `slurm://default/monarch-kiuk-123`), looks up the list of server addresses and returns an `Allocator` that can be used to create a `ProcMesh` as usual.
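
For a rough sense of the intended usage shape, a minimal sketch follows; the module paths, class names, and call signatures in it are assumptions made for illustration only, not the verified bindings (the real usage is in `monarch/python/tests/test_allocator.py`):

```python
# Hypothetical usage sketch -- module paths, class names, and call signatures
# below are assumptions for illustration, not the verified monarch API.
import asyncio

from monarch._rust_bindings.hyperactor_extension.alloc import AllocSpec  # assumed path
from monarch.allocator import RemoteAllocator  # assumed path
from monarch.proc_mesh import ProcMesh  # assumed path


async def main() -> None:
    # One channel address ("<transport>!<host>:<port>") per remote-process-allocator server.
    server_addrs = [
        "tcp!node0.example.com:26600",
        "tcp!node1.example.com:26600",
    ]

    # Assumed signature: the allocator is handed the channel addresses directly.
    allocator = RemoteAllocator(world_id="demo", addresses=server_addrs)
    alloc = await allocator.allocate(AllocSpec(hosts=len(server_addrs), gpus=8))

    # From here on, a ProcMesh is created as usual.
    mesh = await ProcMesh.from_alloc(alloc)
    print(mesh)


if __name__ == "__main__":
    asyncio.run(main())
```

The only point of the sketch is the shape: a plain list of channel-address strings goes in, and the resulting alloc feeds `ProcMesh` creation as before.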

NOTE: Fixing the type-checking failures is still WIP, so please ignore those for now.

Differential Revision: D75928565

facebook-github-bot added the CLA Signed label on Jun 5, 2025
facebook-github-bot (Contributor)
This pull request was exported from Phabricator. Differential Revision: D75928565

kiukchung added a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

facebook-github-bot pushed a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

facebook-github-bot pushed a commit that referenced this pull request Jun 5, 2025
… takes a list of remote channel addresses (#170)

Summary:

See P1833193535 for what the user-facing UX would look like.

NOTE: It is recommended to start the review at `monarch/python/tests/test_allocator.py` to get a sense of what the API/usage looks like.

NOTE: hyperactor's `ChannelAddr` can be represented as a string such as `tcp!127.0.0.1:26600` or `metatls!devgpu001.pci.facebook.com:26600`, which includes all the information necessary to create a `Channel`. Unfortunately, the current `RemoteAllocator`-related interfaces take a `transport` (`ChannelTransport`), a `port`, and a list of `hostnames`, and apply the same transport and port to all hosts. This isn't ideal (especially for flexibility in deployment and testing), so the Python bindings take a list of channel address strings rather than a list of hostnames.
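
To make the address format concrete, here is a small self-contained helper (illustration only, not part of this diff; the function and type names are made up) that splits a channel-address string into its transport, host, and port, which is what allows each server to use its own transport and port:

```python
# Standalone illustration (not part of this diff): split a hyperactor channel
# address string of the form "<transport>!<host>:<port>" into its parts.
from typing import NamedTuple


class ParsedAddr(NamedTuple):
    transport: str  # e.g. "tcp" or "metatls"
    host: str
    port: int


def parse_channel_addr(addr: str) -> ParsedAddr:
    transport, _, hostport = addr.partition("!")
    host, _, port = hostport.rpartition(":")
    if not transport or not host or not port.isdigit():
        raise ValueError(f"not a channel address: {addr!r}")
    return ParsedAddr(transport, host, int(port))


assert parse_channel_addr("tcp!127.0.0.1:26600") == ParsedAddr("tcp", "127.0.0.1", 26600)
assert parse_channel_addr("metatls!devgpu001.pci.facebook.com:26600").transport == "metatls"
```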

To support multi-node actor meshes in OSS without having to write a custom allocator for each scheduler (e.g. `SlurmAllocator`, `KubernetesAllocator`), we take advantage of the infrastructure we already have in TorchX and TorchElastic.

This diff creates a Python binding for `RemoteAllocatorBase` that takes a list of remote-process-allocator server addresses (in channel_addr format, e.g. `metatls!devgpu032.nha1.facebook.com:26600` or `tcp!devgpu032.nha1.facebook.com:26601`) and connects to them.

The internals reuse the existing `RemoteProcessAlloc` with a custom `PyRemoteProcessAllocInitializer` that simply returns a `Vec<RemoteProcessAllocHost>` built from the user-provided list of server addresses.

## Next Steps
1. [1/2] Add hostnames to the `monarch.tools.mesh_spec.MeshSpec` struct, plus the ability to query hostnames for a running job.
2. [2/2] Make it possible to run two remote process allocators (each on its own port) on MAST.

Differential Revision: D75928565

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

facebook-github-bot pushed a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

kiukchung added a commit that referenced this pull request Jun 10, 2025
… takes a list of remote channel addresses (#170)

Summary:
Pull Request resolved: #170

Reviewed By: technicianted

Differential Revision: D75928565

facebook-github-bot (Contributor)
This pull request has been merged in b844326.

hemildesai mentioned this pull request Jun 12, 2025
Labels: CLA Signed, fb-exported, Merged