-
Notifications
You must be signed in to change notification settings - Fork 4.1k
Add support for SageMaker HyperPod nodes by skipping them in cluster autoscaler #8195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Welcome @YQ-Wang! |
Hi @YQ-Wang. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: YQ-Wang The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@@ -115,6 +115,13 @@ func (aws *awsCloudProvider) NodeGroupForNode(node *apiv1.Node) (cloudprovider.N | |||
klog.Warningf("Node %v has no providerId", node.Name) | |||
return nil, nil | |||
} | |||
|
|||
// Skip SageMaker instances | |||
if strings.Contains(node.Spec.ProviderID, "/sagemaker") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for submitting the PR. I work on SageMaker HyperPod and ignoring HyperPod nodes in cluster autoscaler is correct approach as cluster autoscaler integration is not supported.
Getting a review from folks who have expertise on this codebase would be helpful.
Question - Should this change be similar to fargate? hyperpod nodes have hyperpod-
prefix
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L142-L144
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we can change the logic to checking the hyperpod- prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, PTAL
@@ -115,6 +115,13 @@ func (aws *awsCloudProvider) NodeGroupForNode(node *apiv1.Node) (cloudprovider.N | |||
klog.Warningf("Node %v has no providerId", node.Name) | |||
return nil, nil | |||
} | |||
|
|||
// Skip SageMaker instances | |||
if strings.HasPrefix(node.GetName(), "hyperpod") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of prefix on the node name, can we rely on a custom tag on the node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, the aws hyperpod engineer @surajkota suggests me to check the prefix in order to match the current pattern for fargate so I modified it to the current approach that comparing the prefix.
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L142-L144
#8195 (comment)
@@ -209,6 +221,14 @@ var validAwsRefIdRegex = regexp.MustCompile(fmt.Sprintf(`^aws\:\/\/\/[-0-9a-z]*\ | |||
// AwsRefFromProviderId creates AwsInstanceRef object from provider id which | |||
// must be in format: aws:///zone/name | |||
func AwsRefFromProviderId(id string) (*AwsInstanceRef, error) { | |||
// Special case for SageMaker format: aws:///<region>/sagemaker/... | |||
if strings.HasPrefix(id, "aws:///") && strings.Contains(id, "/sagemaker") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a pattern match here instead of prefix check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -143,6 +150,11 @@ func (aws *awsCloudProvider) HasInstance(node *apiv1.Node) (bool, error) { | |||
return true, cloudprovider.ErrNotImplemented | |||
} | |||
|
|||
// Skip SageMaker instances | |||
if strings.HasPrefix(node.GetName(), "hyperpod") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to understand why do you install CAS on the cluster when the worker nodes naming convention doesn't work properly with CAS scheduler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HyperPod is a sagemaker instance pool that supports working with the existing EKS cluster. While hyperpod currently doesn't work with CAS, the regular EC2 instances requires CAS in the EKS cluster.
What type of PR is this?
What this PR does / why we need it:
/kind bug
Which issue(s) this PR fixes:
Fixes #7540
Special notes for your reviewer:
SageMaker HyperPod provides managed Kubernetes nodes, but these nodes break the cluster autoscaler because:
aws:///region/sagemaker/cluster/instance-id
that doesn't match the standard EC2 format expected byAwsRefFromProviderId()
Following the same pattern used for Fargate nodes, this PR adds support for SageMaker HyperPod nodes by:
/sagemaker
in the provider ID stringDoes this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: