
Include short unregistered nodes in calculation of incorrect node group #5894


Merged
merged 1 commit into kubernetes:master on Jun 29, 2023

Conversation

@BigDarkClown (Contributor) commented Jun 27, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

There are two mechanisms used to fix an incorrect node group size:

  • Unregistered nodes are removed after the default time of 15 minutes.
  • The node group is resized to the number of registered nodes if this number remains smaller than the target size after the default time of 15 minutes.

Both mechanisms use the same time threshold, so race conditions can occur between them, for example in the following situation:

Node group with:

  • 1 registered node
  • 1 unregistered node
  • target size of 2

In this instance it is better to remove the unregistered node than to resize the node group. The issue with the second approach is that there is no guarantee which node will be removed - it can be either the unregistered one (desired) or the registered one (undesired).

To prevent this, the second mechanism takes nodes unregistered for > 15 minutes into account when deciding whether it should resize the node group. It does not, however, take into account nodes unregistered for < 15 minutes, which can result in the second mechanism being triggered unnecessarily.

This change takes all unregistered nodes into account when deciding on the trigger condition for the second mechanism. This should prevent the race condition described above.
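
To make the trigger condition concrete, here is a minimal Go sketch, not the actual ClusterStateRegistry code: the struct and helper names are simplified stand-ins, and it only contrasts the old behaviour, which credited long-unregistered nodes alone, with the behaviour after this change, which credits all unregistered nodes.

```go
package main

import "fmt"

// readiness is a simplified stand-in for the per-node-group readiness
// data tracked by the cluster state registry; only the counts relevant
// to this PR are modelled.
type readiness struct {
	Registered       int
	Unregistered     int // unregistered for less than 15 minutes
	LongUnregistered int // unregistered for more than 15 minutes
}

// incorrectSizeBefore mirrors the old check: only long-unregistered
// nodes offset the target size, so a short-unregistered node makes the
// group look undersized.
func incorrectSizeBefore(target int, r readiness) bool {
	return r.Registered < target-r.LongUnregistered
}

// incorrectSizeAfter mirrors the check after this change: every
// unregistered node is credited before the group is declared undersized.
func incorrectSizeAfter(target int, r readiness) bool {
	return r.Registered < target-r.Unregistered-r.LongUnregistered
}

func main() {
	// The scenario from the description: 1 registered node,
	// 1 short-unregistered node, target size of 2.
	r := readiness{Registered: 1, Unregistered: 1, LongUnregistered: 0}
	fmt.Println(incorrectSizeBefore(2, r)) // true: fixNodeGroupSize could race with unregistered-node removal
	fmt.Println(incorrectSizeAfter(2, r))  // false: the unregistered node is accounted for
}
```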

Does this PR introduce a user-facing change?

Because of this change, some cloud providers may trigger the "fixNodeGroupSize"
logic more frequently in their clusters than before. This can happen if
scale-ups take a long time to create nodes, and it can result in the CA treating the
node groups as incorrectly sized and scaling them down. This can be prevented
by increasing Cluster Autoscaler's max node provisioning time parameter.

You can check if your cluster is experiencing this issue by looking for the following log line:
"Decreasing size of <node group name>, expected=X current=Y delta=Z"

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/cluster-autoscaler labels Jun 27, 2023
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and x13n June 27, 2023 14:36
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 27, 2023
@krzysied (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 27, 2023
@BigDarkClown (Contributor, Author)

/assign @MaciekPytel

@@ -512,7 +512,7 @@ func (csr *ClusterStateRegistry) updateAcceptableRanges(targetSize map[string]in
 	size := targetSize[nodeGroup.Id()]
 	readiness := csr.perNodeGroupReadiness[nodeGroup.Id()]
 	result[nodeGroup.Id()] = AcceptableRange{
-		MinNodes: size - len(readiness.LongUnregistered),
+		MinNodes: size - len(readiness.Unregistered) - len(readiness.LongUnregistered),
Contributor

It makes sense to subtract Unregistered nodes here, given that we have a separate (and arguably better) mechanism for tracking them. I'm wondering though - doesn't that make us double-count scale-ups?
We subtract scaleUpRequest.Increase below and I'd imagine in most cases the unregistered nodes we're subtracting here come from scale-up requests.
Wouldn't it make sense to remove the scaleUpRequest part below? Sure, we will get out of acceptableRange for the window of time between resizing nodeGroup and instances being created, but having a timeout on instances being created seems like the purpose of this entire mechanism.

WDYT?
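
To illustrate the double-counting concern with concrete numbers: scaleUpRequest.Increase is the field referenced in the comment above, and everything else in this small sketch is made up for the example.

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers: a node group of 5 scales up by 3, and the 3
	// new instances have not registered yet.
	target := 5 + 3      // target size after the scale-up
	unregistered := 3    // the freshly created, not-yet-registered nodes
	scaleUpIncrease := 3 // scaleUpRequest.Increase for the same scale-up

	// Subtracting both terms counts the same 3 nodes twice, pushing
	// MinNodes well below what the group could legitimately shrink to.
	minNodes := target - unregistered - scaleUpIncrease
	fmt.Println(minNodes) // 2, even though 5 nodes were already registered
}
```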

Contributor (Author)

I guess it makes sense. The mechanism triggers only if the diff between the target size and the current size is both outside of the acceptable range and constant for 15 minutes, so we should not really be worried about scale-up requests, as the unregistered nodes should eventually register.

The only worry I have is that this will trigger the mechanism more often than it does today. I'm not 100% sure that is desirable for everyone, but I agree it would be more correct behaviour.

Contributor (Author)

I've updated the change to not take scale-up requests into account. I've also removed the scale-down requests, as they are not used anywhere and are effectively dead code.

Contributor (Author)

OK, I did another 180-degree turn.

I've left the scale-up / scale-down logic in calculating the acceptable range. This is used by the IsNodeGroupHealthy method, and I don't think we want to change the behaviour there. I've removed the long unregistered nodes from here, as IMO they should impact the node pool healthiness.

I am, however, including the unregistered nodes when calculating the incorrect node group size. This calculation is used only for correcting the node group size after 15 minutes, so we should be good here.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 28, 2023
@BigDarkClown BigDarkClown force-pushed the fix-cs branch 2 times, most recently from ae0adbf to dc636ec Compare June 28, 2023 15:40
@BigDarkClown BigDarkClown changed the title Include all unregistered nodes in node group acceptable range Include short unregistered nodes in calculation of incorrect node group Jun 28, 2023
@BigDarkClown BigDarkClown force-pushed the fix-cs branch 2 times, most recently from f9238e5 to bf1f123 Compare June 28, 2023 15:46
-	if len(readiness.Registered) > acceptableRange.MaxNodes ||
-		len(readiness.Registered) < acceptableRange.MinNodes {
+	unregisteredNodes := len(readiness.Unregistered) + len(readiness.LongUnregistered)
+	if len(readiness.Registered) > acceptableRange.CurrentTarget ||
Contributor

Why current target and not MaxNodes? It seems like MaxNodes is an obvious fit here: there are no changes to how you calculate it, and I don't see why we shouldn't offset for scale-downs that we're aware of. (TBH I wouldn't implement the whole acceptableRanges logic if we were starting from scratch, but why not use it if it's already there?)

@BigDarkClown (Contributor, Author) commented Jun 29, 2023

The idea is that we don't want the incorrect node group size to depend on ongoing scale-ups and scale-downs; we want to guard only on the expected size. Otherwise we would effectively double our max node provisioning period for this mechanism.

Also, in all fairness - the mechanism for fixing situations when the current size > expected size is not implemented yet. It's possible it will require slightly different edge conditions, but I think it's better to leave figuring those out to whoever implements that mechanism.

Contributor

ack
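
A self-contained Go sketch of how the merged check could look, assuming the truncated half of the condition above compares the registered plus unregistered count against CurrentTarget; the types are trimmed-down stand-ins, so treat this as illustrative rather than as the merged code.

```go
package main

import "fmt"

// Simplified stand-ins for the types referenced in the diff above; the
// real ClusterStateRegistry carries more fields.
type Readiness struct {
	Registered       []string
	Unregistered     []string
	LongUnregistered []string
}

type AcceptableRange struct {
	MinNodes      int
	MaxNodes      int
	CurrentTarget int
}

// hasIncorrectSize sketches the check discussed in this thread: the group
// is considered incorrectly sized when the registered node count, even
// after crediting every unregistered node, still does not reach the
// current target. The undersized comparison is an assumption, since that
// part of the condition is truncated in the diff above.
func hasIncorrectSize(r Readiness, ar AcceptableRange) bool {
	unregisteredNodes := len(r.Unregistered) + len(r.LongUnregistered)
	return len(r.Registered) > ar.CurrentTarget ||
		len(r.Registered)+unregisteredNodes < ar.CurrentTarget
}

func main() {
	// The PR's example: 1 registered node, 1 short-unregistered node,
	// target size 2 - no longer reported as incorrectly sized.
	r := Readiness{Registered: []string{"n1"}, Unregistered: []string{"n2"}}
	fmt.Println(hasIncorrectSize(r, AcceptableRange{CurrentTarget: 2})) // false
}
```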

@MaciekPytel (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BigDarkClown, MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 29, 2023
@k8s-ci-robot k8s-ci-robot merged commit 973f9fd into kubernetes:master Jun 29, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development


4 participants