Include short unregistered nodes in calculation of incorrect node group #5894
Conversation
/lgtm
/assign @MaciekPytel
@@ -512,7 +512,7 @@ func (csr *ClusterStateRegistry) updateAcceptableRanges(targetSize map[string]in
   size := targetSize[nodeGroup.Id()]
   readiness := csr.perNodeGroupReadiness[nodeGroup.Id()]
   result[nodeGroup.Id()] = AcceptableRange{
-    MinNodes: size - len(readiness.LongUnregistered),
+    MinNodes: size - len(readiness.Unregistered) - len(readiness.LongUnregistered),
It makes sense to subtract Unregistered nodes here, given that we have a separate (and arguably better) mechanism for tracking them. I'm wondering though - doesn't that make us double-count scale-ups?
We subtract scaleUpRequest.Increase below and I'd imagine in most cases the unregistered nodes we're subtracting here come from scale-up requests.
Wouldn't it make sense to remove the scaleUpRequest part below? Sure, we will get out of acceptableRange for the window of time between resizing nodeGroup and instances being created, but having a timeout on instances being created seems like the purpose of this entire mechanism.
WDYT?
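To make the double-counting concern concrete, here is a minimal sketch with made-up numbers; the variable names and values are assumptions based on the diff above, not the actual code:

```go
package main

import "fmt"

func main() {
	target := 5
	registered := 3      // nodes already registered
	unregistered := 2    // instances created by a pending scale-up of +2, not yet registered
	scaleUpIncrease := 2 // the in-flight scale-up request

	// MinNodes as proposed in the diff: subtract all unregistered nodes.
	minNodes := target - unregistered

	// If in-flight scale-ups are subtracted as well, the same two missing nodes
	// lower the bound twice: 5 - 2 - 2 = 1 instead of 3.
	minNodesWithScaleUps := minNodes - scaleUpIncrease

	fmt.Println(registered, minNodes, minNodesWithScaleUps) // 3 3 1
}
```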
I guess it makes sense. The mechanism triggers only if the diff between the target size and the current size is both outside of the acceptable range and constant for 15 minutes, so we should not really be worried about scale-up requests, as the unregistered nodes should come up and become registered.
The only worry I have is that this will make the mechanism trigger more often than it does today. I'm not 100% sure that is desirable for everyone, but I agree that it would be a more correct behaviour.
I've updated the change to not take scale-up requests into account. I've also removed the scale-down requests, as they are not used anywhere and are effectively dead code.
OK, I did another 180-degree turn.
I've left the scale-up / scale-down logic in the acceptable range calculation. It is used by the IsNodeGroupHealthy method, and I don't think we want to change the behaviour there. I've removed the long unregistered nodes from here, as IMO they should impact the node pool healthiness.
I am, however, including the unregistered nodes when calculating the incorrect node group size. That is used only for correcting the node group size after 15 minutes, so we should be good here.
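To illustrate the resulting split, here is a rough sketch with simplified, assumed names; it is not the actual ClusterStateRegistry code:

```go
package clusterstate // hypothetical package name, for the sketch only

// Health check: the registered count is still compared against an acceptable
// range that is offset for in-flight scale-ups/scale-downs, so the
// IsNodeGroupHealthy behaviour stays as it was.
func withinAcceptableRange(registered, minNodes, maxNodes int) bool {
	return registered >= minNodes && registered <= maxNodes
}

// Incorrect-size check: compares against the plain target size and counts
// every unregistered node (short or long) as present, so it only fires when
// nodes are genuinely missing rather than still being provisioned.
func hasIncorrectSize(registered, unregistered, longUnregistered, target int) bool {
	provisioned := registered + unregistered + longUnregistered
	return registered > target || provisioned < target
}
```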
-  if len(readiness.Registered) > acceptableRange.MaxNodes ||
-    len(readiness.Registered) < acceptableRange.MinNodes {
+  unregisteredNodes := len(readiness.Unregistered) + len(readiness.LongUnregistered)
+  if len(readiness.Registered) > acceptableRange.CurrentTarget ||
Why CurrentTarget and not MaxNodes? It seems like MaxNodes is an obvious fit here: there are no changes to how you calculate it, and I don't see why we shouldn't offset for scale-downs that we're aware of (tbh I wouldn't implement the whole acceptableRanges logic if we were starting from scratch, but why not use it if it's already there?).
The idea is that we don't want the incorrect node group size to depend on ongoing scale-ups and scale-downs; we want to guard only on the expected size. Otherwise we would effectively double our max node provisioning period for this mechanism.
Also, in all fairness, the mechanism for fixing situations where the current size > the expected size is not implemented yet. It's possible that it will require slightly different edge conditions, but I think it's better to leave figuring them out to the person implementing that mechanism.
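For reference, a sketch of the fields being discussed; the struct shape is assumed and the field semantics are paraphrased from this thread, not copied from the real definition:

```go
package clusterstate // hypothetical package name, for the sketch only

// AcceptableRange as discussed above (assumed shape).
type AcceptableRange struct {
	// MinNodes: the target size minus nodes we expect to be missing,
	// e.g. because of in-flight scale-ups.
	MinNodes int
	// MaxNodes: the target size plus nodes we expect to linger,
	// e.g. because of in-flight scale-downs.
	MaxNodes int
	// CurrentTarget: the raw target size, unaffected by in-flight operations.
	// Comparing against it ties the incorrect-size check to the expected size
	// only, instead of extending the window by in-flight operation timeouts.
	CurrentTarget int
}
```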
ack
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: BigDarkClown, MaciekPytel
The full list of commands accepted by this bot can be found here. The pull request process is described here.
What type of PR is this?
/kind bug
What this PR does / why we need it:
There are 2 mechanisms used to fix an incorrect node group size: removing nodes that have been unregistered for too long, and resizing a node group whose size has stayed incorrect for too long.
Both these mechanisms use the same time threshold, so race conditions can occur between them, for example in the following situation:
A node group with registered nodes and an unregistered node, where the number of registered nodes is below the target size.
In this instance it is better to remove the unregistered node rather than resizing the node group. The issue with the second approach is that we don't have a guarantee which node will be removed - it can be either the unregistered one (desired) or the registered one (undesired).
To prevent this, the second mechanism takes nodes unregistered for > 15 minutes into account when deciding whether it should resize the node group. It does not, however, take into account nodes unregistered for < 15 minutes, which can result in the second mechanism being triggered anyway.
This change takes all the unregistered nodes into account when deciding on the trigger condition for the second mechanism. This should prevent the race condition I've described here from happening.
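To make the trigger condition concrete, here is a small sketch with made-up numbers (a hypothetical scenario with assumed values, not the actual implementation): a node group with target size 2, one registered node, and one node that has been unregistered for about 14 minutes.

```go
package main

import "fmt"

func main() {
	target := 2
	registered := 1
	shortUnregistered := 1 // unregistered for ~14 min, not yet "long" unregistered
	longUnregistered := 0

	// Previous condition: only long-unregistered nodes counted towards the group
	// size, so the group looks undersized and the resize mechanism can race with
	// the removal of the unregistered node.
	oldIncorrect := registered+longUnregistered < target

	// Condition in this PR: short-unregistered nodes are counted too, so the
	// group is not flagged while the node still has a chance to register (or to
	// be cleaned up by the long-unregistered mechanism).
	newIncorrect := registered+shortUnregistered+longUnregistered < target

	fmt.Println(oldIncorrect, newIncorrect) // true false
}
```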
Does this PR introduce a user-facing change?