OCI provider: Avoid interpreting HTTP 404 as success on delete #8201
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jlamillan
I verified this Node Delete API behavior:
Given that the compute instance entry for the terminated instance gets cleaned up after some time, it's reasonable to say that a 404 is highly unlikely due to "deleting an instance that doesn't exist" and is more likely due to an authorization reason. Change looks good to me. 👍
As a side note, the reason this taint exists is a very odd edge case discovered a while back where:
Which can result in pods scheduling onto a node that OKE has marked for deletion. So we decided to keep the taint in place, to stop any further pods from landing on that node and to avoid disrupting an active workload.
Thanks for the explanation. Ideally, we'd have some mechanism to remove the Of course, this MR should reduce the chances of that occurring.
This may be a relevant feature. I haven't looked into the specifics, but if it goes in, we can probably remove our taint altogether.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This change addresses an issue in the OCI (OKE) provider where an `HTTP 404 Error Code: NotAuthorizedOrNotFound` response from the OCI Delete Node API call is misinterpreted and reported as a successful node deletion.
As the OCI provider error message above indicates, the `404` is just as likely to be caused by the user lacking the necessary permissions to delete the node (i.e. `NotAuthorizedOrNotFound`). The provider also adds its own hard taint `ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination` to the node (which is already drained). Unlike the CA's own `ToBeDeletedByClusterAutoscaler` taint, this taint is not released when a scale down fails.
The result is that many nodes may inadvertently be left drained and tainted, with no way to recover on their own.
To avoid this situation, we should interpret any `HTTP 404` on the Delete Node API call as an error and let the CA orchestration logic decide what to do about the failed scale down.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: