OCI provider: Avoid interpreting HTTP 404 as success on delete #8201

Status: Open (1 commit, base: master)


@jlamillan jlamillan commented Jun 4, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This change addresses an issue in the OCI (OKE) provider where an HTTP 404 (Error Code: NotAuthorizedOrNotFound) response from the OCI Delete Node API call is misinterpreted and reported as a successful node deletion.

As the error code itself indicates, the 404 is just as likely to mean that the caller lacks the permissions needed to delete the node (hence NotAuthorizedOrNotFound). The provider also adds its own hard taint, ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination, to the already-drained node. Unlike the CA's own ToBeDeletedByClusterAutoscaler taint, this taint is not released when a scale down fails.
As a result, many nodes can be left drained and tainted with no way to recover on their own.

To avoid this situation, we should interpret any HTTP 404 on the Delete Node API call as an error and let the CA orchestration logic decide what to do about the failed scale down.
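
The intended behavior can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `serviceError`, `deleteNodeFunc`, and `deleteNode` are hypothetical names, and the real OCI SDK error type and its accessors may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// serviceError is a hypothetical stand-in for the error an OCI client
// returns; the real SDK type and its accessors may differ.
type serviceError struct {
	StatusCode int
	Code       string
}

func (e *serviceError) Error() string {
	return fmt.Sprintf("HTTP %d: %s", e.StatusCode, e.Code)
}

// deleteNodeFunc is a hypothetical hook standing in for the Delete Node API call.
type deleteNodeFunc func(nodeID string) error

// deleteNode returns every API failure to the caller, including HTTP 404,
// so that the scale-down logic sees the failure instead of a false success.
func deleteNode(call deleteNodeFunc, nodeID string) error {
	if err := call(nodeID); err != nil {
		var svcErr *serviceError
		if errors.As(err, &svcErr) && svcErr.StatusCode == http.StatusNotFound {
			// A 404 may mean "not authorized", so it is not treated as "already gone".
			return fmt.Errorf("delete node %s: not found or not authorized: %w", nodeID, err)
		}
		return fmt.Errorf("delete node %s: %w", nodeID, err)
	}
	return nil
}

func main() {
	notFound := func(string) error {
		return &serviceError{StatusCode: http.StatusNotFound, Code: "NotAuthorizedOrNotFound"}
	}
	fmt.Println(deleteNode(notFound, "node-a"))
}
```

Wrapping with `%w` keeps the original error available to callers that want to inspect it with `errors.As`.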

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

n/a

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

n/a

@k8s-ci-robot added labels on Jun 4, 2025: do-not-merge/work-in-progress, kind/bug, cncf-cla: yes, area/cluster-autoscaler, area/provider/oci
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlamillan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added labels on Jun 4, 2025: approved, size/XS
@jlamillan jlamillan marked this pull request as draft June 4, 2025 00:57
@trungng92
Contributor

I verified this Node Delete API behavior:

  • Deleting a running instance: returns a 200
  • Deleting an instance on which Node Delete has already been called: returns a 409
  • Deleting an instance that has already been deleted (instance is terminated): returns a 409
  • Deleting an instance (OCID) that doesn't exist: returns a 404

Given that the compute instance entry for a terminated instance is only cleaned up after some time, it's reasonable to say that a 404 is highly unlikely to come from "deleting an instance that doesn't exist" and is more likely caused by an authorization problem.
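
For reference, the observed behavior above can be written down as a small lookup. This is only a sketch of the observations in this comment: `deleteOutcome` and `classifyDeleteStatus` are hypothetical names, and how the provider should react to each outcome is left to the caller.

```go
package main

import "fmt"

// deleteOutcome is a hypothetical label for what a Delete Node status code
// was observed to mean; it is not a type from the provider.
type deleteOutcome string

const (
	outcomeAccepted   deleteOutcome = "delete accepted"
	outcomeInProgress deleteOutcome = "already deleting or terminated"
	outcomeSuspect    deleteOutcome = "not found or not authorized"
	outcomeOther      deleteOutcome = "other failure"
)

// classifyDeleteStatus maps an HTTP status code to the observed meaning.
func classifyDeleteStatus(status int) deleteOutcome {
	switch status {
	case 200:
		return outcomeAccepted
	case 409:
		return outcomeInProgress
	case 404:
		// Terminated instances still return 409 for a while, so a 404 more
		// likely points at an authorization problem than at a missing node.
		return outcomeSuspect
	default:
		return outcomeOther
	}
}

func main() {
	for _, status := range []int{200, 409, 404, 500} {
		fmt.Printf("%d -> %s\n", status, classifyDeleteStatus(status))
	}
}
```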

Change looks good to me. 👍

> Worse, before returning, the provider also adds its own ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination taint to the drained node, which is never removed from the node unlike the CA’s own ToBeDeletedByClusterAutoscaler taint.

As a side note for this, the reason this exists is because of a very odd edge case discovered a while back where:

  1. make a call to delete the node
  2. node gets tainted as ToBeDeleted
  3. if the node isn’t deleted 10 minutes later, CA will retry making a call to delete the node.
  4. if this second call fails, then it will actually remove the ToBeDeleted taint

Which can result in pods scheduling onto a node that OKE has marked for deletion. So we decided to keep the taint on there to prevent any future pods from going on there to prevent an active disruption to a workload.

@jlamillan changed the title from "WIP: OCI provider: Avoid interpreting HTTP 404 as success on delete" to "OCI provider: Avoid interpreting HTTP 404 as success on delete" on Jun 5, 2025
@jlamillan jlamillan marked this pull request as ready for review June 5, 2025 16:45
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Jun 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from vadasambar June 5, 2025 16:45
@jlamillan
Contributor Author

> As a side note for this, the reason this exists is because of a very odd edge case discovered a while back where:
>
> 1. make a call to delete the node
> 2. node gets tainted as ToBeDeleted
> 3. if the node isn’t deleted 10 minutes later, CA will retry making a call to delete the node.
> 4. if this second call fails, then it will actually remove the ToBeDeleted taint
>
> Which can result in pods scheduling onto a node that OKE has marked for deletion. So we decided to keep the taint on there to prevent any future pods from going on there to prevent an active disruption to a workload.

Thanks for the explanation.

Ideally, we'd have some mechanism to remove the ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination taint after some timeout (similar to ToBeDeletedByClusterAutoscaler). Otherwise nodes that failed to be deleted will be left hanging around with the taint.

Of course, this PR should reduce the chances of that occurring.
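
A timeout-based cleanup along those lines might look roughly like the sketch below. Nothing like this exists in the PR: the `taint` struct is a simplified, hypothetical stand-in for a Kubernetes node taint, `dropExpiredTerminationTaint` is an invented helper, and the 10-minute timeout in `main` is only an example value.

```go
package main

import (
	"fmt"
	"time"
)

// Taint key used by the OCI provider, as quoted in this thread.
const impendingTerminationTaint = "ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination"

// taint is a simplified, hypothetical stand-in for a Kubernetes node taint.
type taint struct {
	Key       string
	TimeAdded time.Time
}

// dropExpiredTerminationTaint returns the taints that should remain on a node:
// everything except an impending-termination taint older than the timeout.
func dropExpiredTerminationTaint(taints []taint, now time.Time, timeout time.Duration) []taint {
	kept := make([]taint, 0, len(taints))
	for _, t := range taints {
		if t.Key == impendingTerminationTaint && now.Sub(t.TimeAdded) >= timeout {
			continue // the delete never completed; release the node
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	now := time.Now()
	taints := []taint{
		{Key: impendingTerminationTaint, TimeAdded: now.Add(-20 * time.Minute)},
		{Key: "example.com/other", TimeAdded: now.Add(-20 * time.Minute)},
	}
	remaining := dropExpiredTerminationTaint(taints, now, 10*time.Minute)
	fmt.Println(len(remaining), "taint(s) remain")
}
```

Any real implementation would also need to account for the edge case described above, where removing a taint too early lets pods schedule onto a node that OKE is still terminating.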

@trungng92
Contributor

#8183

This may be a relevant feature. I haven't looked into the specifics of this, but if this goes in, then we can probably remove our taint altogether.

Labels: approved, area/cluster-autoscaler, area/provider/oci, cncf-cla: yes, kind/bug, size/XS
3 participants