OCI provider: Avoid interpreting HTTP 404 as success on delete #8201

jlamillan · 2025-06-04T00:57:02Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This change addresses an issue in the OCI (OKE) provider where an HTTP 404 (Error Code: NotAuthorizedOrNotFound) response from the OCI Delete Node API call is misinterpreted and reported as a successful node deletion.

As the OCI error code above indicates, the 404 is just as likely to be caused by the user lacking the necessary permissions to delete the node (i.e. NotAuthorizedOrNotFound). The provider also adds its own hard taint, ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination, to the already drained node. Unlike the CA's own ToBeDeletedByClusterAutoscaler taint, this taint is not released when a scale down fails.
The result is that many nodes may inadvertently be left drained and tainted, with no way for the cluster to recover them on its own.

To avoid this situation, we should interpret any HTTP 404 on the Delete Node API call as an error and let the CA orchestration logic decide what to do about the failed scale down.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

n/a

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

n/a


k8s-ci-robot · 2025-06-04T00:57:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlamillan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/oci/OWNERS~~ [jlamillan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

trungng92 · 2025-06-05T00:24:20Z

I verified this Node Delete API behavior:

- Deleting a running instance: returns a 200
- Deleting an instance on which Node Delete has already been called: returns a 409
- Deleting an instance that has already been deleted (instance is terminated): returns a 409
- Deleting an instance (OCID) that doesn't exist: returns a 404

Given that the compute instance entry for a terminated instance only gets cleaned up after some time, it's reasonable to say that a 404 is highly unlikely to be caused by "deleting an instance that doesn't exist" and is more likely caused by an authorization problem.
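For reference, the observations above can be written down as a small lookup. This is only a summary sketch of what was observed in this comment, not the provider's logic; `describeDeleteStatus` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/http"
)

// describeDeleteStatus summarizes the Delete Node status codes observed above.
// It is a hypothetical helper and does not call any OCI API.
func describeDeleteStatus(status int) string {
	switch status {
	case http.StatusOK:
		return "delete accepted for a running instance"
	case http.StatusConflict:
		return "delete already requested, or instance already terminated"
	case http.StatusNotFound:
		return "instance OCID unknown, or caller not authorized"
	default:
		return "not covered by the observations above"
	}
}

func main() {
	for _, status := range []int{200, 409, 404} {
		fmt.Printf("%d: %s\n", status, describeDeleteStatus(status))
	}
}
```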

Change looks good to me. 👍

> Worse, before returning, the provider also adds its own ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination taint to the drained node, which is never removed from the node unlike the CA’s own ToBeDeletedByClusterAutoscaler taint.

As a side note, the reason this taint exists is a very odd edge case discovered a while back:

1. CA makes a call to delete the node.
2. The node gets tainted as ToBeDeleted.
3. If the node isn’t deleted 10 minutes later, CA retries the call to delete the node.
4. If this second call fails, CA actually removes the ToBeDeleted taint.

This can result in pods being scheduled onto a node that OKE has marked for deletion. So we decided to keep our taint on the node so that no future pods land there, which prevents an active disruption to a workload.

jlamillan · 2025-06-05T17:28:20Z

> As a side note for this, the reason this exists is because of a very odd edge case discovered a while back where:
>
> 1. make a call to delete the node
> 2. node gets tainted as ToBeDeleted
> 3. if the node isn’t deleted 10 minutes later, CA will retry making a call to delete the node.
> 4. if this second call fails, then it will actually remove the ToBeDeleted taint
>
> Which can result in pods scheduling onto a node that OKE has marked for deletion. So we decided to keep the taint on there to prevent any future pods from going on there to prevent an active disruption to a workload.

Thanks for the explanation.

Ideally, we'd have some mechanism to remove the ignore-taint.cluster-autoscaler.kubernetes.io/oke-impending-node-termination taint after some timeout (similar to ToBeDeletedByClusterAutoscaler). Otherwise nodes that failed to be deleted will be left hanging around with the taint.

Of course, this MR should reduce the chances of that occurring.

trungng92 · 2025-06-05T22:11:45Z

#8183

This may be a relevant feature. I haven't looked into the specifics of this, but if this goes in, then we can probably remove our taint altogether.

OCI provider: Avoid interpreting HTTP 404 as success on delete node c…

af73650

…all.

k8s-ci-robot requested review from aleksandra-malinowska and x13n June 4, 2025 00:57

k8s-ci-robot added the approved and size/XS labels Jun 4, 2025

jlamillan marked this pull request as draft June 4, 2025 00:57

jlamillan changed the title ~~WIP: OCI provider: Avoid interpreting HTTP 404 as success on delete~~ OCI provider: Avoid interpreting HTTP 404 as success on delete Jun 5, 2025

jlamillan marked this pull request as ready for review June 5, 2025 16:45

k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 5, 2025

k8s-ci-robot requested a review from vadasambar June 5, 2025 16:45
