
NodeRepair respects PDBs / terminationGracePeriodSeconds when nodepool does not have terminationGracePeriod set #2042

Open
andrewhibbert opened this issue Feb 28, 2025 · 1 comment
Labels: kind/bug, needs-priority, needs-triage

Comments

@andrewhibbert

Description

Observed Behavior:

I'm not sure whether this is a bug or not. NodeRepair works fine when terminationGracePeriod is set in the NodePool; however, when it isn't set, NodeRepair seems to respect PDBs and pod terminationGracePeriodSeconds, which I found strange:

After stopping kubelet on a node, with a PDB blocking eviction:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    karpenter.sh/disrupted:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-138-100-239.eu-west-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Fri, 28 Feb 2025 14:55:09 +0000
Conditions:
  Type                    Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----                    ------    -----------------                 ------------------                ------                    -------
  MemoryPressure          Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure            Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure             Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  Ready                   Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  ContainerRuntimeReady   True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   ContainerRuntimeIsReady   Monitoring for the ContainerRuntime system is active
  StorageReady            True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   DiskIsReady               Monitoring for the Disk system is active
  NetworkingReady         True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   NetworkingIsReady         Monitoring for the Networking system is active
  KernelReady             True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   KernelIsReady             Monitoring for the Kernel system is active

Events:
  Type     Reason                 Age                From                       Message
  ----     ------                 ----               ----                       -------
  Warning  ContainerRuntimeReady  57m                eks-node-monitoring-agent  Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 61 -> 76)
  Normal   NodeNotReady           54m                node-controller            Node ip-10-138-100-239.eu-west-1.compute.internal status is now: NodeNotReady
  Normal   MemoryPressure         54m                karpenter                  Status condition transitioned, Type: MemoryPressure, Status: False -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   DiskPressure           54m                karpenter                  Status condition transitioned, Type: DiskPressure, Status: False -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   PIDPressure            54m                karpenter                  Status condition transitioned, Type: PIDPressure, Status: False -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   Ready                  54m                karpenter                  Status condition transitioned, Type: Ready, Status: True -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   DisruptionBlocked      54m                karpenter                  Pdb "kube-system/ebs-csi-controller" prevents pod evictions
  Warning  ContainerRuntimeReady  52m                eks-node-monitoring-agent  Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 76 -> 91)
  Normal   DisruptionBlocked      50m (x5 over 67m)  karpenter                  Pdb "kyverno/kyverno-reports-controller" prevents pod evictions
  Warning  ContainerRuntimeReady  47m                eks-node-monitoring-agent  Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 91 -> 105)
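
For context, the DisruptionBlocked events above are produced because evicting the remaining pods would violate PodDisruptionBudgets such as kube-system/ebs-csi-controller. A minimal sketch of that kind of PDB (the selector and minAvailable value are illustrative, not copied from the cluster):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ebs-csi-controller      # name taken from the event above; spec below is illustrative
  namespace: kube-system
spec:
  # While an eviction would drop the number of ready matching pods below this
  # threshold, the eviction API rejects it and Karpenter keeps reporting
  # DisruptionBlocked instead of completing the drain.
  minAvailable: 1
  selector:
    matchLabels:
      app: ebs-csi-controller   # assumed label; the real selector may differ
```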

With pod terminationGracePeriodSeconds blocking eviction:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    karpenter.sh/disrupted:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-138-100-239.eu-west-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Fri, 28 Feb 2025 14:55:09 +0000
Conditions:
  Type                    Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----                    ------    -----------------                 ------------------                ------                    -------
  MemoryPressure          Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure            Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure             Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  Ready                   Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  ContainerRuntimeReady   True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   ContainerRuntimeIsReady   Monitoring for the ContainerRuntime system is active
  StorageReady            True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   DiskIsReady               Monitoring for the Disk system is active
  NetworkingReady         True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   NetworkingIsReady         Monitoring for the Networking system is active
  KernelReady             True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   KernelIsReady             Monitoring for the Kernel system is active

Events:
  Type     Reason                          Age                   From                       Message
  ----     ------                          ----                  ----                       -------
  Warning  FailedDraining                  19m (x14 over 45m)    karpenter                  Failed to drain node, 1 pods are waiting to be evicted
  Warning  ContainerRuntimeReady           4m53s (x11 over 54m)  eks-node-monitoring-agent  (combined from similar events): Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 283 -> 298)
  Warning  TerminationGracePeriodExpiring  83s (x24 over 47m)    karpenter                  All pods will be deleted by 2025-02-28T15:25:53Z
  Normal   DisruptionBlocked               82s (x24 over 47m)    karpenter                  Node is deleting or marked for deletion

So the question is: is this a bug, or are we advised to use NodeRepair alongside a terminationGracePeriod in the NodePool? Note that the reason terminationGracePeriod is currently switched off is that node termination hung when nodes ended up in a broken state (kubelet not posting status because CPU was exhausted), and ideally we would like to be able to set it differently for each consolidation reason.
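
For reference, a minimal sketch of what setting terminationGracePeriod on the NodePool looks like (the name, the nodeClassRef, and the 30m value are illustrative, not our actual configuration):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                  # illustrative name
spec:
  template:
    spec:
      # Upper bound on how long Karpenter waits for a graceful drain before
      # force-terminating the node; with this set, PDBs and pod
      # terminationGracePeriodSeconds can delay deletion but not block it
      # past the deadline.
      terminationGracePeriod: 30m
      nodeClassRef:              # placeholder node class reference
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```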

Expected Behavior:

NodeRepair removes the unhealthy node without terminationGracePeriod needing to kick in.

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 1.2.1
  • Kubernetes Version (kubectl version): 1.30
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@andrewhibbert added the kind/bug label on Feb 28, 2025
@k8s-ci-robot added the needs-triage label on Feb 28, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
