
NodeRepair respects PDBs / terminationGracePeriodSeconds when nodepool does not have terminationGracePeriod set #2042

Open
andrewhibbert opened this issue Feb 28, 2025 · 1 comment
Labels: kind/bug, needs-priority, needs-triage

Comments

@andrewhibbert

Description

Observed Behavior:

I'm not sure whether this is a bug or not. NodeRepair works fine when terminationGracePeriod is set in the NodePool; however, when it isn't set, NodeRepair seems to respect PDBs and pod terminationGracePeriodSeconds, which I found strange:

After stopping kubelet on a node, with a PDB blocking eviction:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    karpenter.sh/disrupted:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-138-100-239.eu-west-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Fri, 28 Feb 2025 14:55:09 +0000
Conditions:
  Type                    Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----                    ------    -----------------                 ------------------                ------                    -------
  MemoryPressure          Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure            Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure             Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  Ready                   Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  ContainerRuntimeReady   True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   ContainerRuntimeIsReady   Monitoring for the ContainerRuntime system is active
  StorageReady            True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   DiskIsReady               Monitoring for the Disk system is active
  NetworkingReady         True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   NetworkingIsReady         Monitoring for the Networking system is active
  KernelReady             True      Fri, 28 Feb 2025 15:48:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   KernelIsReady             Monitoring for the Kernel system is active

Events:
  Type     Reason                 Age                From                       Message
  ----     ------                 ----               ----                       -------
  Warning  ContainerRuntimeReady  57m                eks-node-monitoring-agent  Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 61 -> 76)
  Normal   NodeNotReady           54m                node-controller            Node ip-10-138-100-239.eu-west-1.compute.internal status is now: NodeNotReady
  Normal   MemoryPressure         54m                karpenter                  Status condition transitioned, Type: MemoryPressure, Status: False -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   DiskPressure           54m                karpenter                  Status condition transitioned, Type: DiskPressure, Status: False -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   PIDPressure            54m                karpenter                  Status condition transitioned, Type: PIDPressure, Status: False -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   Ready                  54m                karpenter                  Status condition transitioned, Type: Ready, Status: True -> Unknown, Reason: NodeStatusUnknown, Message: Kubelet stopped posting node status.
  Normal   DisruptionBlocked      54m                karpenter                  Pdb "kube-system/ebs-csi-controller" prevents pod evictions
  Warning  ContainerRuntimeReady  52m                eks-node-monitoring-agent  Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 76 -> 91)
  Normal   DisruptionBlocked      50m (x5 over 67m)  karpenter                  Pdb "kyverno/kyverno-reports-controller" prevents pod evictions
  Warning  ContainerRuntimeReady  47m                eks-node-monitoring-agent  Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 91 -> 105)
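
For context, the DisruptionBlocked events above are produced because evicting the remaining pods would violate PodDisruptionBudgets such as kube-system/ebs-csi-controller. A minimal sketch of that kind of PDB (the selector and minAvailable value are illustrative, not copied from the cluster):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ebs-csi-controller      # name taken from the event above; spec below is illustrative
  namespace: kube-system
spec:
  # While an eviction would drop the number of ready matching pods below this
  # threshold, the eviction API rejects it and Karpenter keeps reporting
  # DisruptionBlocked instead of completing the drain.
  minAvailable: 1
  selector:
    matchLabels:
      app: ebs-csi-controller   # assumed label; the real selector may differ
```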

With pod terminationGracePeriodSeconds blocking eviction:

Taints:             node.kubernetes.io/unreachable:NoExecute
                    karpenter.sh/disrupted:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-138-100-239.eu-west-1.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Fri, 28 Feb 2025 14:55:09 +0000
Conditions:
  Type                    Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----                    ------    -----------------                 ------------------                ------                    -------
  MemoryPressure          Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure            Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure             Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  Ready                   Unknown   Fri, 28 Feb 2025 14:54:39 +0000   Fri, 28 Feb 2025 14:55:53 +0000   NodeStatusUnknown         Kubelet stopped posting node status.
  ContainerRuntimeReady   True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   ContainerRuntimeIsReady   Monitoring for the ContainerRuntime system is active
  StorageReady            True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   DiskIsReady               Monitoring for the Disk system is active
  NetworkingReady         True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   NetworkingIsReady         Monitoring for the Networking system is active
  KernelReady             True      Fri, 28 Feb 2025 16:08:36 +0000   Fri, 28 Feb 2025 14:28:36 +0000   KernelIsReady             Monitoring for the Kernel system is active

Events:
  Type     Reason                          Age                   From                       Message
  ----     ------                          ----                  ----                       -------
  Warning  FailedDraining                  19m (x14 over 45m)    karpenter                  Failed to drain node, 1 pods are waiting to be evicted
  Warning  ContainerRuntimeReady           4m53s (x11 over 54m)  eks-node-monitoring-agent  (combined from similar events): Newrelic-InfraRepeatedRestart: Systemd unit "newrelic-infra.service" has restarted (NRestarts 283 -> 298)
  Warning  TerminationGracePeriodExpiring  83s (x24 over 47m)    karpenter                  All pods will be deleted by 2025-02-28T15:25:53Z
  Normal   DisruptionBlocked               82s (x24 over 47m)    karpenter                  Node is deleting or marked for deletion

So the question is: is this a bug, or are we advised to use NodeRepair alongside a terminationGracePeriod in the NodePool? Note that the reason terminationGracePeriod is currently switched off is that node termination hung when nodes ended up in a broken state (kubelet not posting status because CPU was exhausted), and ideally we would like to be able to set it differently for each consolidation reason.
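
For reference, a minimal sketch of what setting terminationGracePeriod on the NodePool looks like (the name, the nodeClassRef, and the 30m value are illustrative, not our actual configuration):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                  # illustrative name
spec:
  template:
    spec:
      # Upper bound on how long Karpenter waits for a graceful drain before
      # force-terminating the node; with this set, PDBs and pod
      # terminationGracePeriodSeconds can delay deletion but not block it
      # past the deadline.
      terminationGracePeriod: 30m
      nodeClassRef:              # placeholder node class reference
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```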

Expected Behavior:

NodeRepair removes the unhealthy node without terminationGracePeriod needing to kick in.

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 1.2.1
  • Kubernetes Version (kubectl version): 1.30
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@andrewhibbert added the kind/bug label on Feb 28, 2025
@k8s-ci-robot added the needs-triage label on Feb 28, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
