[k8s] Force deletion of misbehaving pods #3755
Comments
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
Reopening - ran into this today when a fio benchmark got stalled, and
Ran into this again yesterday with a misbehaving mingpt run.
A user reported that terminating 70B models served with vLLM would stall for 20 minutes.
From the kubelet:
kubectl delete pod NAME --grace-period=0 --force
fixes it. I've seen this issue before when running training on Kubernetes outside of SkyPilot, and IIRC it is related to erring processes leaving file descriptors open, which the kubelet keeps waiting on to be closed.
We should probably have this --grace-period=0 --force logic in our pod termination:
skypilot/sky/provision/kubernetes/instance.py
Lines 616 to 617 in 465d36c
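As a rough illustration of what that logic could look like (not the actual SkyPilot code), here is a minimal Python sketch that tries a normal delete first and escalates to a zero-grace-period delete only if the pod is still around after a timeout. It assumes `core_api` behaves like a `kubernetes.client.CoreV1Api` instance; the function names `pod_exists` and `delete_pod_with_escalation` and the `wait_s` / `poll_s` parameters are hypothetical and chosen only for this example.

```python
import time


def pod_exists(core_api, name, namespace):
    """Return True if the pod is still present on the API server."""
    try:
        core_api.read_namespaced_pod(name=name, namespace=namespace)
        return True
    except Exception as e:
        # Assumption: the client raises an exception with .status == 404
        # (as kubernetes.client.ApiException does) when the pod is gone.
        if getattr(e, "status", None) == 404:
            return False
        raise


def delete_pod_with_escalation(core_api, name, namespace,
                               wait_s=30.0, poll_s=1.0):
    """Delete gracefully; force-delete if the pod is still there after wait_s.

    Returns True if a forced delete was issued, False otherwise.
    Hypothetical helper, not part of SkyPilot or the Kubernetes client.
    """
    # Normal delete: honours the pod's configured termination grace period.
    core_api.delete_namespaced_pod(name=name, namespace=namespace)

    deadline = time.monotonic() + wait_s
    while time.monotonic() < deadline:
        if not pod_exists(core_api, name, namespace):
            return False
        time.sleep(poll_s)

    if not pod_exists(core_api, name, namespace):
        return False

    # API-level counterpart of `kubectl delete pod NAME --grace-period=0 --force`.
    core_api.delete_namespaced_pod(
        name=name, namespace=namespace, grace_period_seconds=0)
    return True
```

With the official Python client, `core_api` would typically be created via `kubernetes.client.CoreV1Api()` after loading a kubeconfig. One caveat on the design: a forced delete removes the pod object from the API server without waiting for confirmation that its containers have actually stopped on the node, which is why this sketch only escalates after the graceful attempt has timed out.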