[k8s] Force deletion of misbehaving pods #3755

romilbhardwaj · 2024-07-16T21:34:15Z

A user reported that terminating 70B models served with vLLM would stall for 20 min.

service:
  readiness_probe: /v1/models

resources:
  # Can change to use more via `--gpus A100:N`.  N can be 1 to 8.
  accelerators: A100:2
  cpus: 22
  memory: 500
  # Note: Big models need LOTS of disk space, especially if saved in float32.
  # So specify a lot of disk.
  disk_size: 400
  # Keep fixed.
  cloud: kubernetes
  ports: 8000
  image_id: docker:vllm/vllm-openai:latest

envs:
  # Specify the model to serve via `--env MODEL=<>`
  MODEL: ""

setup: |
  conda deactivate
  python3 -c "import huggingface_hub; huggingface_hub.login('${HUGGINGFACE_TOKEN}')"

run: |
  conda deactivate
  python3 -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --trust-remote-code \
    --model $MODEL

From the kubelet:

error killing pod: [failed to "KillContainer" for "ray-node" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "57e4f054-de56-4a2e-ad68-bd1d786fb02a" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container \"b0da7a28782608dd40df232ac4cb8d75b11a5fe64eeb45745a6dfa6bceee87b7\": failed to kill container \"b0da7a28782608dd40df232ac4cb8d75b11a5fe64eeb45745a6dfa6bceee87b7\": context deadline exceeded: unknown"]

Running kubectl delete pod NAME --grace-period=0 --force fixes it. I've seen this issue before when running training on Kubernetes outside of SkyPilot, and IIRC it is related to failing processes leaving file descriptors open, which the kubelet then keeps waiting on to be closed.

We should probably have this --grace-period=0 --force logic in our pod termination:

skypilot/sky/provision/kubernetes/instance.py

Lines 616 to 617 in 465d36c

    
    kubernetes.core_api().delete_namespaced_pod(
        pod_name, namespace, _request_timeout=config_lib.DELETION_TIMEOUT)
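
For illustration, the force-delete equivalent of that call might look roughly like the sketch below. It assumes `core_api` is a `kubernetes.client.CoreV1Api` instance (what `kubernetes.core_api()` appears to return), and relies on the client's `grace_period_seconds` keyword, which maps to `gracePeriodSeconds` in the delete options. The function name `force_delete_pod` is hypothetical and is not taken from the SkyPilot codebase.

```python
from typing import Any, Optional


def force_delete_pod(core_api: Any,
                     pod_name: str,
                     namespace: str,
                     request_timeout: Optional[float] = None) -> None:
    """Delete a pod without waiting for graceful termination.

    `core_api` is assumed to be a kubernetes.client.CoreV1Api instance,
    or any object exposing the same delete_namespaced_pod method.
    """
    core_api.delete_namespaced_pod(
        pod_name,
        namespace,
        # gracePeriodSeconds=0 is what `kubectl delete --grace-period=0
        # --force` sends: the API server drops the pod object right away
        # instead of waiting for the kubelet to confirm the kill.
        grace_period_seconds=0,
        _request_timeout=request_timeout)
```

Note that a forced delete only removes the pod object from the API server; the container may keep running on the node until the runtime manages to kill it.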

github-actions · 2024-11-14T01:59:13Z

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions · 2024-11-24T02:07:36Z

This issue was closed because it has been stalled for 10 days with no activity.

romilbhardwaj · 2025-04-10T22:28:26Z

Reopening: ran into this today when a fio benchmark stalled and sky down wouldn't work. Had to manually run kubectl delete pod --force.
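
One way to avoid the manual step would be for teardown to escalate on its own: issue a normal delete, wait a bounded time for the pod to disappear, then force-delete if it is still there. The sketch below is only an illustration under assumptions, not how the fix was implemented. It assumes `core_api` behaves like `kubernetes.client.CoreV1Api` and that a missing pod raises an exception carrying `status == 404` (as the client's `ApiException` does). The function name, timeouts, and the injectable `sleep` are hypothetical.

```python
import time
from typing import Any, Callable


def _is_not_found(exc: Exception) -> bool:
    # kubernetes.client ApiException exposes the HTTP code as `.status`.
    return getattr(exc, 'status', None) == 404


def delete_pod_with_fallback(
        core_api: Any,
        pod_name: str,
        namespace: str,
        wait_seconds: float = 60.0,
        poll_seconds: float = 2.0,
        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Delete gracefully first; force-delete if the pod lingers.

    Returns True if the force path was taken, False if the pod went
    away on its own within `wait_seconds`.
    """
    core_api.delete_namespaced_pod(pod_name, namespace)

    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        try:
            core_api.read_namespaced_pod(pod_name, namespace)
        except Exception as e:
            if _is_not_found(e):
                return False  # Pod is gone; nothing more to do.
            raise
        sleep(poll_seconds)

    # Pod is still present after the wait: escalate to a forced delete.
    try:
        core_api.delete_namespaced_pod(pod_name,
                                       namespace,
                                       grace_period_seconds=0)
    except Exception as e:
        # The pod may have vanished between the last poll and this call.
        if not _is_not_found(e):
            raise
    return True
```

Keeping the graceful attempt first means well-behaved pods still get their normal shutdown, and only stuck ones are removed forcibly.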

romilbhardwaj · 2025-04-25T13:56:26Z

Ran into this again yesterday with a misbehaving mingpt run.

romilbhardwaj added the k8s Kubernetes related items label Jul 16, 2024

github-actions bot added the Stale label Nov 14, 2024

github-actions bot closed this as not planned Nov 24, 2024

romilbhardwaj reopened this Apr 10, 2025

github-actions bot removed the Stale label Apr 11, 2025

romilbhardwaj mentioned this issue Apr 25, 2025

[k8s] Force terminate misbehaving pods #5370

Merged

romilbhardwaj closed this as completed in #5370 Apr 27, 2025
