Fix flaky of test_multi_echo -- change sshd config to support large number of jobs #5323

Open
wants to merge 11 commits into
base: master

zpoint
Collaborator

@zpoint zpoint commented Apr 23, 2025

Recently we found that test_multi_echo is highly flaky.

Running 150 tasks simultaneously may result in 0-5 errors, while the rest succeed.

Logs from the API server side:

I 04-22 10:12:12 executor.py:380] Request 86e73a33-861e-4b21-bb78-07225617a187 failed due to sky.exceptions.CommandError: Command $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c 'import o... failed with return code 255.
I 04-22 10:12:12 executor.py:380] Failed to fetch job id.
I 04-22 10:12:12 executor.py:380] mux_client_request_session: session request failed: Session open refused by peer
I 04-22 10:12:12 executor.py:380] kex_exchange_identification: read: Connection reset by peer
I 04-22 10:12:12 executor.py:380] 

I 04-22 10:12:12 executor.py:380] Request 7e048731-79f6-4254-9d57-3e0d998c2f87 failed due to sky.exceptions.CommandError: Command $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c 'import o... failed with return code 255.
I 04-22 10:12:12 executor.py:380] Failed to fetch job id.
I 04-22 10:12:12 executor.py:380] mux_client_request_session: session request failed: Session open refused by peer
I 04-22 10:12:12 executor.py:380] 

After adding more debug logs, the API server shows:

Command ssh -T -i /home/buildkite/.sky/clients/9ff33f71/ssh/sky-key -o Port=22 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_9ff33f71/d6e26b6546/%C -o ControlPersist=300s [email protected] '/bin/bash --login -c '"'"'true && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && ($([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c '"'"'"'"'"'"'"'"'import os;import getpass;import sys;from sky import exceptions;from sky.skylet import log_lib, job_lib, constants;
if int(constants.SKYLET_VERSION) < 9: raise RuntimeError("SkyPilot runtime is too old, which does not support submitting jobs.");
job_id = job_lib.add_job('"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'-'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'9ff33f71'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'sky-2025-04-22-13-46-36-490196'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'1x[T4:0.1]'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"');print("Job ID: " + str(job_id), flush=True)'"'"'"'"'"'"'"'"')'"'"'' failed with return code 255, stdout: , stderr: mux_client_request_session: session request failed: Session open refused by peer^M
kex_exchange_identification: read: Connection reset by peer^M

Command ssh -T -i /home/buildkite/.sky/clients/9ff33f71/ssh/sky-key -o Port=22 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_9ff33f71/d6e26b6546/%C -o ControlPersist=300s [email protected] '/bin/bash --login -c '"'"'true && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && ($([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c '"'"'"'"'"'"'"'"'import os;import getpass;import sys;from sky import exceptions;from sky.skylet import log_lib, job_lib, constants;
if int(constants.SKYLET_VERSION) < 9: raise RuntimeError("SkyPilot runtime is too old, which does not support submitting jobs.");
job_id = job_lib.add_job('"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'-'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'9ff33f71'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'sky-2025-04-22-13-46-36-526405'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'1x[T4:0.1]'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"');print("Job ID: " + str(job_id), flush=True)'"'"'"'"'"'"'"'"')'"'"'' failed with return code 255, stdout: , stderr: mux_client_request_session: session request failed: Session open refused by peer^M

Command ssh -T -i /home/buildkite/.sky/clients/9ff33f71/ssh/sky-key -o Port=22 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_9ff33f71/d6e26b6546/%C -o ControlPersist=300s [email protected] '/bin/bash --login -c '"'"'true && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && ($([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u -c '"'"'"'"'"'"'"'"'import os;import getpass;import sys;from sky import exceptions;from sky.skylet import log_lib, job_lib, constants;
if int(constants.SKYLET_VERSION) < 9: raise RuntimeError("SkyPilot runtime is too old, which does not support submitting jobs.");
job_id = job_lib.add_job('"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'-'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'9ff33f71'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'sky-2025-04-22-13-46-36-530424'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"','"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'1x[T4:0.1]'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"');print("Job ID: " + str(job_id), flush=True)'"'"'"'"'"'"'"'"')'"'"'' failed with return code 255, stdout: , stderr: mux_client_request_session: session request failed: Session open refused by peer^M

After SSHing into the launched machine, /var/log/auth.log shows:

Apr 23 06:17:37 ip-172-31-38-224 sshd[11504]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
Apr 23 06:17:37 ip-172-31-38-224 systemd-logind[702]: New session 112 of user ubuntu.
Apr 23 06:17:37 ip-172-31-38-224 sshd[1453]: error: no more sessions
Apr 23 06:17:37 ip-172-31-38-224 sshd[1453]: error: no more sessions
Apr 23 06:17:37 ip-172-31-38-224 sshd[928]: error: beginning MaxStartups throttling

Apr 23 06:14:42 ip-172-31-38-224 systemd-logind[702]: Watching system buttons on /dev/input/event2 (AT Translated Set 2 keyboard)
Apr 23 06:14:42 ip-172-31-38-224 sshd[928]: Server listening on 0.0.0.0 port 22.
Apr 23 06:14:42 ip-172-31-38-224 sshd[928]: Server listening on :: port 22.
Apr 23 06:14:44 ip-172-31-38-224 sshd[1017]: error: kex_exchange_identification: Connection closed by remote host
Apr 23 06:14:44 ip-172-31-38-224 sshd[1017]: Connection closed by 52.207.251.14 port 51096
Apr 23 06:14:45 ip-172-31-38-224 sshd[1035]: Accepted publickey for ubuntu from 52.207.251.14 port 51108 ssh2: RSA SHA256:/bpcJ4Q/D4pKs7Opn82WGI3FDAnII6gm7c3jXAgpnmk
Apr 23 06:14:45 ip-172-31-38-224 sshd[1035]: pam_unix(sshd:session): session opened for user ubuntu(uid=1000) by (uid=0)
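
To check how often these limits are hit on a node, the two sshd error messages in the auth.log excerpt above can be counted. This is a minimal sketch: the helper name count_sshd_limit_errors and the pattern names are hypothetical, and the message wording is copied from this log, so other sshd versions may phrase it differently.

```python
import re
from collections import Counter

# Message patterns copied from the auth.log excerpt above; other sshd
# versions or distributions may word these errors differently.
LIMIT_ERROR_PATTERNS = {
    "no_more_sessions": re.compile(r"sshd\[\d+\]: error: no more sessions"),
    "maxstartups_throttling": re.compile(
        r"sshd\[\d+\]: error: beginning MaxStartups throttling"),
}


def count_sshd_limit_errors(log_text: str) -> Counter:
    """Count log lines that show an sshd session or startup limit being hit."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in LIMIT_ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Lines taken from the excerpt above.
sample_log = """\
Apr 23 06:17:37 ip-172-31-38-224 systemd-logind[702]: New session 112 of user ubuntu.
Apr 23 06:17:37 ip-172-31-38-224 sshd[1453]: error: no more sessions
Apr 23 06:17:37 ip-172-31-38-224 sshd[1453]: error: no more sessions
Apr 23 06:17:37 ip-172-31-38-224 sshd[928]: error: beginning MaxStartups throttling
"""
limit_errors = count_sshd_limit_errors(sample_log)
print(dict(limit_errors))
```

On a real node the input would be the contents of /var/log/auth.log rather than the inline sample.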

Found a solution and verified that the following change works on the launched machine:

sudo vim /etc/ssh/sshd_config

# Allow 150+ concurrent sessions
MaxSessions 200 

# Allow 150 unauthenticated connections, with gradual throttling
MaxStartups 150:30:200

sudo systemctl restart sshd

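The default value of MaxSessions is 10. The default value of MaxStartups is 10:30:100 in current OpenSSH releases (10:30:60 is the example given in the sshd_config man page).

MaxSessions limits the number of open sessions per network connection, which matters here because the commands are multiplexed over a single ControlMaster connection. MaxStartups start:rate:full limits concurrent unauthenticated connections: once there are "start" of them, sshd refuses new attempts with a probability of rate%, and that probability increases linearly until every attempt is refused at "full".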

Our test submits 150 jobs in parallel. Even with small time differences in SSH connections and commands, it's very likely to exceed the default 10-session limit.
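
For intuition about the start:rate:full form of MaxStartups, the sketch below computes the refusal probability for a given number of unauthenticated connections, following the linear ramp described in the sshd_config man page. The function name drop_probability is hypothetical, and this uses floating point for readability, so it approximates the documented behavior rather than reproducing sshd's own code.

```python
def drop_probability(unauthenticated: int, start: int, rate: int, full: int) -> float:
    """Approximate chance (0.0 to 1.0) that sshd refuses a new connection.

    Assumes full > start, as in "MaxStartups start:rate:full".
    """
    if unauthenticated < start:
        return 0.0  # below the threshold nothing is refused
    if unauthenticated >= full:
        return 1.0  # at or above "full" every attempt is refused
    # Linear ramp from rate% at "start" up to 100% at "full".
    ramp = (100 - rate) * (unauthenticated - start) / (full - start)
    return (rate + ramp) / 100.0


# Values from this PR: MaxStartups 150:30:200
for n in (149, 150, 175, 200):
    print(n, drop_probability(n, start=150, rate=30, full=200))
```

With 150:30:200, a burst of 150 parallel submissions only reaches the throttling range if all of them are unauthenticated at the same moment.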

So we add the same config to every {cloud}-ray.yml file, so that Ray applies it to all launched VMs.

(base) ubuntu@ip-172-31-7-225:~$ time sudo systemctl reload sshd

real	0m0.038s
user	0m0.003s
sys	0m0.004s
(base) ubuntu@ip-172-31-7-225:~$ time sudo systemctl reload sshd

real	0m0.034s
user	0m0.005s
sys	0m0.003s
(base) ubuntu@ip-172-31-7-225:~$ time sudo systemctl reload sshd

real	0m0.036s
user	0m0.003s
sys	0m0.004s
(base) ubuntu@ip-172-31-7-225:~$ time sudo systemctl reload sshd

real	0m0.033s
user	0m0.004s
sys	0m0.003s
(base) ubuntu@ip-172-31-7-225:~$ time sudo systemctl reload sshd

real	0m0.033s
user	0m0.001s
sys	0m0.006s

The latency looks acceptable.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test --aws
  • All smoke tests: /smoke-test --gcp
  • All smoke tests: /smoke-test --azure
  • All smoke tests: /smoke-test --kubernetes

@zpoint zpoint requested a review from Michaelvll April 23, 2025 13:17
@zpoint zpoint changed the title from "Fix flaky of test_multi_echo -- increase sshd config to support large number of jobs" to "Fix flaky of test_multi_echo -- change sshd config to support large number of jobs" Apr 23, 2025
@zpoint
Collaborator Author

zpoint commented Apr 23, 2025

/smoke-test --kubernetes -k test_multi_echo

@zpoint
Collaborator Author

zpoint commented Apr 23, 2025

/smoke-test --aws -k test_multi_echo

@zpoint
Collaborator Author

zpoint commented Apr 23, 2025

/smoke-test --gcp -k test_multi_echo

@zpoint
Collaborator Author

zpoint commented Apr 23, 2025

/smoke-test --azure -k test_multi_echo

@@ -32,7 +32,7 @@ available_node_types:
ray_head_default:
resources: {{instance_resources}}
node_config:
-image_id: {{image_id}}
+image_id: {{image_id}}
Collaborator Author

Whitespace reformatted by pre-commit.

Comment on lines 281 to 286
SSH_MAX_SESSIONS_CONFIG = (
    'sudo bash -c \''
    'echo "MaxSessions 200" >> /etc/ssh/sshd_config; '
    'echo "MaxStartups 150:30:200" >> /etc/ssh/sshd_config; '
    '(systemctl restart sshd || service ssh restart); '
    '\'')
Collaborator

Restarting ssh here may not be efficient, as it could cause the setup step to fail and run again. Can we do it in cloud-init instead?
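
If the setup step does run more than once, the echo ... >> lines in the snippet under review would append the same settings again on each run. One possible alternative, sketched here and not part of this PR, is to rewrite the config text idempotently. The helper name upsert_sshd_options is hypothetical. The sketch assumes whitespace-separated "Keyword value" lines and does not handle Match blocks (settings appended after a Match line would fall inside that block).

```python
def upsert_sshd_options(config_text: str, options: dict[str, str]) -> str:
    """Return config_text with each option set exactly once.

    The first uncommented occurrence of a keyword is replaced, later
    duplicates are dropped, and missing keywords are appended at the end.
    """
    pending = {key.lower(): (key, value) for key, value in options.items()}
    seen = set()
    out = []
    for line in config_text.splitlines():
        stripped = line.strip()
        keyword = None
        if stripped and not stripped.startswith("#"):
            # sshd_config keywords are case-insensitive.
            keyword = stripped.split(None, 1)[0].lower()
        if keyword in pending:
            if keyword not in seen:
                key, value = pending[keyword]
                out.append(f"{key} {value}")
                seen.add(keyword)
            continue  # skip the original line and any later duplicates
        out.append(line)
    for keyword, (key, value) in pending.items():
        if keyword not in seen:
            out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


# Values from this PR.
OPTIONS = {"MaxSessions": "200", "MaxStartups": "150:30:200"}
original = "#MaxSessions 10\nPort 22\n"
once = upsert_sshd_options(original, OPTIONS)
twice = upsert_sshd_options(once, OPTIONS)
print(once)
print(once == twice)
```

Applying the function a second time leaves the text unchanged, so a repeated setup step would not grow the file.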

Collaborator Author

The existing SSH connection won't be affected.

However, I can use a safer approach like reload instead.
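
A reload-based variant of the command could look like the sketch below. It is hypothetical and not necessarily what was merged: the function name build_sshd_reload_command is made up for illustration, and it assumes sshd is on root's PATH and that the service is reachable as either sshd (systemd) or ssh (the service wrapper). The sshd -t step checks the config file for validity before the reload is attempted.

```python
import shlex


def build_sshd_reload_command() -> str:
    """Build a shell command that validates sshd_config, then reloads sshd."""
    return (
        "sudo bash -c '"
        # "sshd -t" only checks the configuration and exits non-zero on errors.
        "sshd -t && "
        # Reload instead of restart; fall back to the service wrapper.
        "(systemctl reload sshd || service ssh reload)"
        "'"
    )


reload_command = build_sshd_reload_command()
print(reload_command)
# Confirm the quoting: everything after "-c" must be one argument.
print(shlex.split(reload_command))
```

Because of the &&, an invalid config stops the command before any reload happens.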

@zpoint
Collaborator Author

zpoint commented Apr 24, 2025

/smoke-test --aws -k test_multi_echo

@zpoint
Collaborator Author

zpoint commented Apr 24, 2025

/smoke-test --aws -k test_multi_echo

@zpoint
Collaborator Author

zpoint commented Apr 24, 2025

/smoke-test --aws -k test_multi_echo

@zpoint
Collaborator Author

zpoint commented Apr 24, 2025

/smoke-test --azure -k test_multi_echo

@zpoint
Collaborator Author

zpoint commented Apr 24, 2025

/smoke-test --aws -k test_multi_echo

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @zpoint for the quick fix! Could we do a quick profile of multitime -n 5 "sky launch -y --cpus 2" before and after this change, in case there are any other side effects?

Also, it would be good to run a smoke test to make sure it does not affect our k8s, AWS and GCP use cases.

@zpoint
Collaborator Author

zpoint commented Apr 26, 2025

branch: dev/zeping/update_sshd_config

1: sky launch -y --cpus 2
            Mean        Std.Dev.    Min         Median      Max
real        77.243      8.311       69.421      72.582      90.319      
user        1.389       0.082       1.302       1.368       1.509       
sys         0.175       0.009       0.163       0.177       0.189  

branch: master

===> multitime results
1: sky launch -y --cpus 2
            Mean        Std.Dev.    Min         Median      Max
real        79.777      9.192       71.254      77.500      96.844      
user        1.375       0.085       1.272       1.367       1.499       
sys         0.178       0.007       0.169       0.177       0.186 

The results look similar.
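
As a rough sanity check on that reading, the sketch below compares the two "real" means from the multitime output above against their standard deviations. The class and function names are hypothetical, and "difference within one standard deviation" is only a crude heuristic for five runs, not a statistical test.

```python
from dataclasses import dataclass


@dataclass
class TimingSummary:
    mean: float     # seconds
    std_dev: float  # seconds


def within_noise(a: TimingSummary, b: TimingSummary) -> bool:
    """True if the means differ by no more than the smaller standard deviation."""
    return abs(a.mean - b.mean) <= min(a.std_dev, b.std_dev)


# "real" rows from the multitime results above.
pr_branch = TimingSummary(mean=77.243, std_dev=8.311)
master = TimingSummary(mean=79.777, std_dev=9.192)

difference = abs(pr_branch.mean - master.mean)
print(round(difference, 3), within_noise(pr_branch, master))
```

The difference of about 2.5 seconds is well inside the run-to-run spread on both branches.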

@zpoint
Collaborator Author

zpoint commented Apr 26, 2025

/smoke-test --aws

@zpoint
Collaborator Author

zpoint commented Apr 26, 2025

/smoke-test --gcp

@zpoint
Collaborator Author

zpoint commented Apr 26, 2025

/smoke-test --azure

@zpoint
Collaborator Author

zpoint commented Apr 26, 2025

/smoke-test --kubernetes

@zpoint zpoint requested a review from Michaelvll April 26, 2025 07:26
@zpoint
Collaborator Author

zpoint commented Apr 26, 2025

The GCP failure is due to a function signature change in #5351.

It's not related to this PR.
