Fix flaky test_multi_echo
-- change sshd config to support a large number of jobs
#5323
/smoke-test --kubernetes -k test_multi_echo
/smoke-test --aws -k test_multi_echo
/smoke-test --gcp -k test_multi_echo
/smoke-test --azure -k test_multi_echo
@@ -32,7 +32,7 @@ available_node_types:
ray_head_default:
resources: {{instance_resources}}
node_config:
image_id: {{image_id}}
image_id: {{image_id}}
Whitespace reformatted by pre-commit.
sky/skylet/constants.py
Outdated
SSH_MAX_SESSIONS_CONFIG = (
    'sudo bash -c \''
    'echo "MaxSessions 200" >> /etc/ssh/sshd_config; '
    'echo "MaxStartups 150:30:200" >> /etc/ssh/sshd_config; '
    '(systemctl restart sshd || service ssh restart); '
    '\'')
Restarting sshd here is not efficient, as it may cause the setup step to fail and run again. Can we do this in cloud-init instead?
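Because the setup step may run more than once, plain `>>` appends would add the same directives to /etc/ssh/sshd_config on every run. A minimal sketch of one way to build a rerun-safe command follows; the helper name `build_sshd_config_command` is hypothetical and not part of the codebase, and the sketch deliberately leaves out the sshd restart discussed above.

```python
import shlex


def build_sshd_config_command(settings):
    """Build a shell command that appends each sshd directive only if absent.

    Hypothetical helper for illustration; `settings` maps directive name
    to value, e.g. {'MaxSessions': '200'}.
    """
    parts = []
    for key, value in settings.items():
        line = f'{key} {value}'
        quoted = shlex.quote(line)
        # grep -q: quiet, -x: match the whole line, -F: fixed string.
        # The echo runs only when grep does not find the exact line.
        parts.append(f'grep -qxF {quoted} /etc/ssh/sshd_config || '
                     f'echo {quoted} >> /etc/ssh/sshd_config')
    script = '; '.join(parts)
    # Quote the whole script so it is passed to bash as one argument.
    return f'sudo bash -c {shlex.quote(script)}'


command = build_sshd_config_command({
    'MaxSessions': '200',
    'MaxStartups': '150:30:200',
})
print(command)
```

Running the generated command twice leaves a single copy of each directive in the file, which keeps repeated setup attempts from growing the config.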
/smoke-test --aws -k test_multi_echo
/smoke-test --aws -k test_multi_echo
/smoke-test --aws -k test_multi_echo
/smoke-test --azure -k test_multi_echo
/smoke-test --aws -k test_multi_echo
Thanks @zpoint for the quick fix! Could we do a quick profile with multitime -n 5 "sky launch -y --cpus 2"
before and after this change, in case there are any other side effects?
Also, it would be good to run the smoke tests to make sure this does not affect our k8s, AWS and GCP use cases.
branch:
1: sky launch -y --cpus 2
        Mean      Std.Dev.  Min       Median    Max
real    77.243    8.311     69.421    72.582    90.319
user    1.389     0.082     1.302     1.368     1.509
sys     0.175     0.009     0.163     0.177     0.189

branch ===> multitime results
1: sky launch -y --cpus 2
        Mean      Std.Dev.  Min       Median    Max
real    79.777    9.192     71.254    77.500    96.844
user    1.375     0.085     1.272     1.367     1.499
sys     0.178     0.007     0.169     0.177     0.186

Looks similar.
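As a quick check on "looks similar", the gap between the two mean wall-clock times can be compared with the run-to-run spread. This is a minimal sketch that uses only the `real` numbers reported above; the variable names are illustrative, and the runs are labeled by the order in which they are listed.

```python
# Mean and standard deviation of `real` time (seconds) from the two
# multitime runs above, in the order they are listed.
first_mean, first_std = 77.243, 8.311
second_mean, second_std = 79.777, 9.192

# Absolute and relative difference between the two means.
diff = second_mean - first_mean
relative_diff_pct = 100 * diff / first_mean

# Is the gap smaller than the spread observed within either run?
within_one_std = abs(diff) < min(first_std, second_std)

print(f"difference in means: {diff:.3f} s ({relative_diff_pct:.2f}%)")
print(f"smaller than one standard deviation of either run: {within_one_std}")
```

The means differ by about 2.5 s, roughly 3%, which is well inside one standard deviation of either run. With only five samples per run this is a rough sanity check rather than a statistical test.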
/smoke-test --aws
/smoke-test --gcp
/smoke-test --azure
/smoke-test --kubernetes
The GCP failure is due to a function signature change in #5351. It is not related to this PR.
Recently we found that test_multi_echo is highly flaky. Running 150 tasks simultaneously may produce 0-5 errors, while the rest succeed.
Logs from the API server side:
After adding more debug logs, we see this from the API server:
After SSHing into the launched machine, we see this in /var/log/auth.log:
We found a solution and verified that the change works on the launched machine.
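The default value of MaxSessions is 10, and the default value of MaxStartups is 10:30:60. With MaxStartups set to 10:30:60, once 10 unauthenticated connections are pending, sshd starts dropping new connection attempts with 30% probability. That probability rises linearly until, at 60 pending connections, all new attempts are dropped.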
Our test submits 150 jobs in parallel. Even with small time differences in SSH connections and commands, it's very likely to exceed the default 10-session limit.
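The effect of a MaxStartups value of the form start:rate:full can be sketched as a small function. This is an illustrative model of the behavior described in the sshd_config manual (refusal probability rate/100 at `start` pending unauthenticated connections, rising linearly to 100% at `full`), not code from sshd or from this PR; the function name `drop_probability` is made up for the example, and 150:30:200 is the value shown in the review snippet.

```python
def drop_probability(pending: int, start: int, rate: int, full: int) -> float:
    """Model the chance that sshd refuses a new connection attempt.

    `pending` is the number of currently unauthenticated connections;
    `start`, `rate`, `full` are the three fields of MaxStartups.
    """
    if pending < start:
        return 0.0  # below the threshold nothing is dropped
    if pending >= full:
        return 1.0  # at or above `full` every attempt is refused
    base = rate / 100
    # Linear ramp from `base` at `start` up to 1.0 at `full`.
    return base + (1 - base) * (pending - start) / (full - start)


configs = {
    "10:30:60": (10, 30, 60),
    "150:30:200": (150, 30, 200),
}
for label, (start, rate, full) in configs.items():
    for pending in (5, 10, 35, 60, 149):
        p = drop_probability(pending, start, rate, full)
        print(f"MaxStartups {label}, pending={pending}: drop probability {p:.2f}")
```

Under this model, a burst that leaves a few dozen connections unauthenticated at once is already in the drop range with 10:30:60, while 150:30:200 drops nothing until 150 connections are pending.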
So we add the same config to every {cloud}-ray.yml file, so that Ray applies it to all launched VMs. The latency looks acceptable.
Tested (run the relevant ones):
bash format.sh
/smoke-test --aws
/smoke-test --gcp
/smoke-test --azure
/smoke-test --kubernetes