Fix flaky test_multi_echo: change sshd config to support a large number of jobs #5323

Merged 12 commits on Apr 30, 2025
2 changes: 2 additions & 0 deletions sky/backends/backend_utils.py
@@ -798,6 +798,8 @@ def write_cluster_config(
'sky_ray_yaml_local_path': tmp_yaml_path,
'sky_version': str(version.parse(sky.__version__)),
'sky_wheel_hash': wheel_hash,
'ssh_max_sessions_config':
constants.SET_SSH_MAX_SESSIONS_CONFIG_CMD,
# Authentication (optional).
**auth_config,
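
The hunk above adds the command string to the variables handed to the cluster YAML templates, where it appears as `{{ ssh_max_sessions_config }}` under `setup_commands`. The sketch below illustrates that substitution with a small regex-based stand-in for the template engine. It is not Jinja2 and not SkyPilot's rendering code; the `render` helper and the trimmed `TEMPLATE` fragment are hypothetical, and only the command string is copied from this PR.

```python
import re

# Copied from the sky/skylet/constants.py change in this PR.
SET_SSH_MAX_SESSIONS_CONFIG_CMD = (
    'sudo bash -c \''
    'echo "MaxSessions 200" >> /etc/ssh/sshd_config; '
    'echo "MaxStartups 150:30:200" >> /etc/ssh/sshd_config; '
    '(systemctl reload sshd || service ssh reload); '
    '\'')

# Hypothetical, heavily trimmed template fragment (assumption: the real
# templates contain many more setup commands and different indentation).
TEMPLATE = """setup_commands:
  - mkdir -p ~/.ssh;
    {{ ssh_max_sessions_config }}
"""


def render(template: str, variables: dict) -> str:
    """Replace each `{{ name }}` placeholder with variables[name].

    A stand-in for real template rendering; it supports plain variable
    placeholders only (no conditionals, loops, or filters).
    """
    def substitute(match: re.Match) -> str:
        return str(variables[match.group(1)])

    return re.sub(r'\{\{\s*(\w+)\s*\}\}', substitute, template)


rendered = render(
    TEMPLATE, {'ssh_max_sessions_config': SET_SSH_MAX_SESSIONS_CONFIG_CMD})
print(rendered)
```

Because the command is defined once as a constant and injected by name, each provider template needs only the one-line placeholder shown in the hunks that follow.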

18 changes: 18 additions & 0 deletions sky/skylet/constants.py
@@ -280,6 +280,24 @@
# runs on a VM launched by SkyPilot will be recognized as the same user.
USER_ENV_VAR = f'{SKYPILOT_ENV_VAR_PREFIX}USER'

# SSH configuration to allow more concurrent sessions and connections.
# Default MaxSessions is 10.
# Default MaxStartups is 10:30:60, meaning:
# - Below 10 unauthenticated connections, none are dropped.
# - From 10 up to 60, new connections are randomly dropped with a
#   probability that starts at 30% and rises linearly to 100%.
# - At 60 or more, all new connections are dropped.
# These defaults are too low for submitting many parallel jobs (e.g., 150),
# which can easily exceed the limits and cause connection failures.
# The new values (MaxSessions 200, MaxStartups 150:30:200) increase these
# limits significantly.
# TODO(zeping): Bake this configuration in SkyPilot default images.
SET_SSH_MAX_SESSIONS_CONFIG_CMD = (
'sudo bash -c \''
'echo "MaxSessions 200" >> /etc/ssh/sshd_config; '
'echo "MaxStartups 150:30:200" >> /etc/ssh/sshd_config; '
'(systemctl reload sshd || service ssh reload); '
'\'')

# Internal: Env var indicating the system is running with a remote API server.
# It is used for internal purposes, including the jobs controller to mark
# clusters as launched with a remote API server.
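
To make the `MaxStartups start:rate:full` comment above concrete, here is a minimal sketch of the refusal probability as described in sshd_config(5): no refusals below `start`, `rate` percent at `start`, rising linearly to 100 percent at `full`. The function name and signature are illustrative assumptions and are not part of SkyPilot or OpenSSH.

```python
def max_startups_drop_probability(unauthenticated: int, start: int,
                                  rate: int, full: int) -> float:
    """Probability that sshd refuses a new connection attempt.

    `unauthenticated` is the current number of unauthenticated
    connections; `start`, `rate`, `full` are the three fields of
    `MaxStartups start:rate:full`.
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    # Linear interpolation from rate% at `start` to 100% at `full`.
    base = rate / 100.0
    return base + (1.0 - base) * (unauthenticated - start) / (full - start)


# Compare the values quoted in the comment (10:30:60) with the new
# values from this PR (150:30:200) at a few load levels.
for pending in (5, 10, 35, 60, 150):
    old = max_startups_drop_probability(pending, 10, 30, 60)
    new = max_startups_drop_probability(pending, 150, 30, 200)
    print(f'{pending:>3} pending: old={old:.2f} new={new:.2f}')
```

Under this model, a burst of around 150 simultaneous unauthenticated connections is refused outright with the 10:30:60 values, while the new values only begin refusing probabilistically at that level.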
3 changes: 2 additions & 1 deletion sky/templates/aws-ray.yml.j2
@@ -142,7 +142,7 @@ available_node_types:
{%- endfor %}
# Use IDMSv2
MetadataOptions:
HttpTokens: required
HttpTokens: required

head_node_type: ray.head.default

@@ -192,6 +192,7 @@ setup_commands:
{%- if remote_identity != 'LOCAL_CREDENTIALS' %}
rm ~/.aws/credentials || true;
{%- endif %}
{{ ssh_max_sessions_config }}

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
3 changes: 2 additions & 1 deletion sky/templates/azure-ray.yml.j2
@@ -68,7 +68,7 @@ available_node_types:
imageOffer: {{image_offer}}
imageSku: "{{image_sku}}"
imageVersion: {{image_version}}
# Community Gallery Image ID
# Community Gallery Image ID
communityGalleryImageId: {{community_gallery_image_id}}
osDiskSizeGB: {{disk_size}}
osDiskTier: {{disk_tier}}
@@ -130,3 +130,4 @@ setup_commands:
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
sudo mv /etc/nccl.conf /etc/nccl.conf.bak || true;
{{ ssh_max_sessions_config }}
1 change: 1 addition & 0 deletions sky/templates/cudo-ray.yml.j2
@@ -73,3 +73,4 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
{{ ssh_max_sessions_config }}
5 changes: 3 additions & 2 deletions sky/templates/do-ray.yml.j2
@@ -93,6 +93,7 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
{{ ssh_max_sessions_config }}

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
2 changes: 1 addition & 1 deletion sky/templates/fluidstack-ray.yml.j2
@@ -74,4 +74,4 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');

{{ ssh_max_sessions_config }}
2 changes: 1 addition & 1 deletion sky/templates/gcp-ray.yml.j2
@@ -1,4 +1,3 @@

cluster_name: {{cluster_name_on_cloud}}

# The maximum number of workers nodes to launch in addition to the head node.
@@ -225,6 +224,7 @@ setup_commands:
{%- endif %}
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
{{ ssh_max_sessions_config }}

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
6 changes: 3 additions & 3 deletions sky/templates/ibm-ray.yml.j2
@@ -32,7 +32,7 @@ available_node_types:
ray_head_default:
resources: {{instance_resources}}
node_config:
image_id: {{image_id}}
image_id: {{image_id}}
Review comment from the PR author: Space formatted by pre-commit

boot_volume_capacity: {{disk_capacity}}
volume_tier_name: general-purpose
instance_profile_name: {{instance_type}}
@@ -48,7 +48,7 @@ available_node_types:
max_workers: {{num_nodes - 1}}
resources: {{worker_instance_resources}}
node_config:
image_id: {{image_id}}
image_id: {{image_id}}
boot_volume_capacity: {{disk_capacity}}
volume_tier_name: general-purpose
instance_profile_name: {{worker_instance_type}}
@@ -106,7 +106,7 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf'); # This is needed for `-o allow_other` option for `goofys`;

{{ ssh_max_sessions_config }}

# Command to start ray on the head node. You don't need to change this.
# NOTE: these are very performance-sensitive. Each new item opens/closes an SSH
4 changes: 3 additions & 1 deletion sky/templates/kubernetes-ray.yml.j2
@@ -661,7 +661,7 @@ available_node_types:
{{k8s_resource_key}}: {{accelerator_count}}
{% endif %}
{% endif %}

{% if high_availability %}
pvc_spec:
apiVersion: v1
@@ -747,6 +747,7 @@ available_node_types:
mountPath: /mnt/home # Temporary mount point for initialization
# should be replaced by pod spec
{% endif %}

setup_commands:
# Disable `unattended-upgrades` to prevent apt-get from hanging. It should be called at the beginning before the process started to avoid being blocked. (This is a temporary fix.)
# Add ~/.ssh/sky-cluster-key to SSH config to allow nodes within a cluster to connect to each other
@@ -799,6 +800,7 @@ setup_commands:
# must and see if there's a workaround to grant minimum permission.
sudo chmod 777 /tmp/tpu_logs;
{% endif %}
{{ ssh_max_sessions_config }}

# Format: `REMOTE_PATH : LOCAL_PATH`
file_mounts: {
1 change: 1 addition & 0 deletions sky/templates/lambda-ray.yml.j2
@@ -96,6 +96,7 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
{{ ssh_max_sessions_config }}

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
1 change: 1 addition & 0 deletions sky/templates/nebius-ray.yml.j2
@@ -140,3 +140,4 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
{{ ssh_max_sessions_config }}
2 changes: 1 addition & 1 deletion sky/templates/oci-ray.yml.j2
@@ -91,7 +91,7 @@ setup_commands:
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
sudo iptables -I INPUT -i ens3 -m state --state ESTABLISHED,RELATED,NEW -j ACCEPT;
{{ ssh_max_sessions_config }}

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.

1 change: 1 addition & 0 deletions sky/templates/paperspace-ray.yml.j2
@@ -91,3 +91,4 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
{{ ssh_max_sessions_config }}
1 change: 1 addition & 0 deletions sky/templates/runpod-ray.yml.j2
@@ -90,6 +90,7 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
{{ ssh_max_sessions_config }}

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
1 change: 1 addition & 0 deletions sky/templates/scp-ray.yml.j2
@@ -77,6 +77,7 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf'); # This is needed for `-o allow_other` option for `goofys`;
{{ ssh_max_sessions_config }}

# Command to start ray on the head node. You don't need to change this.
# NOTE: these are very performance-sensitive. Each new item opens/closes an SSH
2 changes: 1 addition & 1 deletion sky/templates/vast-ray.yml.j2
@@ -64,7 +64,7 @@ setup_commands:
sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 1048576" >> /etc/security/limits.conf; echo "* hard nofile 1048576" >> /etc/security/limits.conf';
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
(grep -Pzo -q "Host \*\n StrictHostKeyChecking no" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n" >> ~/.ssh/config;

{{ ssh_max_sessions_config }}

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
1 change: 1 addition & 0 deletions sky/templates/vsphere-ray.yml.j2
@@ -71,3 +71,4 @@ setup_commands:
sudo grep -e '^DefaultTasksMax' /etc/systemd/system.conf || (sudo bash -c 'echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf'); sudo systemctl set-property user-$(id -u $(whoami)).slice TasksMax=infinity; sudo systemctl daemon-reload;
mkdir -p ~/.ssh; (grep -Pzo -q "Host \*\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa" ~/.ssh/config) || printf "Host *\n StrictHostKeyChecking no\n IdentityFile ~/.ssh/sky-cluster-key\n IdentityFile ~/.ssh/id_rsa\n" >> ~/.ssh/config;
[ -f /etc/fuse.conf ] && sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf || (sudo sh -c 'echo "user_allow_other" > /etc/fuse.conf');
{{ ssh_max_sessions_config }}
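
Each template above runs the same command, which appends `MaxSessions` and `MaxStartups` to `/etc/ssh/sshd_config`. According to sshd_config(5), sshd uses the first value obtained for each keyword, so an appended line only takes effect if the keyword is not already set (uncommented) earlier in the configuration. The sketch below models that rule on a config given as text, as a way to reason about whether the appended values would win. It is a simplified assumption-laden model: the helper name is hypothetical, and it ignores `Include` and `Match` handling. On a real host, `sshd -T` reports the effective configuration.

```python
def effective_sshd_options(config_text: str) -> dict:
    """Return the first value seen for each keyword in an sshd_config text.

    Keywords are matched case-insensitively and stored lowercased.
    Comments and blank lines are skipped. `Include` and `Match` blocks
    are NOT interpreted (simplifying assumption).
    """
    effective = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        # First occurrence wins; later duplicates are ignored.
        effective.setdefault(keyword, value)
    return effective


# Case 1: the stock file only has commented-out defaults, so the
# appended lines are the first real occurrences.
stock = '#MaxSessions 10\n#MaxStartups 10:30:100\n'
appended = stock + 'MaxSessions 200\nMaxStartups 150:30:200\n'
print(effective_sshd_options(appended))

# Case 2: an image that already sets MaxSessions explicitly keeps its
# earlier value even after the append.
preset = 'MaxSessions 10\n' + 'MaxSessions 200\nMaxStartups 150:30:200\n'
print(effective_sshd_options(preset))
```

If an image is found to set these keywords already, an in-place edit of the existing line would be one alternative to appending; that is outside what this PR changes.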