ray: [tune] rsync fails on azure cluster
What is the problem?
ray 1.5.0
When running Tune on a Ray cluster in Azure, I get "ERROR syncer.py:190 -- Sync execution failed" after every trial.
Reproduction (REQUIRED)
I created a Ray cluster in Azure and launched a Tune experiment on it by attaching to the head node and running the tuning script from there.
After every trial I get the following error:
2021-08-11 13:08:27,356 INFO commands.py:298 -- Checking Azure environment settings
2021-08-11 13:08:27,364 ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 186, in sync_down
result = self.sync_client.sync_down(self._remote_path,
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/integration/docker.py", line 102, in sync_down
rsync(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/sdk.py", line 140, in rsync
return commands.rsync(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1070, in rsync
config = _bootstrap_config(config, no_config_cache=no_config_cache)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 315, in _bootstrap_config
resolved_config = provider_cls.bootstrap_config(config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 309, in bootstrap_config
return bootstrap_azure(cluster_config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 22, in bootstrap_azure
config = _configure_resource_group(config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 37, in _configure_resource_group
resource_client = _get_client(ResourceManagementClient, config)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 31, in _get_client
return get_client_from_cli_profile(client_class=client_class, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/client_factory.py", line 83, in get_client_from_cli_profile
credentials, subscription_id, tenant_id = get_azure_cli_credentials(
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/credentials.py", line 98, in get_azure_cli_credentials
cred, subscription_id, tenant_id = profile.get_login_credentials(resource=resource)
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 546, in get_login_credentials
account = self.get_subscription(subscription_id)
File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 505, in get_subscription
raise CLIError(_AZ_LOGIN_MESSAGE)
knack.util.CLIError: Please run 'az login' to setup account.
2021-08-11 13:08:27,365 INFO logger.py:697 -- Removed the following hyperparameter values when logging to tensorboard: {'hyperparameters/model_layers_intercept': (64, 128, 32), 'hyperparameters/model_layers_slope': (128, 256, 64)}
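The last frames of the traceback show the Azure CLI profile lookup failing in the process that performs the sync. A minimal sketch for checking whether that environment has an active Azure CLI login is given here. It assumes the `az` executable is on the PATH wherever the check runs (for example inside the head node's container); the helper name `az_logged_in` is hypothetical and not part of Ray or the Azure SDK.

```python
import subprocess


def az_logged_in(run=subprocess.run) -> bool:
    """Return True if `az account show` exits with status 0.

    `run` is injectable so the check can be exercised without the Azure CLI.
    """
    try:
        result = run(
            ["az", "account", "show"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The `az` executable is not installed in this environment.
        return False
    # A non-zero exit status means the CLI could not resolve an account,
    # which is the situation the CLIError in the traceback reports.
    return result.returncode == 0
```

Calling `az_logged_in()` from the same container that runs the tuning script should indicate whether the `az login -i` in `initialization_commands` is visible to that process or only to the host VM.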
This is my cluster yaml file:
cluster_name: my-private-cluster
max_workers: 10
target_utilization_fraction: 0.8
idle_timeout_minutes: 30
docker:
    head_image: "myregistry.azurecr.io/custom-ay-ml-cpu:latest"
    worker_image: "myregistry.azurecr.io/custom-ray-ml-gpu:latest"
    container_name: "ray_py38_1.5.0_gpu"
    pull_before_run: True
    run_options: # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: westeurope
    resource_group: my-group-cluster
    # set subscription id otherwise the default from az cli will be used
    subscription_id: xxxxxx-xxxxxx-xxxxxx-xxxx-xxxx # (masked subscription id)
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub
available_node_types:
    node_cpu_2:
        min_workers: 0
        max_workers: 3
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: 21.07.12
    node_gpu_1_cpu_4:
        min_workers: 1
        max_workers: 2
        resources: { "CPU": 4, "GPU": 1 }
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC4as_T4_v3
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: 21.07.12
head_node_type: node_cpu_2
file_mounts:
    ~/.ssh/id_rsa.pub: "~/.ssh/id_rsa.pub"
    ~/my-project: "."
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
    - ".venv"
    - ".venv/**"
rsync_filter:
    - ".gitignore"
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10 # delay to avoid docker permission denied errors
    - az login -i && az acr login --name myregistry
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful
setup_commands: []
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}
My CPU Docker image is defined as follows (the GPU image is defined the same way, replacing ray-ml:1.5.0-py38-cpu with ray-ml:1.5.0-py38-gpu):
FROM rayproject/ray-ml:1.5.0-py38-cpu
COPY requirements.txt ./
RUN sudo apt-get update && sudo apt-get install -y curl gnupg lsb-core && \
    curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add - && \
    echo "deb [arch=amd64] https://packages.microsoft.com/ubuntu/$(lsb_release -sr)/prod $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/mssql-release.list && \
    sudo apt-get update && \
    sudo ACCEPT_EULA=Y apt-get install -y msodbcsql17 mssql-tools unixodbc-dev build-essential unixodbc
RUN pip install --upgrade -r requirements.txt
- [x] I have verified my script runs in a clean environment and reproduces the issue.
About this issue
- State: closed
- Created 3 years ago
- Comments: 20 (12 by maintainers)
You would be able to toggle this via the TUNE_SYNC_DISABLE_BOOTSTRAP environment variable.
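As a rough illustration of how such a toggle might be used, the sketch below builds an environment for the tuning process with TUNE_SYNC_DISABLE_BOOTSTRAP set. The value "1" is an assumption (the comment above does not say which values are accepted), and the helper `tune_env` is hypothetical, not a Ray API.

```python
import os

ENV_VAR = "TUNE_SYNC_DISABLE_BOOTSTRAP"


def tune_env(disable_bootstrap=True, base=None):
    """Return a copy of the environment with the bootstrap toggle applied.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    if disable_bootstrap:
        # Assumption: any non-empty value such as "1" enables the toggle.
        env[ENV_VAR] = "1"
    else:
        env.pop(ENV_VAR, None)
    return env
```

The result could then be passed to the process that runs the experiment, for example `subprocess.run([sys.executable, "tune_script.py"], env=tune_env())`, where `tune_script.py` stands in for the actual tuning script. The variable needs to be set in the environment of the process that performs the sync, which here is the one started on the head node.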