ray: [tune] rsync fails on azure cluster

What is the problem?

ray 1.5.0

When running Tune on a Ray cluster in Azure, I get "ERROR syncer.py:190 -- Sync execution failed." after every trial.

Reproduction (REQUIRED)

I created a Ray cluster in Azure, attached to the head node, and launched a Tune experiment from there.

After every trial I get the following error:

2021-08-11 13:08:27,356 INFO commands.py:298 -- Checking Azure environment settings                                                                                                    
2021-08-11 13:08:27,364 ERROR syncer.py:190 -- Sync execution failed.                                                                                                                  
Traceback (most recent call last):                                                                                                                                                     
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 186, in sync_down                                                                                    
    result = self.sync_client.sync_down(self._remote_path,                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/integration/docker.py", line 102, in sync_down                                                                        
    rsync(                                                                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/sdk.py", line 140, in rsync                                                                                     
    return commands.rsync(                                                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 1070, in rsync                                                                      
    config = _bootstrap_config(config, no_config_cache=no_config_cache)                                                                                                                
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 315, in _bootstrap_config                                                           
    resolved_config = provider_cls.bootstrap_config(config)                                                                                                                            
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 309, in bootstrap_config                                                
    return bootstrap_azure(cluster_config)                                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 22, in bootstrap_azure                                                         
    config = _configure_resource_group(config)                                                                                                                                         
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 37, in _configure_resource_group                                               
    resource_client = _get_client(ResourceManagementClient, config)                                                                                                                    
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/_azure/config.py", line 31, in _get_client                                                             
    return get_client_from_cli_profile(client_class=client_class, **kwargs)                                                                                                            
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/client_factory.py", line 83, in get_client_from_cli_profile                                                       
    credentials, subscription_id, tenant_id = get_azure_cli_credentials(                                                                                                               
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/common/credentials.py", line 98, in get_azure_cli_credentials                                                            
    cred, subscription_id, tenant_id = profile.get_login_credentials(resource=resource)                                                                                                
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 546, in get_login_credentials                                                                
    account = self.get_subscription(subscription_id)                                                                                                                                   
  File "/home/ray/anaconda3/lib/python3.8/site-packages/azure/cli/core/_profile.py", line 505, in get_subscription                                                                     
    raise CLIError(_AZ_LOGIN_MESSAGE)                                                                                                                                                  
knack.util.CLIError: Please run 'az login' to setup account.                                                                                                                           
2021-08-11 13:08:27,365 INFO logger.py:697 -- Removed the following hyperparameter values when logging to tensorboard: {'hyperparameters/model_layers_intercept': (64, 128, 32), 'hyperparameters/model_layers_slope': (128, 256, 64)}
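The traceback ends in the Azure CLI raising "Please run 'az login' to setup account." from the Python environment under /home/ray/anaconda3, that is, the environment in which Tune's syncer runs. As a rough diagnostic, the sketch below checks whether the Azure CLI in the current environment has an active login by running `az account show`, which exits with a non-zero status when no account is set up. The helper name `cli_login_ok` is hypothetical and not part of Ray or the Azure SDK; it is meant to be run in the same environment as the tuning script.

```python
import subprocess
from typing import Sequence


def cli_login_ok(cmd: Sequence[str] = ("az", "account", "show")) -> bool:
    """Return True if `cmd` can be started and exits with status 0.

    Hypothetical helper for illustration. With the default command it
    reports whether the Azure CLI in this environment has an active login.
    """
    try:
        completed = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,  # only the exit status matters here
            stderr=subprocess.DEVNULL,
            timeout=60,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The executable is missing, not runnable, or it hung.
        return False
    return completed.returncode == 0
```

Calling `cli_login_ok()` from a Python shell in the environment that runs the tuning script and getting `False` would be consistent with the CLIError in the log above.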

This is my cluster yaml file:

cluster_name: my-private-cluster

max_workers: 10
target_utilization_fraction: 0.8

idle_timeout_minutes: 30

docker:
    head_image: "myregistry.azurecr.io/custom-ray-ml-cpu:latest"
    worker_image: "myregistry.azurecr.io/custom-ray-ml-gpu:latest"
    container_name: "ray_py38_1.5.0_gpu"
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: westeurope
    resource_group: my-group-cluster
    # set subscription id otherwise the default from az cli will be used
    subscription_id: xxxxxx-xxxxxx-xxxxxx-xxxx-xxxx  # (masked subscription id)

auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

available_node_types:
    node_cpu_2:
        min_workers: 0
        max_workers: 3
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: 21.07.12
    node_gpu_1_cpu_4:
        min_workers: 1
        max_workers: 2
        resources: { "CPU": 4, "GPU": 1 }
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC4as_T4_v3
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: 21.07.12

head_node_type: node_cpu_2

file_mounts:
    ~/.ssh/id_rsa.pub: "~/.ssh/id_rsa.pub"
    ~/my-project: "."

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"
    - ".venv"
    - ".venv/**"

rsync_filter:
    - ".gitignore"

initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    - az login -i && az acr login --name myregistry
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

My CPU Docker image is defined as follows (the GPU image is defined the same way, replacing ray-ml:1.5.0-py38-cpu with ray-ml:1.5.0-py38-gpu):

FROM rayproject/ray-ml:1.5.0-py38-cpu

COPY requirements.txt ./

RUN sudo apt-get update && sudo apt-get install -y curl gnupg lsb-core && \
    curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add - && \
    echo "deb [arch=amd64] https://packages.microsoft.com/ubuntu/$(lsb_release -sr)/prod $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/mssql-release.list && \
    sudo apt-get update && \
    sudo ACCEPT_EULA=Y apt-get install -y msodbcsql17 mssql-tools unixodbc-dev build-essential unixodbc

RUN pip install --upgrade -r requirements.txt

  • [x] I have verified my script runs in a clean environment and reproduces the issue.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (12 by maintainers)

Most upvoted comments

You would be able to toggle this via the TUNE_SYNC_DISABLE_BOOTSTRAP environment variable.
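A minimal sketch of how that toggle might be set from the tuning script, under stated assumptions: the variable name comes from the comment above, but the value "1" and the requirement that it be set before the experiment starts are assumptions, and the variable may only be honored by Ray versions newer than the 1.5.0 used in this report. The helper name `disable_sync_bootstrap` is hypothetical.

```python
import os


def disable_sync_bootstrap(value: str = "1") -> None:
    """Set TUNE_SYNC_DISABLE_BOOTSTRAP for the current process.

    Hypothetical helper; the value "1" is an assumption, not a value
    confirmed in this thread.
    """
    os.environ["TUNE_SYNC_DISABLE_BOOTSTRAP"] = value


# Call this at the top of the tuning script, before starting the experiment.
disable_sync_bootstrap()
```

Exporting the variable in the shell on the head node before launching the script should have the same effect as setting it in Python.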