skypilot: [Serve] GCP credential path error with docker image and replica

Can you help me launch sky serve autoscaling with a Docker image?

The launch command is:

sky serve up -n {service name} --env-file {env file path} service.yaml

service.yaml is:

# service.yaml
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 4
    target_qps_per_replica: 3
    upscale_delay_seconds: 180
    downscale_delay_seconds: 900

# Fields below describe each replica.
resources:
  cloud: GCP
  ports: 8000
  accelerators: L4

workdir: .

setup: docker login -u ${DOCKER_ID} -p ${DOCKER_PW} {docker image repository}

run: docker run -v ~/models/:/usr/app/models -p 8000:8000 -e ENV=prod  --runtime=nvidia --gpus all {docker image path}

The error occurs when the replica is provisioned. It may be caused by a missing GCP credential file.

I 03-18 05:34:02 replica_managers.py:118] Failed to launch the sky serve replica cluster with error: subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-a20c2158' returned non-zero exit status 1.)
I 03-18 05:34:02 replica_managers.py:121]   Traceback: Traceback (most recent call last):
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 95, in launch_cluster
I 03-18 05:34:02 replica_managers.py:121]     sky.launch(task,
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 501, in launch
I 03-18 05:34:02 replica_managers.py:121]     return _execute(
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 334, in _execute
I 03-18 05:34:02 replica_managers.py:121]     backend.sync_file_mounts(handle, task.file_mounts,
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 349, in _record
I 03-18 05:34:02 replica_managers.py:121]     return f(*args, **kwargs)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/backend.py", line 73, in sync_file_mounts
I 03-18 05:34:02 replica_managers.py:121]     return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2990, in _sync_file_mounts
I 03-18 05:34:02 replica_managers.py:121]     self._execute_file_mounts(handle, all_file_mounts)
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4341, in _execute_file_mounts
I 03-18 05:34:02 replica_managers.py:121]     if storage.is_directory(src):
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/cloud_stores.py", line 116, in is_directory
I 03-18 05:34:02 replica_managers.py:121]     p = subprocess.run(command,
I 03-18 05:34:02 replica_managers.py:121]   File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
I 03-18 05:34:02 replica_managers.py:121]     raise CalledProcessError(retcode, process.args,
I 03-18 05:34:02 replica_managers.py:121] subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-a20c2158' returned non-zero exit status 1.

About this issue

  • State: open
  • Created 3 months ago
  • Comments: 16

Most upvoted comments

Hi @sean-styleai! Thanks for reporting the issue. Could you try running sky launch directly on this YAML to see if the error persists? Also, could you share the output of sky status from your local laptop (for more information on the SkyServe controller spec)?
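
Alongside those two checks, it may help to confirm whether the credential file named in the failing command actually exists. The sketch below is only a diagnostic aid: the path is copied from the GOOGLE_APPLICATION_CREDENTIALS value in the log, while the function name check_gcp_credentials is hypothetical and not part of SkyPilot. It assumes you run it on whichever machine executed the failing gsutil command (the /opt/conda paths in the traceback suggest the SkyServe controller, but that is an assumption).

```shell
#!/usr/bin/env bash
# Path copied from the failing command in the log above.
cred_path="${HOME}/.config/gcloud/application_default_credentials.json"

# Hypothetical helper: reports whether the given credential file exists.
# Prints "found: <path>" and returns 0, or "missing: <path>" and returns 1.
check_gcp_credentials() {
  if [ -f "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# "|| true" keeps the script going even when the file is absent.
check_gcp_credentials "$cred_path" || true
```

A "missing" result would be consistent with the guess in the report, but it would not prove it, since the gsutil ls call can also fail for other reasons such as bucket permissions.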