spotty: GCP: 30 minutes for runtimeconfig.v1beta1.waiter Timeout expired

On GCP, I am using a spotty.yaml that previously worked but no longer does. I suspect that some sort of timeout is happening because the volume is large (2 TB).

Running spotty start takes about 31 minutes and then fails with the following error:

Waiting for the stack to be created...
  - launching the instance...
  - running the Docker container...
  Error:
  ------
  Deployment "spotty-instance-hearpreprocess-hearpreprocess-i2-joseph" failed.
  Error: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}

Here is my config:


project:
  name: hearpreprocess
  syncFilters:
    - exclude:
        - '*/__pycache__/*'
        - .git/*
        - .idea/*
        - .mypy_cache/*
        - _workdir/*
        - hear-2021*.tar.gz
        - hear-2021*/*
        - hearpreprocess.egg-info/*
        - tasks/*

containers:
  - projectDir: /workspace/project
    image: turian/hearpreprocess
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: hearpreprocess-i2-joseph
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: n1-standard-8
      preemptibleInstance: False
      gpu:
        type: nvidia-tesla-v100
        count: 1
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      volumes:
        - name: workspace
          parameters:
            size: 2000

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19

Most upvoted comments

Using the docker ps -a command, I saw that the container exits on its own with exit code 137, which usually means it was killed for running out of memory (OOM). At first I suspected a heavy Docker image that needs a lot of memory, but the tensorflow/tensorflow image failed with the same error, so it was clear that this was not OOM. I then searched for other causes and found an issue where people had hit the same problem a couple of years ago. They solved it by updating containerd to the latest version, so I tried a newer version as well, and it worked 😃.
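
For anyone repeating this diagnosis, here is a minimal Python sketch of the reasoning, not something taken from spotty itself. It assumes the usual convention that an exit code above 128 means "killed by signal (code - 128)", and that the JSON printed by docker inspect carries a State.OOMKilled flag. The helper names and the sample JSON are hypothetical and only for illustration.

```python
import json
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number is not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with error status {code}"


def was_oom_killed(inspect_output: str) -> bool:
    """Read State.OOMKilled from the JSON array printed by `docker inspect <container>`."""
    state = json.loads(inspect_output)[0]["State"]
    return bool(state.get("OOMKilled", False))


# Hypothetical, heavily trimmed `docker inspect` output for a container
# that exited with code 137 without being flagged as OOM-killed.
sample = '[{"State": {"Status": "exited", "ExitCode": 137, "OOMKilled": false}}]'

print(describe_exit_code(137))
print(was_oom_killed(sample))
```

On Linux, 137 decodes to SIGKILL, which is what the OOM killer sends but is not exclusive to it. An exit code of 137 with OOMKilled set to false therefore points away from memory pressure and toward something else killing the container, consistent with the containerd fix described above.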

Glad it worked for you, I’m closing the issue then. Feel free to reopen if it happens again.