clearml: Docker-Agent Stuck

Describe the bug

I am trying to set up a self-hosted ClearML server, with a Docker-mode agent running on the same machine. When I enqueue a task, the runner gets stuck indefinitely on this step:

Running Docker: Executing:

['docker', 'run', '-t',
 '-v', '/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners:/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners',
 '-e', 'SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners',
 '-l', 'clearml-worker-id=AshaalL02:cpu:0',
 '-l', 'clearml-parent-worker-id=AshaalL02:cpu:0',
 '-e', 'CLEARML_WORKER_ID=AshaalL02:cpu:0',
 '-e', 'CLEARML_DOCKER_IMAGE=python:3.9-bullseye',
 '-e', 'CLEARML_TASK_ID=5fc9dfa25cd44f9790bbb8df0d2e7b23',
 '-v', '/Users/abdulraheemshaal/.gitconfig:/root/.gitconfig',
 '-v', '/var/folders/xm/27jjjrp13y9bq3657smh4c780000gp/T/.clearml_agent.yuogvi0z.cfg:/tmp/clearml.conf',
 '-e', 'CLEARML_CONFIG_FILE=/tmp/clearml.conf',
 '-v', '/Users/abdulraheemshaal/.clearml/apt-cache:/var/cache/apt/archives',
 '-v', '/Users/abdulraheemshaal/.clearml/pip-cache:/root/.cache/pip',
 '-v', '/Users/abdulraheemshaal/.clearml/pip-download-cache:/root/.clearml/pip-download-cache',
 '-v', '/Users/abdulraheemshaal/.clearml/cache:/clearml_agent_cache',
 '-v', '/Users/abdulraheemshaal/.clearml/vcs-cache:/root/.clearml/vcs-cache',
 '-v', '/Users/abdulraheemshaal/.clearml/venvs-cache:/root/.clearml/venvs-cache',
 '--rm', 'python:3.9-bullseye', 'bash', '-c',
 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ;
  chown -R root /root/.cache/pip ;
  export DEBIAN_FRONTEND=noninteractive ;
  export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ;
  [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ;
  declare LOCAL_PYTHON ;
  [ ! -z $LOCAL_PYTHON ] || for i in {15..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ;
  [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ;
  [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update -y ; apt-get install -y $CLEARML_APT_INSTALL) ;
  [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ;
  $LOCAL_PYTHON -m pip install -U "pip<20.2 ; python_version < '3.10'" "pip<22.3 ; python_version >= '3.10'" ;
  $LOCAL_PYTHON -m pip install -U clearml-agent ;
  echo 'we reached here' ;
  cp /tmp/clearml.conf ~/default_clearml.conf ;
  NVIDIA_VISIBLE_DEVICES=none $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 5fc9dfa25cd44f9790bbb8df0d2e7b23']

I did check whether a Docker container is running with docker ps, and I do see one, with its logs stuck at:

pip 22.0.4 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)
Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:2 http://deb.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://deb.debian.org/debian bullseye/main arm64 Packages [8072 kB]
Get:5 http://deb.debian.org/debian-security bullseye-security/main arm64 Packages [233 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main arm64 Packages [12.0 kB]
Fetched 8525 kB in 3s (2594 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libglib2.0-0 is already the newest version (2.66.8-1).
libglib2.0-0 set to manually installed.
libsm6 is already the newest version (2:1.2.3-1).
libsm6 set to manually installed.
libxext6 is already the newest version (2:1.3.3-1.1).
libxext6 set to manually installed.
libxrender-dev is already the newest version (1:0.9.10-1).
libxrender-dev set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
Ignoring pip: markers 'python_version >= "3.10"' don't match your environment
Collecting pip<20.2
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-20.1.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting clearml-agent
  Using cached clearml_agent-1.5.2-py3-none-any.whl (401 kB)
Collecting jsonschema<5.0.0,>=2.6.0
  Using cached jsonschema-4.17.3-py3-none-any.whl (90 kB)
Collecting attrs<23.0.0,>=18.0
  Using cached attrs-22.2.0-py3-none-any.whl (60 kB)
Processing /root/.cache/pip/wheels/74/d1/7d/d9ae7d9aea0f1cebed73f37868df7b5f3333e7f30163b3e558/psutil-5.9.5-cp39-abi3-linux_aarch64.whl
Collecting python-dateutil<2.9.0,>=2.4.2
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pyjwt<2.7.0,>=2.4.0
  Using cached PyJWT-2.6.0-py3-none-any.whl (20 kB)
Collecting pyparsing<3.1.0,>=2.0.3
  Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Collecting PyYAML<6.1,>=3.12
  Using cached PyYAML-6.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (731 kB)
Collecting pathlib2<2.4.0,>=2.3.0
  Using cached pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting virtualenv<21,>=16
  Using cached virtualenv-20.22.0-py3-none-any.whl (3.2 MB)
Collecting furl<2.2.0,>=2.0.0
  Using cached furl-2.1.3-py2.py3-none-any.whl (20 kB)
Collecting requests<2.29.0,>=2.20.0
  Using cached requests-2.28.2-py3-none-any.whl (62 kB)
Collecting urllib3<1.27.0,>=1.21.1
  Using cached urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting six<1.17.0,>=1.13.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0
  Using cached pyrsistent-0.19.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (117 kB)
Collecting distlib<1,>=0.3.6
  Using cached distlib-0.3.6-py2.py3-none-any.whl (468 kB)
Collecting filelock<4,>=3.11
  Using cached filelock-3.12.0-py3-none-any.whl (10 kB)
Collecting platformdirs<4,>=3.2
  Using cached platformdirs-3.2.0-py3-none-any.whl (14 kB)
Collecting orderedmultidict>=1.0.1
  Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (196 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2022.12.7-py3-none-any.whl (155 kB)
Installing collected packages: attrs, pyrsistent, jsonschema, psutil, six, python-dateutil, pyjwt, pyparsing, PyYAML, pathlib2, distlib, filelock, platformdirs, virtualenv, orderedmultidict, furl, idna, charset-normalizer, urllib3, certifi, requests, clearml-agent
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-3.2.0 psutil-5.9.5 pyjwt-2.6.0 pyparsing-3.0.9 pyrsistent-0.19.3 python-dateutil-2.8.2 requests-2.28.2 six-1.16.0 urllib3-1.26.15 virtualenv-20.22.0
WARNING: You are using pip version 20.1.1; however, version 23.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.

If I add a custom script to execute, the agent runs it and then hangs.

I tried the same setup with a local Docker-mode agent against the hosted ClearML app, and it worked fine. The issue only happens with my self-hosted deployment.
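One way to narrow this down (a diagnostic sketch, not from the original report) is to check whether the spawned task container can reach the self-hosted API server at all; the apiserver exposes a debug.ping endpoint, which the compose file below also uses as a health check. Assuming the stuck container is the one listed by docker ps and the server listens on the host at port 8008:

docker ps                                    # note the ID of the stuck task container
# from inside the container, try to reach the API server;
# <container-id> and <host-ip> are placeholders for your values
docker exec -it <container-id> curl -sf http://<host-ip>:8008/debug.ping
# if this fails while curl http://localhost:8008/debug.ping works on the host itself,
# the container cannot see services bound to the host's localhost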

This is the docker-compose file for the deployment I am using:

version: "3.6"
services:

  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      CLEARML__services__async_urls_delete__enabled: "true"
    ports:
    - "8008:8008"
    networks:
      - backend
      - frontend

  elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
      ELASTIC_PASSWORD: ${ELASTIC_PASSWORD}
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.zen.minimum_master_nodes: "1"
      discovery.type: "single-node"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: clearml
      reindex.remote.whitelist: '*.*'
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.7
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
      - /usr/share/elasticsearch/logs

  fileserver:
    networks:
      - backend
      - frontend
    command:
    - fileserver
    container_name: clearml-fileserver
    image: allegroai/clearml:latest
    environment:
      CLEARML__fileserver__delete__allow_batch: "true"
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
    ports:
    - "8081:8081"

  mongo:
    networks:
      - backend
    container_name: clearml-mongo
    image: mongo:4.4.9
    restart: unless-stopped
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
    volumes:
    - /opt/clearml/data/mongo_4/db:/data/db
    - /opt/clearml/data/mongo_4/configdb:/data/configdb

  redis:
    networks:
      - backend
    container_name: clearml-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /opt/clearml/data/redis:/data

  webserver:
    command:
    - webserver
    container_name: clearml-webserver
    # environment:
    #  CLEARML_SERVER_SUB_PATH : clearml-web # Allow Clearml to be served with a URL path prefix.
    image: allegroai/clearml:latest
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
    - "8080:80"
    networks:
      - backend
      - frontend

  async_delete:
    depends_on:
      - apiserver
      - redis
      - mongo
      - elasticsearch
      - fileserver
    container_name: async_delete
    image: allegroai/clearml:latest
    networks:
      - backend
    restart: unless-stopped
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      PYTHONPATH: /opt/clearml/apiserver
      CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
    entrypoint:
      - python3
      - -m
      - jobs.async_urls_delete
      - --fileserver-host
      - http://fileserver:8081
    volumes:
      - /opt/clearml/logs:/var/log/clearml

  agent-services:
    networks:
      - backend
    container_name: clearml-agent-services
    image: allegroai/clearml-agent-services:latest
    deploy:
      restart_policy:
        condition: on-failure
    privileged: true
    environment:
      CLEARML_HOST_IP: http://apiserver:8008
      CLEARML_WEB_HOST: http://webserver:8080
      CLEARML_API_HOST: http://apiserver:8008
      CLEARML_FILES_HOST: http://fileserver:8081
      CLEARML_API_ACCESS_KEY: 0N5ZH1KE4IP569EUBSFC
      CLEARML_API_SECRET_KEY: JulUROVcu94KiyLzGDFAQIYY2yYR8dcOnHTxUdikLthDs98oyk
      CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
      CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
      CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
      CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      CLEARML_WORKER_ID: "clearml-services"
      CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/clearml/agent:/root/.clearml
    depends_on:
      - apiserver
    entrypoint: >
      bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused 'http://apiserver:8008/debug.ping' && /usr/agent/entrypoint.sh"

networks:
  backend:
    driver: bridge
  frontend:
    driver: bridge
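For completeness, the stack is brought up with standard docker-compose usage (the file path here is an assumption; use wherever you saved the file):

docker-compose -f /opt/clearml/docker-compose.yml up -d
# sanity-check that the API server answers on the host
curl http://localhost:8008/debug.ping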

I also tried running the agent as a sudo user; it did not change the outcome. I am completely stuck on this.

To reproduce

  • Create a local deployment on Ubuntu or macOS.
  • Clone any experiment and enqueue it.
  • Create an agent with a Docker configuration (a sample command is sketched below).
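A minimal sketch of the agent command for the last step (the queue name and base image are assumptions, not taken from the original report):

# start a Docker-mode worker pulling from the "default" queue (placeholder name)
clearml-agent daemon --queue default --docker python:3.9-bullseye --foreground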

Expected behaviour

I expect it to run the enqueued task instead of getting stuck, the same way it does with the ClearML app.

Environment

  • Server type: self-hosted
  • ClearML SDK Version:
  • ClearML Server Version (Only for self hosted): 1.10.0-357
  • Python Version: 3.9 and 3.10
  • OS (Windows \ Linux \ Macos): macOS

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Oh right, missed that 🙂 I’ll see what we can do to add that 👍

Figured it out! To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it, since this gives the launched Docker container access to services on the host's localhost.
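For anyone landing here, a sketch of where that argument can go (the queue name and image are placeholders; agent.extra_docker_arguments is the standard clearml.conf key for extra container arguments):

# one option: add the argument to the agent section of ~/clearml.conf,
# so every container the agent spawns gets it:
#   agent {
#     extra_docker_arguments: ["--network=host"]
#   }
# then restart the worker:
clearml-agent daemon --queue default --docker python:3.9-bullseye --foreground

The task's base Docker image field in the UI should also accept arguments after the image name, which is another place the flag can live.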