clearml: Docker-Agent Stuck
Describe the bug
I am trying to set up a self-hosted ClearML server, with a docker-mode agent running on the same machine. When I enqueue a task, the runner gets stuck indefinitely on this step:
Running Docker: Executing: ['docker', 'run', '-t', '-v', '/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners:/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners', '-e', 'SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners', '-l', 'clearml-worker-id=AshaalL02:cpu:0', '-l', 'clearml-parent-worker-id=AshaalL02:cpu:0', '-e', 'CLEARML_WORKER_ID=AshaalL02:cpu:0', '-e', 'CLEARML_DOCKER_IMAGE=python:3.9-bullseye', '-e', 'CLEARML_TASK_ID=5fc9dfa25cd44f9790bbb8df0d2e7b23', '-v', '/Users/abdulraheemshaal/.gitconfig:/root/.gitconfig', '-v', '/var/folders/xm/27jjjrp13y9bq3657smh4c780000gp/T/.clearml_agent.yuogvi0z.cfg:/tmp/clearml.conf', '-e', 'CLEARML_CONFIG_FILE=/tmp/clearml.conf', '-v', '/Users/abdulraheemshaal/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/Users/abdulraheemshaal/.clearml/pip-cache:/root/.cache/pip', '-v', '/Users/abdulraheemshaal/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/Users/abdulraheemshaal/.clearml/cache:/clearml_agent_cache', '-v', '/Users/abdulraheemshaal/.clearml/vcs-cache:/root/.clearml/vcs-cache', '-v', '/Users/abdulraheemshaal/.clearml/venvs-cache:/root/.clearml/venvs-cache', '--rm', 'python:3.9-bullseye', 'bash', '-c', 'echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; [ ! -z $LOCAL_PYTHON ] || for i in {15..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update -y ; apt-get install -y $CLEARML_APT_INSTALL) ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2 ; python_version < '3.10'" "pip<22.3 ; python_version >= '3.10'" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; echo 'we reached here' ; cp /tmp/clearml.conf ~/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=none $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 5fc9dfa25cd44f9790bbb8df0d2e7b23']

I checked with docker ps whether a docker container is running, and I do see one, with its logs stuck at:

pip 22.0.4 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)
Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:2 http://deb.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://deb.debian.org/debian bullseye/main arm64 Packages [8072 kB]
Get:5 http://deb.debian.org/debian-security bullseye-security/main arm64 Packages [233 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main arm64 Packages [12.0 kB]
Fetched 8525 kB in 3s (2594 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libglib2.0-0 is already the newest version (2.66.8-1).
libglib2.0-0 set to manually installed.
libsm6 is already the newest version (2:1.2.3-1).
libsm6 set to manually installed.
libxext6 is already the newest version (2:1.3.3-1.1).
libxext6 set to manually installed.
libxrender-dev is already the newest version (1:0.9.10-1).
libxrender-dev set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
Ignoring pip: markers 'python_version >= "3.10"' don't match your environment
Collecting pip<20.2
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-20.1.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting clearml-agent
  Using cached clearml_agent-1.5.2-py3-none-any.whl (401 kB)
Collecting jsonschema<5.0.0,>=2.6.0
  Using cached jsonschema-4.17.3-py3-none-any.whl (90 kB)
Collecting attrs<23.0.0,>=18.0
  Using cached attrs-22.2.0-py3-none-any.whl (60 kB)
Processing /root/.cache/pip/wheels/74/d1/7d/d9ae7d9aea0f1cebed73f37868df7b5f3333e7f30163b3e558/psutil-5.9.5-cp39-abi3-linux_aarch64.whl
Collecting python-dateutil<2.9.0,>=2.4.2
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pyjwt<2.7.0,>=2.4.0
  Using cached PyJWT-2.6.0-py3-none-any.whl (20 kB)
Collecting pyparsing<3.1.0,>=2.0.3
  Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Collecting PyYAML<6.1,>=3.12
  Using cached PyYAML-6.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (731 kB)
Collecting pathlib2<2.4.0,>=2.3.0
  Using cached pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting virtualenv<21,>=16
  Using cached virtualenv-20.22.0-py3-none-any.whl (3.2 MB)
Collecting furl<2.2.0,>=2.0.0
  Using cached furl-2.1.3-py2.py3-none-any.whl (20 kB)
Collecting requests<2.29.0,>=2.20.0
  Using cached requests-2.28.2-py3-none-any.whl (62 kB)
Collecting urllib3<1.27.0,>=1.21.1
  Using cached urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting six<1.17.0,>=1.13.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0
  Using cached pyrsistent-0.19.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (117 kB)
Collecting distlib<1,>=0.3.6
  Using cached distlib-0.3.6-py2.py3-none-any.whl (468 kB)
Collecting filelock<4,>=3.11
  Using cached filelock-3.12.0-py3-none-any.whl (10 kB)
Collecting platformdirs<4,>=3.2
  Using cached platformdirs-3.2.0-py3-none-any.whl (14 kB)
Collecting orderedmultidict>=1.0.1
  Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (196 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2022.12.7-py3-none-any.whl (155 kB)
Installing collected packages: attrs, pyrsistent, jsonschema, psutil, six, python-dateutil, pyjwt, pyparsing, PyYAML, pathlib2, distlib, filelock, platformdirs, virtualenv, orderedmultidict, furl, idna, charset-normalizer, urllib3, certifi, requests, clearml-agent
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-3.2.0 psutil-5.9.5 pyjwt-2.6.0 pyparsing-3.0.9 pyrsistent-0.19.3 python-dateutil-2.8.2 requests-2.28.2 six-1.16.0 urllib3-1.26.15 virtualenv-20.22.0
WARNING: You are using pip version 20.1.1; however, version 23.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.
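For anyone debugging something similar, the container the agent spawns can be inspected directly. A minimal sketch (the container ID is a placeholder):

# List running containers to find the one the agent launched
docker ps
# Follow its logs to see exactly where it stalls
docker logs --follow <container-id>
# Optionally open a shell inside the stuck container
docker exec -it <container-id> bash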
If I add a custom script to execute, the agent runs the script and then hangs.
I tried the same thing with a local docker agent pointed at the hosted ClearML app, and it worked fine. The issue only happens with my self-hosted deployment.
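For reference, a docker-mode agent of this kind is typically started with something like the following; the queue name and base image here are illustrative, not necessarily the exact values used:

# Sketch: run a clearml-agent worker in docker mode, pulling from the "default" queue
clearml-agent daemon --queue default --docker python:3.9-bullseye --foreground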
This is the docker-compose file for the deployment I am using:
version: "3.6"
services:
apiserver:
command:
- apiserver
container_name: clearml-apiserver
image: allegroai/clearml:latest
restart: unless-stopped
volumes:
- /opt/clearml/logs:/var/log/clearml
- /opt/clearml/config:/opt/clearml/config
- /opt/clearml/data/fileserver:/mnt/fileserver
depends_on:
- redis
- mongo
- elasticsearch
- fileserver
environment:
CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
CLEARML_ELASTIC_SERVICE_PORT: 9200
CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
CLEARML_MONGODB_SERVICE_HOST: mongo
CLEARML_MONGODB_SERVICE_PORT: 27017
CLEARML_REDIS_SERVICE_HOST: redis
CLEARML_REDIS_SERVICE_PORT: 6379
CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
CLEARML__apiserver__pre_populate__enabled: "true"
CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
CLEARML__services__async_urls_delete__enabled: "true"
ports:
- "8008:8008"
networks:
- backend
- frontend
elasticsearch:
networks:
- backend
container_name: clearml-elastic
environment:
ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
ELASTIC_PASSWORD: ${ELASTIC_PASSWORD}
bootstrap.memory_lock: "true"
cluster.name: clearml
cluster.routing.allocation.node_initial_primaries_recoveries: "500"
cluster.routing.allocation.disk.watermark.low: 500mb
cluster.routing.allocation.disk.watermark.high: 500mb
cluster.routing.allocation.disk.watermark.flood_stage: 500mb
discovery.zen.minimum_master_nodes: "1"
discovery.type: "single-node"
http.compression_level: "7"
node.ingest: "true"
node.name: clearml
reindex.remote.whitelist: '*.*'
xpack.monitoring.enabled: "false"
xpack.security.enabled: "false"
ulimits:
memlock:
soft: -1
hard: -1
nofile:
soft: 65536
hard: 65536
image: docker.elastic.co/elasticsearch/elasticsearch:7.17.7
restart: unless-stopped
volumes:
- /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
- /usr/share/elasticsearch/logs
fileserver:
networks:
- backend
- frontend
command:
- fileserver
container_name: clearml-fileserver
image: allegroai/clearml:latest
environment:
CLEARML__fileserver__delete__allow_batch: "true"
restart: unless-stopped
volumes:
- /opt/clearml/logs:/var/log/clearml
- /opt/clearml/data/fileserver:/mnt/fileserver
- /opt/clearml/config:/opt/clearml/config
ports:
- "8081:8081"
mongo:
networks:
- backend
container_name: clearml-mongo
image: mongo:4.4.9
restart: unless-stopped
command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
volumes:
- /opt/clearml/data/mongo_4/db:/data/db
- /opt/clearml/data/mongo_4/configdb:/data/configdb
redis:
networks:
- backend
container_name: clearml-redis
image: redis:5.0
restart: unless-stopped
volumes:
- /opt/clearml/data/redis:/data
webserver:
command:
- webserver
container_name: clearml-webserver
# environment:
# CLEARML_SERVER_SUB_PATH : clearml-web # Allow Clearml to be served with a URL path prefix.
image: allegroai/clearml:latest
restart: unless-stopped
depends_on:
- apiserver
ports:
- "8080:80"
networks:
- backend
- frontend
async_delete:
depends_on:
- apiserver
- redis
- mongo
- elasticsearch
- fileserver
container_name: async_delete
image: allegroai/clearml:latest
networks:
- backend
restart: unless-stopped
environment:
CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
CLEARML_ELASTIC_SERVICE_PORT: 9200
CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
CLEARML_MONGODB_SERVICE_HOST: mongo
CLEARML_MONGODB_SERVICE_PORT: 27017
CLEARML_REDIS_SERVICE_HOST: redis
CLEARML_REDIS_SERVICE_PORT: 6379
PYTHONPATH: /opt/clearml/apiserver
CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
entrypoint:
- python3
- -m
- jobs.async_urls_delete
- --fileserver-host
- http://fileserver:8081
volumes:
- /opt/clearml/logs:/var/log/clearml
agent-services:
networks:
- backend
container_name: clearml-agent-services
image: allegroai/clearml-agent-services:latest
deploy:
restart_policy:
condition: on-failure
privileged: true
environment:
CLEARML_HOST_IP: http://apiserver:8008
CLEARML_WEB_HOST: http://webserver:8080
CLEARML_API_HOST: http://apiserver:8008
CLEARML_FILES_HOST: http://fileserver:8081
CLEARML_API_ACCESS_KEY: 0N5ZH1KE4IP569EUBSFC
CLEARML_API_SECRET_KEY: JulUROVcu94KiyLzGDFAQIYY2yYR8dcOnHTxUdikLthDs98oyk
CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
CLEARML_WORKER_ID: "clearml-services"
CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
SHUTDOWN_IF_NO_ACCESS_KEY: 1
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /opt/clearml/agent:/root/.clearml
depends_on:
- apiserver
entrypoint: >
bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused 'http://apiserver:8008/debug.ping' && /usr/agent/entrypoint.sh"
networks:
backend:
driver: bridge
frontend:
driver: bridge
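For completeness, the stack is brought up the usual way; the file path and password value below are placeholders:

# Launch the self-hosted server stack (path and password are illustrative)
ELASTIC_PASSWORD=<my-elastic-password> docker-compose -f /opt/clearml/docker-compose.yml up -d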
I also tried running the agent as a sudo user; it did not change the outcome. I am completely stuck on this.
To reproduce
Create a local deployment on Ubuntu or macOS. Create an agent with a docker configuration. Clone any experiment and enqueue it.
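Equivalently, instead of cloning through the UI, a task can be created and enqueued from the command line; a sketch using the clearml-task helper (project, script, and queue names are illustrative):

# Create a task from a local script and send it to the "default" queue (all names are placeholders)
clearml-task --project examples --name repro-test --script train.py --queue default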
Expected behaviour
I expect it to run the enqueued task instead of getting stuck, the same way it does with the ClearML app.
Environment
- Server type: self-hosted
- ClearML SDK Version:
- ClearML Server Version (only for self-hosted): 1.10.0-357
- Python Version: 3.9 and 3.10
- OS (Windows \ Linux \ Macos): macOS
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 15 (6 by maintainers)
Oh right, missed that 🙂 I’ll see what we can do to add that 👍
Figured it out! To run both the agent and the deployment on the same machine, adding --network=host to the docker run arguments solved it, as this gives the launched docker container access to services on localhost.
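For anyone else hitting this, the extra docker argument can also be set once in the agent's clearml.conf rather than per run. A minimal sketch, assuming the standard agent.extra_docker_arguments key:

# In ~/clearml.conf on the agent machine
agent {
    # arguments passed verbatim to every `docker run` the agent issues
    extra_docker_arguments: ["--network=host"]
}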