SMARTS: RayOutOfMemoryError when running ULTRA experiments
Issue
After an ULTRA training experiment (using the baseline DQN policy) has run for about half a day, the program stops with a RayOutOfMemoryError.
Error
2021-02-09 09:09:34,529 ERROR worker.py:987 -- Possible unhandled error from worker: ray::ultra.evaluate.evaluate() (pid=4336, ip=10.208.237.111)
File "python/ray/_raylet.pyx", line 408, in ray._raylet.execute_task
File "/SMARTS/.venv/lib/python3.7/site-packages/ray/memory_monitor.py", line 128, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node gpu-machine is used (30.17 / 31.29 GB). The top 10 memory consumers are:
PID MEM COMMAND
4331 18.49GiB ray::__main__.train()
17726 4.02GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336 3.29GiB ray::IDLE
4644 0.22GiB ray::__main__.train()
4213 0.21GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/tensorboard --logdir_spec=BDQN:logs/experiment-2021.2.
4274 0.09GiB python -u ultra/train.py --task 1 --level easy
4325 0.09GiB ray::IDLE
4335 0.09GiB ray::IDLE
4334 0.09GiB ray::IDLE
4330 0.09GiB ray::IDLE
In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
---
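As a rough aid for reading the table above, the rows can be parsed and grouped by command to see where the memory is concentrated. This is only a sketch: the helper names `parse_consumers` and `total_by_command` are hypothetical, and the only real data is the table text copied from the error message.

```python
import re
from collections import defaultdict

# The "top 10 memory consumers" table, copied verbatim from the error above.
REPORT = """\
PID MEM COMMAND
4331 18.49GiB ray::__main__.train()
17726 4.02GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336 3.29GiB ray::IDLE
4644 0.22GiB ray::__main__.train()
4213 0.21GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/tensorboard --logdir_spec=BDQN:logs/experiment-2021.2.
4274 0.09GiB python -u ultra/train.py --task 1 --level easy
4325 0.09GiB ray::IDLE
4335 0.09GiB ray::IDLE
4334 0.09GiB ray::IDLE
4330 0.09GiB ray::IDLE
"""

# One row: PID, memory in GiB, then the full command line.
ROW = re.compile(r"^(\d+)\s+([\d.]+)GiB\s+(.+)$")


def parse_consumers(report):
    """Return a list of (pid, mem_gib, command) tuples; non-matching lines are skipped."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


def total_by_command(rows):
    """Sum the memory of all rows that share the same command string."""
    totals = defaultdict(float)
    for _pid, mem_gib, command in rows:
        totals[command] += mem_gib
    return dict(totals)


if __name__ == "__main__":
    rows = parse_consumers(REPORT)
    listed_total = sum(mem for _, mem, _ in rows)
    ranked = sorted(total_by_command(rows).items(), key=lambda kv: -kv[1])
    for command, mem_gib in ranked:
        print(f"{mem_gib:6.2f} GiB  {mem_gib / listed_total:5.1%}  {command[:60]}")
```

On this table, PID 4331 (`ray::__main__.train()`) alone holds about 69% of the memory attributed to the ten listed processes, with the Envision server a distant second.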
Will send @Gamenot the full log of the program execution internally as I am unable to upload the log to this public post.
Configuration
The experiment was run in a Docker container with Ubuntu 18.04. The command $ nvidia-smi outputs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 Off | 00000000:01:00.0 Off | N/A |
| 25% 36C P2 34W / 215W | 896MiB / 7979MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The command $ sumo outputs:
Eclipse SUMO sumo Version 1.8.0
Build features: Linux-4.15.0-124-generic x86_64 GNU 7.5.0 Release Proj GUI SWIG GDAL GL2PS
Copyright (C) 2001-2020 German Aerospace Center (DLR) and others; https://sumo.dlr.de
License EPL-2.0: Eclipse Public License Version 2 <https://eclipse.org/legal/epl-v20.html>
Use --help to get the list of options.
In the virtual environment, $ pip list yields the following:
Package Version Location
---------------------- --------- --------
absl-py 0.11.0
aiohttp 3.7.3
apipkg 1.5
argon2-cffi 20.1.0
astunparse 1.6.3
async-timeout 3.0.1
atari-py 0.2.6
attrs 20.3.0
Automat 20.2.0
backcall 0.2.0
beautifulsoup4 4.9.3
bleach 3.2.1
cachetools 4.2.0
certifi 2020.12.5
cffi 1.14.4
chardet 3.0.4
click 7.1.2
cloudpickle 1.3.0
colorama 0.4.4
commonmark 0.9.1
constantly 15.1.0
coverage 5.3.1
cycler 0.10.0
decorator 4.4.2
defusedxml 0.6.0
dill 0.3.3
dm-tree 0.1.5
entrypoints 0.3
evdev 1.4.0
execnet 1.7.1
filelock 3.0.12
future 0.18.2
gast 0.3.3
gitdb 4.0.5
GitPython 3.1.12
google 3.0.0
google-auth 1.24.0
google-auth-oauthlib 0.4.2
google-pasta 0.2.0
grpcio 1.30.0
gym 0.18.0
h5py 2.10.0
hyperlink 21.0.0
idna 2.10
imageio 2.9.0
importlib-metadata 3.4.0
importlib-resources 5.0.0
incremental 17.5.0
iniconfig 1.1.1
ipykernel 5.4.3
ipython 7.19.0
ipython-genutils 0.2.0
jedi 0.18.0
Jinja2 2.11.2
joblib 1.0.0
jsonpatch 1.28
jsonpointer 2.0
jsonschema 3.2.0
jupyter-client 6.1.11
jupyter-core 4.7.0
Keras-Preprocessing 1.1.2
kiwisolver 1.3.1
lz4 3.1.2
Markdown 3.3.3
MarkupSafe 1.1.1
matplotlib 3.3.3
mistune 0.8.4
msgpack 1.0.2
multidict 5.1.0
nbconvert 5.6.1
nbdime 2.1.0
nbformat 5.1.2
networkx 2.5
notebook 6.2.0
numpy 1.18.5
oauthlib 3.1.0
opencv-python 4.5.1.48
opencv-python-headless 4.5.1.48
opt-einsum 3.3.0
packaging 20.8
panda3d 1.10.8
panda3d-gltf 0.12
panda3d-simplepbr 0.7
pandas 1.2.0
pandocfilters 1.4.3
parso 0.8.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
pip 20.3.3
pluggy 0.13.1
prometheus-client 0.9.0
prompt-toolkit 3.0.10
protobuf 3.14.0
psutil 5.8.0
ptyprocess 0.7.0
py 1.10.0
py-cpuinfo 7.0.0
py-spy 0.3.4
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybullet 3.0.8
pycparser 2.20
pyglet 1.5.0
Pygments 2.7.4
PyHamcrest 2.0.2
pynput 1.7.2
pyparsing 2.4.7
pyrsistent 0.17.3
pytest 6.2.1
pytest-benchmark 3.2.3
pytest-cov 2.11.0
pytest-forked 1.3.0
pytest-notebook 0.6.1
pytest-xdist 2.2.0
python-dateutil 2.8.1
python-xlib 0.29
pytz 2020.5
PyWavelets 1.1.1
PyYAML 5.3.1
pyzmq 21.0.1
ray 0.8.6
redis 3.4.1
requests 2.25.1
requests-oauthlib 1.3.0
rich 9.8.2
rsa 4.7
Rtree 0.9.7
scikit-image 0.18.1
scikit-learn 0.24.0
scipy 1.4.1
Send2Trash 1.5.0
setuptools 47.1.0
sh 1.14.1
Shapely 1.7.1
six 1.15.0
sklearn 0.0
smarts 0.4.11 /SMARTS
smmap 3.0.4
soupsieve 2.1
supervisor 4.2.1
tableprint 0.9.1
tabulate 0.8.7
tensorboard 2.2.2
tensorboard-plugin-wit 1.7.0
tensorboardX 2.1
tensorflow 2.2.1
tensorflow-estimator 2.2.0
termcolor 1.1.0
terminado 0.9.2
testpath 0.4.4
threadpoolctl 2.1.0
tifffile 2021.1.14
toml 0.10.2
torch 1.4.0
torchfile 0.1.0
torchvision 0.5.0
tornado 6.1
traitlets 5.0.5
trimesh 3.9.1
Twisted 20.3.0
typing-extensions 3.7.4.3
urllib3 1.26.2
visdom 0.1.8.9
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.57.0
Werkzeug 1.0.1
wheel 0.36.2
wrapt 1.12.1
yarl 1.6.3
yattag 1.14.0
zipp 3.4.0
zope.interface 5.2.0
Steps to Reproduce
- Once the repository is downloaded, SUMO is installed, and the virtual environment is created and activated with the listed packages:
$ cd SMARTS/
$ git checkout master # The current master branch available (latest commit: ebd72d6)
$ scl scenario build-all ultra/scenarios/pool
$ python ultra/scenarios/interface.py generate --task 1 --level easy
$ ./ultra/env/envision_base.sh
- Then go into ultra/train.py and change the line
policy_class = "ultra.baselines.sac:sac-v0"
to
policy_class = "ultra.baselines.dqn:dqn-v0"
- And then finally run the training, redirecting the output to a text file:
$ ray stop
$ nohup python -u ultra/train.py --task 1 --level easy > log.txt &
View the file with
$ tail -f log.txt
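Because the failure only appears after roughly half a day, it may help to log the resident memory of the training process while it runs, alongside log.txt. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/status is available; the function names are hypothetical and none of this is part of ULTRA or Ray.

```python
import re
import time

# /proc/<pid>/status reports resident memory as e.g. "VmRSS:    123456 kB"
# (the kernel's "kB" here means units of 1024 bytes).
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from the text of a status file, or None if absent."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def read_rss_gib(pid):
    """Read the resident set size of process `pid` in GiB (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        rss_kib = parse_vm_rss_kib(status_file.read())
    return None if rss_kib is None else rss_kib / (1024 ** 2)


def watch(pid, interval_s=60.0, samples=10):
    """Print a timestamped RSS reading for `pid` every `interval_s` seconds."""
    for _ in range(samples):
        rss_gib = read_rss_gib(pid)
        shown = "n/a" if rss_gib is None else f"{rss_gib:.2f} GiB"
        print(time.strftime("%H:%M:%S"), pid, shown)
        time.sleep(interval_s)
```

Calling `watch(pid)` with the PID of the `ray::__main__.train()` process (4331 in the error above) would show whether its memory grows steadily over the run or jumps at a particular point.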
Notes
- I do not encounter this error when running the training in headless mode, i.e. when running
nohup python -u ultra/train.py --task 1 --level easy --headless True > log.txt &
- I was running the training in the ULTRA Docker container created from the ULTRA Dockerfile; however, I suspect the error occurs independently of Docker.
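Since the growth is concentrated in the `ray::__main__.train()` process and disappears in headless mode, one generic way to look for the responsible allocation site is to compare `tracemalloc` snapshots taken before and after a number of steps. The sketch below uses a hypothetical `leaky_step` function as a stand-in for the real training loop; it is not ULTRA code. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries may not show up.

```python
import tracemalloc


def top_growth(step, iterations=50, limit=3):
    """Run `step` repeatedly and return the source lines whose allocations grew the most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(iterations):
            step()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries sorted by largest change first.
    return after.compare_to(before, "lineno")[:limit]


# Hypothetical stand-in for a training step that keeps references alive.
_retained = []


def leaky_step():
    _retained.append(bytearray(100_000))  # never released, so memory grows each call


if __name__ == "__main__":
    for stat in top_growth(leaky_step):
        print(stat)
```

Each printed entry names a file and line together with the size difference between the two snapshots, so a line that keeps accumulating memory stands out at the top.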
Let me know if I missed anything or if any other information would be helpful.
About this issue
- State: closed
- Created 3 years ago
- Comments: 21 (16 by maintainers)
Glad it could help! Sure, I can bring that up with the ULTRA team.
Update: Headless mode is now the default (#703).