SMARTS: RayOutOfMemoryError when running ULTRA experiments

Issue

After running an ULTRA training experiment with the baseline DQN policy for about half a day, the program stops with a RayOutOfMemoryError.

Error

2021-02-09 09:09:34,529	ERROR worker.py:987 -- Possible unhandled error from worker: ray::ultra.evaluate.evaluate() (pid=4336, ip=10.208.237.111)
  File "python/ray/_raylet.pyx", line 408, in ray._raylet.execute_task
  File "/SMARTS/.venv/lib/python3.7/site-packages/ray/memory_monitor.py", line 128, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node gpu-machine is used (30.17 / 31.29 GB). The top 10 memory consumers are:

PID	MEM	COMMAND
4331	18.49GiB	ray::__main__.train()
17726	4.02GiB	/SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336	3.29GiB	ray::IDLE
4644	0.22GiB	ray::__main__.train()
4213	0.21GiB	/SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/tensorboard --logdir_spec=BDQN:logs/experiment-2021.2.
4274	0.09GiB	python -u ultra/train.py --task 1 --level easy
4325	0.09GiB	ray::IDLE
4335	0.09GiB	ray::IDLE
4334	0.09GiB	ray::IDLE
4330	0.09GiB	ray::IDLE

In addition, up to 0.04 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
---

Will send @Gamenot the full log of the program execution internally as I am unable to upload the log to this public post.
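
To make the table in the error message easier to read, here is a minimal Python sketch that parses rows in the PID / MEM / COMMAND layout and expresses each process as a share of the node's memory. The helper names (`parse_consumers`, `memory_shares`) are hypothetical and not part of Ray or SMARTS. It also assumes that the node total of 31.29 reported in the message is in the same unit as the table's GiB values.

```python
import re

# Rows copied from the "top 10 memory consumers" table in the error message.
CONSUMERS = """\
4331 18.49GiB ray::__main__.train()
17726 4.02GiB /SMARTS/.venv/bin/python3.7 /SMARTS/.venv/bin/scl envision start -s ./ultra/scenarios -p 8081
4336 3.29GiB ray::IDLE
4644 0.22GiB ray::__main__.train()
"""

# Assumption: the "31.29 GB" node total uses the same unit as the table.
NODE_TOTAL = 31.29

# One row: PID, memory followed by "GiB", then the full command line.
ROW = re.compile(r"^\s*(\d+)\s+([\d.]+)GiB\s+(.*)$")


def parse_consumers(text):
    """Return a list of (pid, mem_gib, command) tuples, skipping other lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


def memory_shares(rows, total):
    """Return (pid, fraction_of_total, command) for each parsed row."""
    return [(pid, mem / total, cmd) for pid, mem, cmd in rows]


for pid, share, cmd in memory_shares(parse_consumers(CONSUMERS), NODE_TOTAL):
    print(f"{pid:>6} {share:6.1%} {cmd}")
```

Under that assumption, the single ray::__main__.train() worker (PID 4331) accounts for roughly 59% of the node's memory, which makes it the first place to look for the leak.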

Configuration

The experiment was run in a Docker container with Ubuntu 18.04. nvidia-smi outputs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0 Off |                  N/A |
| 25%   36C    P2    34W / 215W |    896MiB /  7979MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The command $ sumo outputs:

Eclipse SUMO sumo Version 1.8.0
 Build features: Linux-4.15.0-124-generic x86_64 GNU 7.5.0 Release Proj GUI SWIG GDAL GL2PS
 Copyright (C) 2001-2020 German Aerospace Center (DLR) and others; https://sumo.dlr.de
 License EPL-2.0: Eclipse Public License Version 2 <https://eclipse.org/legal/epl-v20.html>
 Use --help to get the list of options.

In the virtual environment, $ pip list yields the following:

Package                Version   Location
---------------------- --------- --------
absl-py                0.11.0
aiohttp                3.7.3
apipkg                 1.5
argon2-cffi            20.1.0
astunparse             1.6.3
async-timeout          3.0.1
atari-py               0.2.6
attrs                  20.3.0
Automat                20.2.0
backcall               0.2.0
beautifulsoup4         4.9.3
bleach                 3.2.1
cachetools             4.2.0
certifi                2020.12.5
cffi                   1.14.4
chardet                3.0.4
click                  7.1.2
cloudpickle            1.3.0
colorama               0.4.4
commonmark             0.9.1
constantly             15.1.0
coverage               5.3.1
cycler                 0.10.0
decorator              4.4.2
defusedxml             0.6.0
dill                   0.3.3
dm-tree                0.1.5
entrypoints            0.3
evdev                  1.4.0
execnet                1.7.1
filelock               3.0.12
future                 0.18.2
gast                   0.3.3
gitdb                  4.0.5
GitPython              3.1.12
google                 3.0.0
google-auth            1.24.0
google-auth-oauthlib   0.4.2
google-pasta           0.2.0
grpcio                 1.30.0
gym                    0.18.0
h5py                   2.10.0
hyperlink              21.0.0
idna                   2.10
imageio                2.9.0
importlib-metadata     3.4.0
importlib-resources    5.0.0
incremental            17.5.0
iniconfig              1.1.1
ipykernel              5.4.3
ipython                7.19.0
ipython-genutils       0.2.0
jedi                   0.18.0
Jinja2                 2.11.2
joblib                 1.0.0
jsonpatch              1.28
jsonpointer            2.0
jsonschema             3.2.0
jupyter-client         6.1.11
jupyter-core           4.7.0
Keras-Preprocessing    1.1.2
kiwisolver             1.3.1
lz4                    3.1.2
Markdown               3.3.3
MarkupSafe             1.1.1
matplotlib             3.3.3
mistune                0.8.4
msgpack                1.0.2
multidict              5.1.0
nbconvert              5.6.1
nbdime                 2.1.0
nbformat               5.1.2
networkx               2.5
notebook               6.2.0
numpy                  1.18.5
oauthlib               3.1.0
opencv-python          4.5.1.48
opencv-python-headless 4.5.1.48
opt-einsum             3.3.0
packaging              20.8
panda3d                1.10.8
panda3d-gltf           0.12
panda3d-simplepbr      0.7
pandas                 1.2.0
pandocfilters          1.4.3
parso                  0.8.1
pexpect                4.8.0
pickleshare            0.7.5
Pillow                 7.2.0
pip                    20.3.3
pluggy                 0.13.1
prometheus-client      0.9.0
prompt-toolkit         3.0.10
protobuf               3.14.0
psutil                 5.8.0
ptyprocess             0.7.0
py                     1.10.0
py-cpuinfo             7.0.0
py-spy                 0.3.4
pyasn1                 0.4.8
pyasn1-modules         0.2.8
pybullet               3.0.8
pycparser              2.20
pyglet                 1.5.0
Pygments               2.7.4
PyHamcrest             2.0.2
pynput                 1.7.2
pyparsing              2.4.7
pyrsistent             0.17.3
pytest                 6.2.1
pytest-benchmark       3.2.3
pytest-cov             2.11.0
pytest-forked          1.3.0
pytest-notebook        0.6.1
pytest-xdist           2.2.0
python-dateutil        2.8.1
python-xlib            0.29
pytz                   2020.5
PyWavelets             1.1.1
PyYAML                 5.3.1
pyzmq                  21.0.1
ray                    0.8.6
redis                  3.4.1
requests               2.25.1
requests-oauthlib      1.3.0
rich                   9.8.2
rsa                    4.7
Rtree                  0.9.7
scikit-image           0.18.1
scikit-learn           0.24.0
scipy                  1.4.1
Send2Trash             1.5.0
setuptools             47.1.0
sh                     1.14.1
Shapely                1.7.1
six                    1.15.0
sklearn                0.0
smarts                 0.4.11    /SMARTS
smmap                  3.0.4
soupsieve              2.1
supervisor             4.2.1
tableprint             0.9.1
tabulate               0.8.7
tensorboard            2.2.2
tensorboard-plugin-wit 1.7.0
tensorboardX           2.1
tensorflow             2.2.1
tensorflow-estimator   2.2.0
termcolor              1.1.0
terminado              0.9.2
testpath               0.4.4
threadpoolctl          2.1.0
tifffile               2021.1.14
toml                   0.10.2
torch                  1.4.0
torchfile              0.1.0
torchvision            0.5.0
tornado                6.1
traitlets              5.0.5
trimesh                3.9.1
Twisted                20.3.0
typing-extensions      3.7.4.3
urllib3                1.26.2
visdom                 0.1.8.9
wcwidth                0.2.5
webencodings           0.5.1
websocket-client       0.57.0
Werkzeug               1.0.1
wheel                  0.36.2
wrapt                  1.12.1
yarl                   1.6.3
yattag                 1.14.0
zipp                   3.4.0
zope.interface         5.2.0

Steps to Reproduce

Once the repository is downloaded, SUMO is installed, and the virtual environment is created and activated with the listed packages:

$ cd SMARTS/
$ git checkout master  # The current master branch available (latest commit: ebd72d6)
$ scl scenario build-all ultra/scenarios/pool
$ python ultra/scenarios/interface.py generate --task 1 --level easy
$ ./ultra/env/envision_base.sh

Then open ultra/train.py and change the line:

policy_class = "ultra.baselines.sac:sac-v0"

to:

policy_class = "ultra.baselines.dqn:dqn-v0"

Finally, run the training, redirecting the output to a text file:
```
$ ray stop
$ nohup python -u ultra/train.py --task 1 --level easy > log.txt &
```
View the file with $ tail -f log.txt

Notes

I do not encounter this error when running the training in headless mode, i.e. with nohup python -u ultra/train.py --task 1 --level easy --headless True > log.txt &
I ran the training in the ULTRA Docker container built from the ULTRA Dockerfile, but I expect the error would also occur outside Docker, though I have not verified this.
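
To compare the headless and non-headless runs, per-process memory could be sampled while training is in progress. The following is a minimal sketch, not something from the original report: it assumes a Linux system with the procps `ps` command (which reports `rss` in KiB), and the function names and the `memory_log.txt` path are hypothetical.

```python
import subprocess
import time


def parse_ps(text):
    """Parse `ps -eo pid,rss,args` output into (pid, rss_gib, command) tuples."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        pid, rss_kib, cmd = parts
        rows.append((int(pid), int(rss_kib) / 1024**2, cmd))  # KiB -> GiB
    return rows


def top_consumers(n=5):
    """Return the n processes with the largest resident set size."""
    out = subprocess.run(
        ["ps", "-eo", "pid,rss,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted(parse_ps(out), key=lambda row: row[1], reverse=True)[:n]


def log_memory(path, samples, interval_s=60):
    """Append `samples` timestamped snapshots of the top consumers to `path`."""
    with open(path, "a") as log:
        for _ in range(samples):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for pid, rss_gib, cmd in top_consumers():
                log.write(f"{stamp}\t{pid}\t{rss_gib:.2f}GiB\t{cmd}\n")
            log.flush()
            time.sleep(interval_s)


# Example (hypothetical path): one snapshot per minute for 12 hours.
# log_memory("memory_log.txt", samples=720)
```

Running this alongside both training modes would show whether ray::__main__.train() grows steadily only when Envision is attached, or in both cases at different rates.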

Let me know what I missed adding or if any other information would be helpful.

About this issue

State: closed
Created 3 years ago
Comments: 21 (16 by maintainers)

Most upvoted comments

Glad it could help! Sure, I can bring that up with the ULTRA team.

Update: Headless mode is now the default (#703).

christianjans on Mar 23, 2021