ray: Ray Train: xgboost training causes OOMs when using scipy.sparse matrices
What happened + What you expected to happen
Bug: During XGBoost training with Ray Train, node memory usage increases continuously until a worker is OOM-killed. Object store memory stays low, so the source of the memory growth is unknown.
Expected behavior: training completes without running out of memory.
Logs:
2022-10-05 23:57:23,026 WARNING worker.py:1829 -- Warning: The actor GBDTTrainable is very large (18 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
2022-10-05 23:57:23,123 WARNING util.py:220 -- The `start_trial` operation took 2.046 s, which may be a performance bottleneck.
2022-10-06 00:25:22,311 WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: f5009a0b2acb808416c8769d74b4d40d7f1b059401000000 Worker ID: 87529db22290f1425ab666d214d7354e259a981b414b90604204beb3 Node ID: 3085d388f09bdd820e4644f3e1898ddcec9a77ac69cf19c1d88f80e7 Worker IP address: 10.1.23.63 Worker port: 38407 Worker PID: 14227 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Versions / Dependencies
Ray version (installed in the notebook): 2.0.0. Other libraries:
aiohttp==3.8.0
aiosignal==1.2.0
anyio==3.4.0
argon2-cffi==21.1.0
async-timeout==4.0.0
attrs==21.2.0
Babel==2.9.1
backcall==0.2.0
backports.zoneinfo==0.2.1
beautifulsoup4==4.9.3
bleach==4.1.0
bokeh==2.4.2
boto3==1.19.10
botocore==1.22.11
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
cgroupspy==0.2.1
chardet==4.0.0
charset-normalizer==2.0.7
click==7.1.2
cloudpickle==2.0.0
colorama==0.4.4
commonmark==0.9.1
conda==4.3.16
cryptography==3.3.2
cycler==0.11.0
Cython==0.29.24
dask==2021.11.2
dask-labextension==5.1.0
dateparser==0.7.6
debugpy==1.5.1
decorator==5.1.0
defusedxml==0.7.1
dill==0.3.4
distlib==0.3.6
distributed==2021.11.2
docker==4.1.0
entrypoints==0.3
fastparquet==0.4.1
feedparser==5.2.1
filelock==3.8.0
frozenlist==1.2.0
fsspec==2022.7.1
google-auth==2.3.3
grafanalib==0.6.3
graphviz==0.14.1
grpcio==1.43.0
HeapDict==1.0.1
idna==3.3
importlib-resources==5.4.0
influxdb-client==1.21.0
ipykernel==6.6.0
ipython==7.30.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
jedi==0.18.1
Jinja2==3.0.3
jmespath==0.10.0
json5==0.9.6
jsonschema==4.2.1
jupyter==1.0.0
jupyter-client==7.1.0
jupyter-console==6.4.0
jupyter-core==4.9.1
jupyter-server==1.12.1
jupyter-server-proxy==3.2.0
jupyterlab==3.2.4
jupyterlab-pygments==0.1.2
jupyterlab-server==2.8.2
jupyterlab-widgets==1.0.2
kiwisolver==1.3.2
kubernetes==20.13.0
llvmlite==0.36.0
locket==0.2.1
MarkupSafe==2.0.1
matplotlib==3.4.3
matplotlib-inline==0.1.3
memory-profiler==0.60.0
mistune==0.8.4
msgpack==1.0.2
multidict==5.2.0
nbclassic==0.3.4
nbclient==0.5.9
nbconvert==6.3.0
nbformat==5.1.3
nest-asyncio==1.5.4
networkx==2.6.3
nodejs==0.1.1
notebook==6.4.6
numba==0.53.1
numpy==1.21.4
nvidia-ml-py==11.450.51
oauthlib==3.1.1
optional-django==0.1.0
packaging==21.2
pandas==1.0.5
pandocfilters==1.5.0
parso==0.8.3
partd==1.2.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.3.1
pkg_resources==0.0.0
platformdirs==2.5.2
prometheus-client==0.12.0
prompt-toolkit==3.0.23
protobuf==3.20.1
psutil==5.8.0
ptyprocess==0.7.0
py-spy==0.3.10
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycosat==0.6.3
pycparser==2.20
pydantic==1.8.2
Pygments==2.10.0
pygooglenews==0.1.2
Pympler==0.9
pyparsing==2.4.7
PyPDF2==1.26.0
pyrsistent==0.18.0
python-dateutil==2.8.2
python-snappy==0.5.4
pytz==2019.3
PyYAML==5.4.1
pyzmq==22.3.0
qtconsole==5.2.1
QtPy==1.11.2
ray==2.0.0
regex==2021.8.28
requests==2.26.0
requests-oauthlib==1.3.0
rich==10.9.0
rsa==4.7.2
ruamel.yaml==0.17.17
ruamel.yaml.clib==0.2.6
Rx==3.2.0
s3fs==0.4.2
s3transfer==0.5.0
scalene==1.3.12
scipy==1.7.0
Send2Trash==1.8.0
simpervisor==0.4
six==1.16.0
sniffio==1.2.0
snorkelflow @ file:///tmp/tmpn5a61l54/snorkelflow-0.0.0.1664836246-py3-none-any.whl
sortedcontainers==2.4.0
soupsieve==2.2.1
statsd==3.3.0
tabulate==0.8.10
tblib==1.7.0
tenacity==6.3.1
tensorboardX==2.5.1
terminado==0.12.1
testpath==0.5.0
thrift==0.15.0
toml==0.10.2
toolz==0.11.2
tornado==6.1
tqdm==4.64.1
traitlets==5.1.1
typing-extensions==3.10.0.2
tzlocal==3.0
urllib3==1.26.7
virtualenv==20.16.5
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.1
widgetsnbextension==3.5.2
wrapt==1.14.1
xgboost==1.6.2
xgboost-ray==0.1.10
yarl==1.7.2
zict==2.0.0
zipp==3.6.0
Python version: 3.8.0. Instance size: AWS EC2 m5.2xlarge. Image: Ubuntu.
Reproduction script
xgboost_ray.pdf. I can provide the X and y used in the notebook via Ray Slack or another channel (I am not able to upload .npz and .npy files here).
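Since the notebook itself cannot be attached, here is a minimal sketch of the kind of setup described above. The shapes, density, column names, and XGBoost parameters are placeholders, not the real X and y:

```python
import numpy as np
import pandas as pd
import scipy.sparse
import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Placeholder sparse feature matrix and binary labels (not the real data).
X = scipy.sparse.random(100_000, 1_000, density=0.01, format="csr", dtype=np.float32)
y = np.random.randint(0, 2, size=X.shape[0])

# Convert to a sparse pandas DataFrame and attach the label column.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["label"] = y

train_dataset = ray.data.from_pandas(df)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="label",
    params={"objective": "binary:logistic", "tree_method": "hist"},
    datasets={"train": train_dataset},
)
result = trainer.fit()  # node memory keeps growing until a worker is OOM-killed
```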
Issue Severity
High: blocking us from onboarding to Ray Train
About this issue
- State: open
- Created 2 years ago
- Comments: 21 (10 by maintainers)
As a workaround, you should be able to use XGBoostTrainer without Ray Datasets if you want to use scipy sparse matrices directly.
Thanks for the repro @yangs16!
I looked into it, and it looks like the problem does not come from Ray Datasets but rather from XGBoost + pandas itself.
You can see a simple example here: https://colab.research.google.com/drive/1SLRman_L4THZlTO22R9X36p48cFnS7Kw?usp=sharing
Creating a DMatrix from the scipy sparse matrix directly works fine. However, if you convert that scipy matrix to a sparse pandas DataFrame, creating a DMatrix from that DataFrame OOMs, even with no Ray involved.
There seems to be some problematic interaction between sparse pandas DataFrames and XGBoost's DMatrix in general; Ray Datasets does not densify the sparse pandas DataFrame.
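For reference, here is a minimal sketch of that comparison with no Ray involved; the matrix is a random placeholder and the sizes are illustrative:

```python
import numpy as np
import pandas as pd
import scipy.sparse
import xgboost as xgb

# Large, very sparse placeholder feature matrix and labels.
X = scipy.sparse.random(1_000_000, 500, density=0.01, format="csr", dtype=np.float32)
y = np.random.randint(0, 2, size=X.shape[0])

# Works fine: DMatrix built directly from the scipy sparse matrix.
dtrain_direct = xgb.DMatrix(X, label=y)

# Reportedly blows up in memory: round-tripping through a sparse pandas DataFrame.
df = pd.DataFrame.sparse.from_spmatrix(X)
dtrain_pandas = xgb.DMatrix(df, label=y)  # this is the step that OOMs in the linked Colab
```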
I also posted a question on the XGBoost forum for this: https://discuss.xgboost.ai/t/dmatrix-oom-with-sparse-pandas-dataframe/3096
Repartitioning happens automatically internally if the number of blocks doesn't equal the number of training workers.
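For illustration, a small sketch of what that amounts to if done manually; the dataset and worker count below are placeholders:

```python
import ray

num_workers = 2              # e.g. ScalingConfig(num_workers=2)
ds = ray.data.range(1000)    # stand-in for the real training dataset

# Ray Train does this automatically when the block count != worker count.
ds = ds.repartition(num_workers)
print(ds.num_blocks())       # -> 2
```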
That's correct, but we may not have zero-copy reads in all cases.
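To make "zero-copy reads" concrete, a small illustration with a plain numpy array; this is the favorable case and does not necessarily carry over to sparse or pandas-backed data:

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

arr = np.zeros((1000, 1000), dtype=np.float64)
ref = ray.put(arr)

# Plain numpy arrays come back as read-only views backed by shared memory,
# i.e. no extra copy is made in the reading worker.
restored = ray.get(ref)
print(restored.flags.writeable)  # False
```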
It looks like you are using a sparse matrix, which might complicate things a bit. We can investigate why you are seeing memory blow up here.
A Ray Dataset only holds references to the actual data, so the size of the Dataset object itself will be very small.
But if the dataset size is 863 MB, you definitely should not be OOMing.
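If it helps with debugging, here is a small sketch (using a placeholder DataFrame) of how to check what a Ray Dataset actually occupies, since the Dataset object itself only holds block references:

```python
import numpy as np
import pandas as pd
import ray

df = pd.DataFrame(np.random.rand(10_000, 10))
ds = ray.data.from_pandas(df)

print(ds.num_blocks())   # number of underlying blocks in the object store
print(ds.size_bytes())   # approximate in-memory size of those blocks
```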