ray: Ray Train: xgboost training causes OOMs when using scipy.sparse matrices
What happened + What you expected to happen
Bug: During XGBoost training with Ray Train, node memory usage increases continuously until a worker is OOM-killed. Object store memory stays low, so the source of the memory growth is unknown.
Expected behavior: training completes without running out of memory.
Logs:
2022-10-05 23:57:23,026 WARNING worker.py:1829 -- Warning: The actor GBDTTrainable is very large (18 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
2022-10-05 23:57:23,123 WARNING util.py:220 -- The `start_trial` operation took 2.046 s, which may be a performance bottleneck.
2022-10-06 00:25:22,311 WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: f5009a0b2acb808416c8769d74b4d40d7f1b059401000000 Worker ID: 87529db22290f1425ab666d214d7354e259a981b414b90604204beb3 Node ID: 3085d388f09bdd820e4644f3e1898ddcec9a77ac69cf19c1d88f80e7 Worker IP address: 10.1.23.63 Worker port: 38407 Worker PID: 14227 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Versions / Dependencies
Ray version (installed in the notebook): 2.0.0. Other libraries:
aiohttp==3.8.0
aiosignal==1.2.0
anyio==3.4.0
argon2-cffi==21.1.0
async-timeout==4.0.0
attrs==21.2.0
Babel==2.9.1
backcall==0.2.0
backports.zoneinfo==0.2.1
beautifulsoup4==4.9.3
bleach==4.1.0
bokeh==2.4.2
boto3==1.19.10
botocore==1.22.11
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
cgroupspy==0.2.1
chardet==4.0.0
charset-normalizer==2.0.7
click==7.1.2
cloudpickle==2.0.0
colorama==0.4.4
commonmark==0.9.1
conda==4.3.16
cryptography==3.3.2
cycler==0.11.0
Cython==0.29.24
dask==2021.11.2
dask-labextension==5.1.0
dateparser==0.7.6
debugpy==1.5.1
decorator==5.1.0
defusedxml==0.7.1
dill==0.3.4
distlib==0.3.6
distributed==2021.11.2
docker==4.1.0
entrypoints==0.3
fastparquet==0.4.1
feedparser==5.2.1
filelock==3.8.0
frozenlist==1.2.0
fsspec==2022.7.1
google-auth==2.3.3
grafanalib==0.6.3
graphviz==0.14.1
grpcio==1.43.0
HeapDict==1.0.1
idna==3.3
importlib-resources==5.4.0
influxdb-client==1.21.0
ipykernel==6.6.0
ipython==7.30.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
jedi==0.18.1
Jinja2==3.0.3
jmespath==0.10.0
json5==0.9.6
jsonschema==4.2.1
jupyter==1.0.0
jupyter-client==7.1.0
jupyter-console==6.4.0
jupyter-core==4.9.1
jupyter-server==1.12.1
jupyter-server-proxy==3.2.0
jupyterlab==3.2.4
jupyterlab-pygments==0.1.2
jupyterlab-server==2.8.2
jupyterlab-widgets==1.0.2
kiwisolver==1.3.2
kubernetes==20.13.0
llvmlite==0.36.0
locket==0.2.1
MarkupSafe==2.0.1
matplotlib==3.4.3
matplotlib-inline==0.1.3
memory-profiler==0.60.0
mistune==0.8.4
msgpack==1.0.2
multidict==5.2.0
nbclassic==0.3.4
nbclient==0.5.9
nbconvert==6.3.0
nbformat==5.1.3
nest-asyncio==1.5.4
networkx==2.6.3
nodejs==0.1.1
notebook==6.4.6
numba==0.53.1
numpy==1.21.4
nvidia-ml-py==11.450.51
oauthlib==3.1.1
optional-django==0.1.0
packaging==21.2
pandas==1.0.5
pandocfilters==1.5.0
parso==0.8.3
partd==1.2.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.3.1
pkg_resources==0.0.0
platformdirs==2.5.2
prometheus-client==0.12.0
prompt-toolkit==3.0.23
protobuf==3.20.1
psutil==5.8.0
ptyprocess==0.7.0
py-spy==0.3.10
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycosat==0.6.3
pycparser==2.20
pydantic==1.8.2
Pygments==2.10.0
pygooglenews==0.1.2
Pympler==0.9
pyparsing==2.4.7
PyPDF2==1.26.0
pyrsistent==0.18.0
python-dateutil==2.8.2
python-snappy==0.5.4
pytz==2019.3
PyYAML==5.4.1
pyzmq==22.3.0
qtconsole==5.2.1
QtPy==1.11.2
ray==2.0.0
regex==2021.8.28
requests==2.26.0
requests-oauthlib==1.3.0
rich==10.9.0
rsa==4.7.2
ruamel.yaml==0.17.17
ruamel.yaml.clib==0.2.6
Rx==3.2.0
s3fs==0.4.2
s3transfer==0.5.0
scalene==1.3.12
scipy==1.7.0
Send2Trash==1.8.0
simpervisor==0.4
six==1.16.0
sniffio==1.2.0
snorkelflow @ file:///tmp/tmpn5a61l54/snorkelflow-0.0.0.1664836246-py3-none-any.whl
sortedcontainers==2.4.0
soupsieve==2.2.1
statsd==3.3.0
tabulate==0.8.10
tblib==1.7.0
tenacity==6.3.1
tensorboardX==2.5.1
terminado==0.12.1
testpath==0.5.0
thrift==0.15.0
toml==0.10.2
toolz==0.11.2
tornado==6.1
tqdm==4.64.1
traitlets==5.1.1
typing-extensions==3.10.0.2
tzlocal==3.0
urllib3==1.26.7
virtualenv==20.16.5
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.1
widgetsnbextension==3.5.2
wrapt==1.14.1
xgboost==1.6.2
xgboost-ray==0.1.10
yarl==1.7.2
zict==2.0.0
zipp==3.6.0
Python version: 3.8.0. Instance size: AWS EC2 m5.2xlarge. Image: Ubuntu.
Reproduction script
xgboost_ray.pdf. I can provide the X and y used in the notebook via Ray Slack or another channel (I am not able to upload .npz and .npy files here).
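Since the notebook itself cannot be attached, here is a minimal sketch of the kind of setup described above. The shapes, density, column names, and XGBoost parameters are placeholders, not the real X and y:

```python
import numpy as np
import pandas as pd
import scipy.sparse
import ray
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Placeholder sparse feature matrix and binary labels (not the real data).
X = scipy.sparse.random(100_000, 1_000, density=0.01, format="csr", dtype=np.float32)
y = np.random.randint(0, 2, size=X.shape[0])

# Convert to a sparse pandas DataFrame and attach the label column.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["label"] = y

train_dataset = ray.data.from_pandas(df)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="label",
    params={"objective": "binary:logistic", "tree_method": "hist"},
    datasets={"train": train_dataset},
)
result = trainer.fit()  # node memory keeps growing until a worker is OOM-killed
```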
Issue Severity
High: blocking us from onboarding to Ray Train
About this issue
- State: open
- Created 2 years ago
- Comments: 21 (10 by maintainers)
As a workaround, you should be able to use XGBoostTrainer without Ray Datasets if you want to use scipy sparse matrices directly.
Thanks for the repro @yangs16!
I looked into it, and it looks like the problem does not come from Ray Datasets but rather from XGBoost + pandas itself.
You can see a simple example here: https://colab.research.google.com/drive/1SLRman_L4THZlTO22R9X36p48cFnS7Kw?usp=sharing
Creating a DMatrix from the scipy sparse matrix directly works fine. However, if you convert that scipy matrix to a sparse pandas DataFrame, creating a DMatrix from that DataFrame OOMs, even with no Ray involved.
There seems to be some problematic interaction between sparse pandas DataFrames and XGBoost's DMatrix in general; Ray Datasets does not densify the sparse pandas DataFrame.
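For reference, here is a minimal sketch of that comparison with no Ray involved; the matrix is a random placeholder and the sizes are illustrative:

```python
import numpy as np
import pandas as pd
import scipy.sparse
import xgboost as xgb

# Large, very sparse placeholder feature matrix and labels.
X = scipy.sparse.random(1_000_000, 500, density=0.01, format="csr", dtype=np.float32)
y = np.random.randint(0, 2, size=X.shape[0])

# Works fine: DMatrix built directly from the scipy sparse matrix.
dtrain_direct = xgb.DMatrix(X, label=y)

# Reportedly blows up in memory: round-tripping through a sparse pandas DataFrame.
df = pd.DataFrame.sparse.from_spmatrix(X)
dtrain_pandas = xgb.DMatrix(df, label=y)  # this is the step that OOMs in the linked Colab
```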
I also posted a question on the XGBoost forum for this: https://discuss.xgboost.ai/t/dmatrix-oom-with-sparse-pandas-dataframe/3096
Repartitioning happens automatically internally if the number of blocks doesn't equal the number of training workers.
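For illustration, a small sketch of what that amounts to if done manually; the dataset and worker count below are placeholders:

```python
import ray

num_workers = 2              # e.g. ScalingConfig(num_workers=2)
ds = ray.data.range(1000)    # stand-in for the real training dataset

# Ray Train does this automatically when the block count != worker count.
ds = ds.repartition(num_workers)
print(ds.num_blocks())       # -> 2
```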
That's correct, but we may not have zero-copy reads in all cases.
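To make "zero-copy reads" concrete, a small illustration with a plain numpy array; this is the favorable case and does not necessarily carry over to sparse or pandas-backed data:

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

arr = np.zeros((1000, 1000), dtype=np.float64)
ref = ray.put(arr)

# Plain numpy arrays come back as read-only views backed by shared memory,
# i.e. no extra copy is made in the reading worker.
restored = ray.get(ref)
print(restored.flags.writeable)  # False
```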
It looks like you are using a sparse matrix, which might complicate things a bit. We can investigate why you are seeing memory blow up here.
A Ray Dataset only holds references to the actual data, so the size of the Dataset object itself will be very small.
But if the dataset size is 863 MB, you definitely should not be OOMing.
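If it helps with debugging, here is a small sketch (using a placeholder DataFrame) of how to check what a Ray Dataset actually occupies, since the Dataset object itself only holds block references:

```python
import numpy as np
import pandas as pd
import ray

df = pd.DataFrame(np.random.rand(10_000, 10))
ds = ray.data.from_pandas(df)

print(ds.num_blocks())   # number of underlying blocks in the object store
print(ds.size_bytes())   # approximate in-memory size of those blocks
```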