modin: BUG: passing in large data object to DataFrame apply() function resulting in SegFault

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd
import numpy as np

n_features = 20000
n_samples = 20000
features = [f'f{i+1}' for i in range(n_features)]
samples = [f's{i+1}' for i in range(n_samples)]

df = pd.DataFrame(np.random.rand(n_samples, n_features), columns=features, index=samples).round(2)

# mat_df = pd.DataFrame(np.random.rand(n_features, n_features), columns=features, index=features).round(2)
# res_df = df.dot(mat_df)

# this can be simplified to:
mat_arr = np.random.rand(n_features, n_features)
res_df = df.dot(mat_arr)


Issue Description

I am getting Segmentation fault (core dumped) when doing matrix multiplication on 20K x 20K matrices.

I have two servers, each with 16 threads, >50 GB of available RAM, and >300 GB of available storage. The code was tested on both py@3.8 + modin@0.23.4 and py@3.10 + modin@0.24.1; both showed the same result.

The code calls dot() on two 20K x 20K matrices, each roughly 3 GB (as shown by the object store memory on one of the nodes); the seg fault happens when calling df.dot(mat_df). The same code works on small matrices such as 500 x 500.

There is not much output on the terminal; basically it just says UserWarning: Ray Client is attempting to retrieve a 3.00 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead. and then the program gets killed.

I have tried increasing my object store memory and specifying a plasma directory (see the sketch below), but I still get the seg fault. The problem seems to occur after the computation has finished on the remote node, while the result is being transferred back to the client: I saw the object store memory grow to 12 GB and the client's main memory grow by around 3 GB (which is when the warning message shows up); right when the client's main memory had grown by 3 GB, the program got killed.
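
Roughly what I tried looked like this (a minimal sketch; the size and path are illustrative placeholders, not my exact values):

import ray

# Assumed values for illustration; adjust to the machine's actual capacity.
ray.init(
    object_store_memory=30 * 1024**3,   # raise the object store to ~30 GiB
    _plasma_directory="/mnt/plasma",    # hypothetical plasma store location
)

import modin.pandas as pd  # import Modin after Ray is initialized, so it reuses this session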

Expected Behavior

It should return the transformed df.

Error Logs


No backtrace is produced; the only terminal output is the UserWarning quoted in the description, followed by Segmentation fault (core dumped).

Installed Versions


INSTALLED VERSIONS

commit : 4c01f6464a1fb7a201a4e748c7420e5cb41d16ce
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-84-generic
Version : #93~20.04.1-Ubuntu SMP Wed Sep 6 16:15:40 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.24.1
ray : 2.7.1
dask : 2023.9.3
distributed : 2023.9.3
hdk : None

pandas dependencies

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.16.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

About this issue

  • Original URL
  • State: open
  • Created 9 months ago
  • Comments: 23

Most upvoted comments

@anmyachev I can make it work by not setting RAY_ADDRESS; Modin/Ray seems to pick up the address on its own when I run the client code on one of the Ray nodes. However, it is worth noting that I am seeing a different port after this change: before it was on port 10001, and now 6379 is being picked up.

@SiRumCz good catch! To make sure (if needed) that you are connecting to the existing cluster instead of creating a new one, use ray.cluster_resources() (see the snippet below).
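
Something like this (a minimal sketch; address="auto" assumes the script runs on one of the cluster's nodes):

import ray

ray.init(address="auto")        # attach to the existing cluster, don't start a new one
print(ray.cluster_resources())  # should report the CPUs/memory of both nodes, not just one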

Thanks for all the information! I’m working to figure out what’s wrong.

@anmyachev I have an update: it seems the problem is not from dot; it might come from the network instead. I removed the dot operation and the code still fails with a seg fault:

import modin.pandas as pd
import numpy as np

n_features = 20000
n_samples = 20000
features = [f'f{i+1}' for i in range(n_features)]
samples = [f's{i+1}' for i in range(n_samples)]

df = pd.DataFrame(np.random.rand(n_samples, n_features), columns=features, index=samples).round(2)
mat_df = pd.DataFrame(np.random.rand(n_features, n_features), columns=features, index=features).round(2)
res_df = df.apply(lambda x,m: m.iloc[0], axis=1, m=mat_df) # this will still get seg fault

Error message:

>>> df.apply(lambda x,m: m.head(1), axis=1, m=mat_df)
UserWarning: Ray Client is attempting to retrieve a 2.99 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead.
Segmentation fault (core dumped)

The problem seems to occur during or after the Client's request for the object over the network.
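
One way to isolate the transfer itself (a sketch I have not run; the address is a placeholder): push a ~3 GiB array through Ray Client without Modin and see whether retrieving a large object over the network alone reproduces the crash.

import numpy as np
import ray

ray.init(address="ray://<head-node>:10001")  # hypothetical Ray Client address

ref = ray.put(np.random.rand(20000, 20000))  # ~3 GiB of float64
arr = ray.get(ref)  # the "retrieving over the network" warning fires here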

Update: While it works fine with a simple apply func, it gives me a segfault if I include the large matrix df as an argument to the apply function (even if it is not used):

>>> df.apply(lambda x: x+1)
          f1    f2    f3    f4    f5    f6    f7    f8    f9   f10   f11   f12   f13   f14  ...  f19987  f19988  f19989  f19990  f19991  f19992  f19993  f19994  f19995  f19996  f19997  f19998  f19999  f20000
s1      1.87  1.05  1.88  1.67  1.84  1.12  1.58  1.68  1.33  1.81  1.21  1.33  1.14  1.93  ...    1.89    1.96    1.40    1.08    1.27    1.13    1.81    1.77    1.78    1.56    1.39    1.90    1.15    1.59
...      ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
s20000  1.28  1.24  1.66  1.27  1.47  1.07  1.06  1.24  1.84  1.57  1.90  1.36  1.01  1.08  ...    1.25    1.44    1.34    1.78    1.94    1.73    1.52    1.60    1.87    1.47    1.07    1.43    1.69    1.47

[20000 rows x 20000 columns]
>>> df.apply(lambda x, m: x+1, m=mat_df)
UserWarning: Ray Client is attempting to retrieve a 2.99 GiB object over the network, which may be slow. Consider serializing the object to a file and using S3 or rsync instead.
Segmentation fault (core dumped)

The matrix parameter in the apply func doesn't need to be a dataframe; I tried with a numpy array and got the same problem. My guess is that once the argument passed to apply exceeds some size, it triggers some mechanism that leads to the seg fault.
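
A rough way to probe that guess (a sketch; the sizes are illustrative and I have not pinned down the exact threshold):

import numpy as np
import modin.pandas as pd

df = pd.DataFrame(np.random.rand(1000, 1000))

# Grow the extra argument until apply() crashes; the loop should die at the
# first size that is "too large".
for n in (1000, 5000, 10000, 20000):
    m = np.random.rand(n, n)  # ~8 MB at n=1000 up to ~3 GiB at n=20000
    print(f"n={n}:", df.apply(lambda x, m: x + 1, m=m).shape)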