modin: dfsql tests failing on Windows/macOS after the latest modin update

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows, macOS
  • Modin version (modin.__version__): 0.10.1
  • Python version: 3.7, 3.8
  • Code we can use to reproduce:

Describe the problem

We have been getting hanging unit tests in GitHub Actions since upgrading to the latest modin. I haven’t been able to pin down the exact cause; the tests just hang forever.

I am creating this issue in case the problem is modin-related.

Source code / logs

Adding timeouts to the tests revealed logs like the following on Windows (a sketch of the timeout setup follows the log excerpt):

(pid=6528) Windows fatal exception: access violation
(pid=6528) 
(pid=420) Windows fatal exception: access violation
(pid=420) 
(pid=5080) Windows fatal exception: access violation
(pid=5080) 
(pid=3592) Windows fatal exception: access violation
(pid=3592) 
(pid=6112) Windows fatal exception: access violation
(pid=6112) 
(pid=5660) Windows fatal exception: access violation
(pid=5660) 
(pid=6404) Windows fatal exception: access violation
(pid=6404) 
(pid=3924) Windows fatal exception: access violation
(pid=3924) 
(pid=3684) Windows fatal exception: access violation
(pid=3684) 
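
How the timeouts were added is not shown here; below is a sketch assuming the pytest-timeout plugin (an assumption on my part), which dumps thread stacks like the one further below when a test exceeds its limit:

import pytest

# Hypothetical timeout on the test that appears in the traceback below;
# the 300-second limit is arbitrary. Requires the pytest-timeout plugin.
@pytest.mark.timeout(300)
def test_df_sql_nested_select_in():
    ...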

It might be related to Ray saving logs, as in this issue. Oddly, that issue is old, yet these messages did not appear earlier.

On Windows + Python 3.7 (but not 3.8) this segfault happens, which seems to be Ray/modin related:

Thread 0x00001b24 (most recent call first):
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\ray\worker.py", line 1637 in wait
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\ray\_private\client_mode_hook.py", line 62 in wrapper
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\engines\ray\generic\io.py", line 198 in to_csv
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\data_management\factories\factories.py", line 398 in _to_csv
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\data_management\factories\dispatcher.py", line 267 in to_csv
  File "c:\hostedtoolcache\windows\python\3.7.9\x64\lib\site-packages\modin\pandas\base.py", line 2513 in to_csv
  File "d:\a\dfsql\dfsql\dfsql\__init__.py", line 25 in sql_query
  File "d:\a\dfsql\dfsql\dfsql\extensions.py", line 66 in __call__
  File "D:\a\dfsql\dfsql\tests\test_extensions.py", line 47 in test_df_sql_nested_select_in
...
D:\a\_temp\1ac47c42-bfc7-4fcd-b689-944e647c7102.sh: line 1:  1841 Segmentation fault 

The full logs are available here: https://github.com/mindsdb/dfsql/pull/19/checks?check_run_id=3112462155
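
For context, here is a minimal sketch of the operation the thread dump points at (this is not the actual dfsql test code; it assumes the Ray engine, which is Modin's default when Ray is installed):

import modin.pandas as pd

# Per the traceback, dfsql's sql_query() ends up writing a Modin DataFrame to CSV;
# on the Ray engine that call goes through modin.engines.ray.generic.io.to_csv and
# blocks in ray.wait(), which is where the dumped thread is stuck.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.to_csv("out.csv", index=False)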

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 27 (9 by maintainers)

Most upvoted comments

I was able to reproduce the hanging behavior locally (reproducibility is not 100%).

Environment:

conda env create -f environment-dev.yml
set MODIN_CPUS=4
set MODIN_ENGINE=ray
pytest modin\pandas\test\test_io.py::TestCsv::test_hanging_behavior --verbose -s
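
The same configuration can be set programmatically instead of via environment variables (a sketch assuming modin.config exposes Engine and CpuCount in this Modin version):

import modin.config as cfg

cfg.Engine.put("ray")  # equivalent to MODIN_ENGINE=ray
cfg.CpuCount.put(4)    # equivalent to MODIN_CPUS=4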

Simplified reproducer (to be added to the TestCsv class):

def test_hanging_behavior(self):
    # `pd` here is modin.pandas, as imported in modin's test_io.py; the commented
    # print calls were enabled for the run whose log is shown below.
    for i in range(16):
        # print("to_csv")
        pd.DataFrame([1, 2, 3, 4]).to_csv("initial-data.csv", index=False)
        # print("read_csv")
        df = pd.read_csv("initial-data.csv")
        # print("isnull, all, axis=1")
        df.index[df.isnull().all(axis=1)].values.tolist()
        # print("isnull, all, axis=0")
        df.columns[df.isnull().all(axis=0)].values.tolist()

Logs:

...\modin>pytest modin\pandas\test\test_io.py::TestCsv::test_hanging_behavior --verbose -s
=============================================== test session starts ===============================================
platform win32 -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- ...\Miniconda3\envs\modin\python.exe
cachedir: .pytest_cache
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: ...\modin, configfile: setup.cfg
plugins: benchmark-3.4.1, cov-2.11.0, forked-1.3.0, xdist-2.3.0
collected 1 item

modin/pandas/test/test_io.py::TestCsv::test_just_test to_csv
read_csv
(pid=20528) Windows fatal exception: access violation
(pid=20528)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=15432) Windows fatal exception: access violation
(pid=15432)
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=23128) Windows fatal exception: access violation
(pid=23128)
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=3412) Windows fatal exception: access violation
(pid=3412)
isnull, all, axis=0
to_csv
read_csv
(pid=20984) Windows fatal exception: access violation
(pid=20984)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
(pid=7096) Windows fatal exception: access violation
(pid=7096)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=23464) Windows fatal exception: access violation
(pid=23464)
isnull, all, axis=0
to_csv
read_csv
(pid=12504) Windows fatal exception: access violation
(pid=12504)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=11500) Windows fatal exception: access violation
(pid=11500) 
isnull, all, axis=0
to_csv
read_csv
(pid=19948) Windows fatal exception: access violation
(pid=19948)
isnull, all, axis=1
isnull, all, axis=0
to_csv
read_csv
isnull, all, axis=1
(pid=20848) Windows fatal exception: access violation
(pid=20848) 
isnull, all, axis=0
to_csv
2021-08-25 20:03:18,393 WARNING worker.py:1189 -- The actor or task with ID c6cf2fddfe5e7c90b398e5da6a4450ee63f746a18d1ec44e cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining 
{4.000000/4.000000 CPU, 13.969839 GiB/13.969839 GiB memory, 13.969839 GiB/13.969839 GiB object_store_memory, 1.000000/1.000000 node:10.147.230.30}
. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources 
being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.

@rkooo567 did this clarify anything?

In the end, the issue turned out to be Modin-related. Ray hung when I tried to write a Modin DataFrame to disk while using Ray. It seems there was some kind of deadlock, but I am still not sure.

For now I resolved it by using plain pandas to write to disk, after which the issue was gone: https://github.com/mindsdb/dfsql/pull/19/files#diff-287da181ac34dcb8710924d3be04f46fac4c8b26c7de303766af97d571d1b969R26
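
A sketch of that kind of workaround (the exact change is in the linked PR; the availability of modin.utils.to_pandas in this version is an assumption):

import modin.pandas as pd
from modin.utils import to_pandas  # assumption: present in this Modin version

mdf = pd.DataFrame({"a": [1, 2, 3]})
# Materialize as a plain pandas DataFrame first, so the write itself
# never goes through Ray.
to_pandas(mdf).to_csv("out.csv", index=False)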

Still, it’s worth investigating why that was happening.