modin: Unexpected behavior when attempt to filter a DF from column in another DF; works with pandas or when os.environ["MODIN_CPUS"] = "1" applied

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Modin version (modin.__version__): 0.9.1
  • Python version: 3.8.11 via conda 4.10.3
  • Code we can use to reproduce: –method 1 df = df[~df[“field_name”].isin(suppression_df[“field_name”])] –OR –method 2 df = ( df.merge(suppression_df, on=[“field_name”], how=‘left’, indicator=True) .query(‘_merge == “left_only”’) .drop(columns=‘_merge’) )

df is created by merging several feather files into a single DF. The dimensions of this process don’t fluctuate. suppression_df is created via pd.read_sql with sqlalchemy (version below). This dimensions of this process don’t fluctuate.

Describe the problem

Goal: to filter an existing data frame by a column in another data frame of the same name. I attempted both methods stated above. The filtering works in that some values are filtered. However, not all the values are filtered, some remain. The number of records that aren’t filtered out changes every run.

After a while I suspected that the parallelism in Modin might be causing things to go awry. I switched to pandas and wham, every record/row that should’ve been dropped was dropped/filtered. With pandas the number of records/rows that were dropped remained the same every time I ran the script.

I then switched back to Modin and used this statement (os.environ[“MODIN_CPUS”] = “1”) to limit the cores to 1 (to emulate pandas). And sure enough, the filtering works as expected (all records/rows that should be dropped/filtered were).

Any thoughts about this?

Source code / logs

IDK how to do this. I’m happy to add more info but will require some assistance.

Using ray, version 1.1.0 I saw this somewhere in the documentation and use it: ray.init( ignore_reinit_error=True, log_to_driver=False )

IPython 7.27.0 PyCharm community 2020.3 sqlalchemy 1.4.22

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 29 (4 by maintainers)

Most upvoted comments

Hi @SaintRod ! You can create test database using pandas or modin to make a reproducer, see example below:

import modin.pandas as pd
# modin util to check equality of frames
from modin.pandas.test.utils import df_equals
import pandas
import os

os.environ["MODIN_CPUS"] = "12"
table_name = "test_table"
query = f"SELECT * FROM {table_name}"
conn = "sqlite:///test_data.db"

# get test dataset
df = pandas.read_csv("s3://modin-datasets/cloud/taxi/trips_xaa.csv")
# put test data to sqlite database
df.to_sql(table_name, conn, index=False)

df_pandas = pandas.read_sql(query, conn)
df_pd = pd.read_sql(query, conn)

df_equals(df_pandas, df_pd) # frames are equal

In this example sqlite database and portion of NYC taxi benchmark dataset were used and it works fine, so it looks like the issue you faced can occur with certain database engine or with some specific data. Can you please provide us some more details about your case? Maybe you can provide us with database name you use, or maybe you can modify dataset from example above to trigger your issue?

Hi @SaintRod! No worries at all - I think we should probably just close this ticket for now, since we no longer know if it’s reproducible or not!

@SaintRod this error looks strange, but sometimes Ray engine can work incorrectly on Windows, please check reproducer with Dask engine or using Docker container to avoid different undesired side effects. Example dockerfile and reproducer script (slightly modified for proper operation with Dask):

Dockerfile
FROM python:3.8-windowsservercore

RUN pip install --no-cache-dir modin[all] pytest sqlalchemy s3fs

CMD [ "cmd.exe" ]
python_example.py
import modin.pandas as pd
# modin util to check equality of frames
from modin.pandas.test.utils import df_equals
import pandas
import os

os.environ["MODIN_CPUS"] = "2"
table_name = "test_table"
query = f"SELECT * FROM {table_name}"
conn = "sqlite:///test_data.db"

if __name__ == "__main__":
  # get test dataset
  df = pandas.read_csv("s3://modin-datasets/cloud/taxi/trips_xaa.csv")
  # put test data to sqlite database
  df.to_sql(table_name, conn, index=False)
  
  df_pandas = pandas.read_sql(query, conn)
  df_pd = pd.read_sql(query, conn)

  df_equals(df_pandas, df_pd) # frames are equal

You can run it using docker command docker run -it -v <path_to_dir_with_dockerfile_and_reproducer>:C:\my_data -w C:\my_data -e MODIN_ENGINE=dask python_example2 python python_example.py