modin: Unexpected behavior when attempting to filter a DF by a column in another DF; works with pandas or when os.environ["MODIN_CPUS"] = "1" is applied
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- Modin version (modin.__version__): 0.9.1
- Python version: 3.8.11 via conda 4.10.3
- Code we can use to reproduce:
  Method 1:
      df = df[~df["field_name"].isin(suppression_df["field_name"])]
  Method 2:
      df = (
          df.merge(suppression_df, on=["field_name"], how="left", indicator=True)
          .query('_merge == "left_only"')
          .drop(columns="_merge")
      )
df is created by merging several feather files into a single DF. Its dimensions don't fluctuate between runs. suppression_df is created via pd.read_sql with sqlalchemy (version below). Its dimensions don't fluctuate between runs either.
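For reference, here is a minimal sketch of how the two methods are expected to behave in plain pandas. The frames are tiny made-up stand-ins for the real data, and the `drop_duplicates` step is an addition that is not in the original snippet: a left merge against duplicated keys would otherwise multiply rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the real frames.
df = pd.DataFrame(
    {"field_name": ["a", "b", "c", "d", "b"], "value": [1, 2, 3, 4, 5]}
)
suppression_df = pd.DataFrame({"field_name": ["b", "d", "d"]})

# Method 1: boolean mask built with isin().
method_1 = df[~df["field_name"].isin(suppression_df["field_name"])]

# Method 2: anti-join via merge with indicator=True.
# Keys are deduplicated first (an addition to the original snippet).
keys = suppression_df[["field_name"]].drop_duplicates()
method_2 = (
    df.merge(keys, on=["field_name"], how="left", indicator=True)
    .query('_merge == "left_only"')
    .drop(columns="_merge")
)

# Both methods should keep exactly the same rows.
same_rows = method_1.reset_index(drop=True).equals(
    method_2.reset_index(drop=True)
)
```

With pandas, both results contain only the rows whose `field_name` is absent from `suppression_df`, which is the behavior the issue expects from Modin as well.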
Describe the problem
Goal: to filter an existing data frame by a column of the same name in another data frame. I attempted both methods stated above. The filtering partly works: some of the matching values are removed, but not all of them. The number of records that aren't filtered out changes on every run.
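One way to put a number on "some remain" is to check the filtered frame against the suppression keys again. The helper below is a hypothetical sketch (the name `find_leftovers` is made up for illustration) and is shown with pandas so it runs without Modin installed.

```python
import pandas as pd


def find_leftovers(filtered, suppression, key):
    """Return the rows of `filtered` whose `key` value still appears in `suppression`."""
    return filtered[filtered[key].isin(suppression[key])]


# Hypothetical toy data: "b" should have been removed but was not.
suppression_df = pd.DataFrame({"field_name": ["b", "d"]})
badly_filtered = pd.DataFrame({"field_name": ["a", "b", "c"]})

leftovers = find_leftovers(badly_filtered, suppression_df, "field_name")
```

An empty result means the filter did its job; logging `len(leftovers)` on each run would show whether the count really varies between runs.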
After a while I suspected that the parallelism in Modin might be causing things to go awry. I switched to pandas and wham, every record/row that should’ve been dropped was dropped/filtered. With pandas the number of records/rows that were dropped remained the same every time I ran the script.
I then switched back to Modin and used this statement (os.environ["MODIN_CPUS"] = "1") to limit Modin to 1 core (to emulate pandas). Sure enough, the filtering works as expected: all records/rows that should be dropped/filtered were.
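A minimal sketch of that single-core workaround follows. Setting the variable before the first Modin import is the conservative ordering assumed here, and the `load_dataframe_library` helper with its pandas fallback is a hypothetical convenience, not something from the original script.

```python
import os

# Set before the first `import modin.pandas` so the setting is in place
# when Modin starts its engine (conservative ordering, assumed here).
os.environ["MODIN_CPUS"] = "1"


def load_dataframe_library():
    """Return modin.pandas if it is installed, otherwise plain pandas.

    Hypothetical helper for this sketch only.
    """
    try:
        import modin.pandas as dataframe_lib
    except ImportError:
        import pandas as dataframe_lib
    return dataframe_lib
```

The rest of the script would then call `pd = load_dataframe_library()` and use `pd` as usual.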
Any thoughts about this?
Source code / logs
IDK how to do this. I’m happy to add more info but will require some assistance.
Using ray, version 1.1.0. I saw this somewhere in the documentation and use it: ray.init(ignore_reinit_error=True, log_to_driver=False)
Other versions: IPython 7.27.0, PyCharm Community 2020.3, sqlalchemy 1.4.22
About this issue
- State: closed
- Created 3 years ago
- Comments: 29 (4 by maintainers)
Hi @SaintRod! You can create a test database using pandas or Modin to make a reproducer; see the example below:
In this example, an sqlite database and a portion of the NYC taxi benchmark dataset were used, and it works fine, so it looks like the issue you faced may occur only with a certain database engine or with some specific data. Can you please provide us with some more details about your case? Maybe you can tell us which database you use, or maybe you can modify the dataset from the example above to trigger your issue?

Hi @SaintRod! No worries at all - I think we should probably just close this ticket for now, since we no longer know if it's reproducible or not!
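For anyone revisiting this, a reproducer in the spirit of the sqlite suggestion could be sketched as follows. It is not the maintainer's original example: it uses synthetic IDs instead of the NYC taxi data, an in-memory database, and plain pandas so that it runs without Modin. Swapping in `modin.pandas` is the step that would exercise the reported behavior, and Modin's `read_sql` may need a different kind of connection object than the raw `sqlite3` connection used here.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database holding the suppression list (synthetic data).
conn = sqlite3.connect(":memory:")
suppressed_ids = [f"id_{i}" for i in range(0, 1000, 3)]
pd.DataFrame({"field_name": suppressed_ids}).to_sql(
    "suppression", conn, index=False
)

# Read the suppression list back through SQL, as in the original report.
suppression_df = pd.read_sql("SELECT field_name FROM suppression", conn)

# Main frame containing every ID, including the suppressed ones.
df = pd.DataFrame(
    {"field_name": [f"id_{i}" for i in range(1000)], "value": range(1000)}
)

# Method 1 from the report, followed by a check for rows that slipped through.
filtered = df[~df["field_name"].isin(suppression_df["field_name"])]
leftover_count = int(
    filtered["field_name"].isin(suppression_df["field_name"]).sum()
)

conn.close()
```

A non-zero `leftover_count`, or a `len(filtered)` that changes from run to run, would confirm the behavior described in the issue.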
@SaintRod this error looks strange, but the Ray engine can sometimes work incorrectly on Windows. Please check the reproducer with the Dask engine, or run it in a Docker container to avoid undesired side effects. Example Dockerfile and reproducer script (slightly modified for proper operation with Dask):
Dockerfile
python_example.py
You can run it using the docker command:
docker run -it -v <path_to_dir_with_dockerfile_and_reproducer>:C:\my_data -w C:\my_data -e MODIN_ENGINE=dask python_example2 python python_example.py