dask: Read/write to same file using `.to_parquet(overwrite=True)` in the same task graph should raise an error
Calling .to_parquet(..., overwrite=True) fails when the Dask dataframe was read from the same parquet dataset, but works when the dataframe is created in code rather than read from file.
What happened: Way 1 of the code below works as expected: it correctly saves the output to parquet. Way 2 does NOT work as expected and fails with FileNotFoundError: [Errno 2] No such file or directory: 'error.parquet/part.1.parquet'
What you expected to happen: I expected Way 2 to work like Way 1.
Minimal Complete Verifiable Example:
import numpy as np
import pandas as pd
import dask.dataframe as dd
# -------------
# way 1 - works
# -------------
print('way 1 - start')
A = np.random.rand(200, 300)
cols = np.arange(0, A.shape[1])
cols = [str(col) for col in cols]
df = pd.DataFrame(A, columns=cols)
ddf = dd.from_pandas(df, npartitions=11)
# compute and resave
ddf = ddf.drop(cols[0:11], axis=1)
dd.to_parquet(
    ddf, 'error.parquet', engine='auto', compression='default',
    write_index=True, overwrite=True, append=False)
print('way 1 - end')
# ----------------------
# way 2 - does NOT work
# ----------------------
print('way 2 - start')
ddf = dd.read_parquet('error.parquet')
# compute and resave
ddf = ddf.drop(cols[0:11], axis=1)
dd.to_parquet(
    ddf, 'error.parquet', engine='auto', compression='default',
    write_index=True, overwrite=True, append=False)
print('way 2 - end')
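For reference, a minimal sketch of one user-side workaround (not part of the original report, and assuming the data fits in memory under the default local scheduler): persist the collection before overwriting, so that no task in the final graph still reads from 'error.parquet' when overwrite=True removes the directory.
import dask.dataframe as dd
ddf = dd.read_parquet('error.parquet')
ddf = ddf.drop(list(ddf.columns[:11]), axis=1)
ddf = ddf.persist()  # materialize partitions so the write no longer depends on the files on disk
dd.to_parquet(
    ddf, 'error.parquet', engine='auto', compression='default',
    write_index=True, overwrite=True, append=False)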
Environment:
- Dask version: dask==2021.3.0
- Python version: 3.8
- Operating System: Mac OS Big Sur 11.2.1
- Install method (conda, pip, source): pip
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (7 by maintainers)
Commits related to this issue
- Raise error when read and write happens to the same directory location in the same task graph Closes #7466 — committed to Madhu94/dask by Madhu94 3 years ago
I’d like to pick this up!
One of our initial concerns about adding an overwrite= argument was that it should really be the user's responsibility to deal with file/directory cleanup. We wanted to avoid the need to deal with issues exactly like this one 😃. I do think it makes sense to add an explicit check for a circular dependency, but we should avoid adding any special handling (temporary directory creation, etc.) within Dask. If the user wants to read from a dataset and then overwrite it within the same graph execution, they should probably get an error. It should be up to the user to decide how to manage resources (i.e. create a temporary directory, or persist the collection in memory before overwriting the data).
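As an illustration of the temporary-directory option mentioned above (the 'error_tmp.parquet' path and the shutil-based swap are illustrative choices, not a Dask API):
import shutil
import dask.dataframe as dd
ddf = dd.read_parquet('error.parquet')
ddf = ddf.drop(list(ddf.columns[:11]), axis=1)
# Write to a different location so the read and the write never target
# the same directory within one graph execution...
dd.to_parquet(ddf, 'error_tmp.parquet', write_index=True)
# ...then swap the datasets outside of the task graph.
shutil.rmtree('error.parquet')
shutil.move('error_tmp.parquet', 'error.parquet')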
That's strange. Another solution would be to raise an error when a user adds a task that will delete a file that is referenced by a different task in the graph. I wonder if @rjzamora or @crusaderky have ideas about what the behavior should be here.
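Roughly, such a check amounts to comparing the normalized write path against the paths the graph still reads from. The sketch below is illustrative only; the helper name and the way read paths are collected are assumptions, not Dask's actual implementation.
import os

def would_overwrite_own_input(write_path, read_paths):
    """Hypothetical helper: True if overwriting `write_path` would delete
    files that other tasks in the same graph still need to read."""
    target = os.path.abspath(write_path)
    for p in map(os.path.abspath, read_paths):
        # Reading the target directory itself, or anything under it, is a conflict.
        if p == target or p.startswith(target + os.sep):
            return True
    return False

# Before deleting anything for overwrite=True, the writer could do something like:
# if overwrite and would_overwrite_own_input(path, paths_read_by_graph):
#     raise ValueError("Cannot overwrite a dataset that is read in the same task graph.")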
Yes, there is backup; thank you for confirming.
Ok, like @rjzamora said, I think the solution to this issue is to raise an error when a read and a write target the same directory location within the same task graph. I am updating the title of this issue to reflect that decision.
Thanks @jsignell @rjzamora. I think that having overwrite= is a nice feature. Many deployments that we have use a combination of Spark and Dask, and having some consistency in how the two handle overwriting is good. I think that this issue can be solved, but given my short experience, you are much better placed than me to answer how.