dask: Read/write to same file using `.to_parquet(overwrite=True)` in the same task graph should raise an error

Performing the operation .to_parquet produces an error if the Dask dataframe is loaded from a Dask file vs. if it is created within the code and not read from file.

What happened: Way 1 of the code below works as expected - it correctly saves the output to parquet. Way 2 of the below code, does NOT work as expected and fails with FileNotFoundError: [Errno 2] No such file or directory: 'error.parquet/part.1.parquet'

What you expected to happen: I expected Way 2 to work as Way 1

Minimal Complete Verifiable Example:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# -------------
# way 1 - works
# -------------
print('way 1 - start')
A = np.random.rand(200,300)
cols = np.arange(0, A.shape[1])
cols = [str(col) for col in cols]
df = pd.DataFrame(A, columns=cols)
ddf = dd.from_pandas(df, npartitions=11)

# compute and resave
ddf.drop(cols[0:11], axis=1)
dd.to_parquet(
	ddf, 'error.parquet', engine='auto', compression='default',
	write_index=True, overwrite=True, append=False)
print('way 1 - end')

# ----------------------
# way 2 - does NOT work 
# ----------------------
print('way 2 - start')
ddf = dd.read_parquet('error.parquet')

# compute and resave
ddf.drop(cols[0:11], axis=1)
dd.to_parquet(
	ddf, 'error.parquet', engine='auto', compression='default',
	write_index=True, overwrite=True, append=False)
print('way 2 - end')

Environment:

Dask version: dask==2021.3.0
Python version: 3.8
Operating System: Mac OS Big Sur 11.2.1
Install method (conda, pip, source): pip

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 1
Comments: 17 (7 by maintainers)

Commits related to this issue

Raise error when read and write happens to the same directory location in the same task graph Closes #7466 — committed to Madhu94/dask by Madhu94 3 years ago
Raise error when read and write happens to the same directory location in the same task graph Closes #7466 — committed to Madhu94/dask by Madhu94 3 years ago

Most upvoted comments

I’d like to pick this up!

Madhu94 on Apr 2, 2021

Another solution would be to raise an error when a user adds a task that will delete a file that is referenced by a different task in the graph. I wonder if @rjzamora or @crusaderky have ideas about what the behavior should be here.

One of our initial concerns about adding an overwrite= argument was that it should really be the users resoponsibility to deal with file/directory cleanup. We wanted to avoid the need to deal with issues exactly like this one 😃. I do think it makes sense to add an explicit check for a circular dependency, but we should avoid adding any special handling (temporary directory creation, etc) within Dask. If the user wants to read from a dataset and then overwrite it within the same graph execution, they should probably get an error. It should be up to the user to decide how to manage resources (i.e. create a temporary directory, or persist the collection in memory before overwriting the data).

BTW, if instead of using … I use … it seems to be faster.

That’s strange.

rjzamora on Mar 25, 2021

Another solution would be to raise an error when a user adds a task that will delete a file that is referenced by a different task in the graph. I wonder if @rjzamora or @crusaderky have ideas about what the behavior should be here.

jsignell on Mar 25, 2021

@crusaderky Thank you for the heads up - will try to implement what you suggested. Would you be able to provide a code snippet on how you would practically modify the above? Thank you!

dd.to_parquet(ddf, 'error.parquet.new', ...)
os.remove('error.parquet')
os.rename('error.parquet.new', 'error.parquet')

d = dd.to_parquet(ddf, 'error.parquet.new', compute=False, ...)
fut, = client.compute(d)
def rename(fut):
   os.remove('error.parquet')
   os.rename('error.parquet.new', 'error.parquet')
fut.add_done_callback(rename)
fut.result()

crusaderky on Apr 1, 2021

Yes, there is backup; thank you for confirming.

mengaldo on Mar 28, 2021

Ok like @rjzamora said I think the solution to this issue is:

If the user wants to read from a dataset and then overwrite it within the same graph execution, they should probably get an error

I am updating the title of this issue to reflect that decision.

jsignell on Mar 26, 2021

Thanks @jsignell @rjzamora I think that having overwrite= is a nice feature. Many deployments that we have use a combination of spark and dask, and having some consistency on how the two handle overwriting is good. I think that this issue can be solved, but given my short experience, you are much better placed than me to answer how

mengaldo on Mar 26, 2021