dask: Pyarrow metadata `RuntimeError` in `to_parquet`
Offline, a user reported getting `RuntimeError: file metadata is only available after writer close` when writing a Dask DataFrame to parquet with our pyarrow engine. The traceback they were presented with was:
```
Traceback (most recent call last):
  File "example.py", line 349, in <module>
    main(date_dict, example_conf)
  File "example.py", line 338, in main
    make_example_datasets(
  File "example.py", line 311, in make_example_datasets
    default_to_parquet(sub_ddf, v["path"], engine="pyarrow", overwrite=True)
  File "example.py", line 232, in default_to_parquet
    ddf.to_parquet(path=path, engine=engine, overwrite=overwrite, write_metadata_file=False)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 4453, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 721, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 286, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 568, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2743, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2020, in gather
    return self.sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 861, in sync
    return sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
    result[0] = yield future
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 1885, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 947, in write_partition
    pq.write_table(
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 1817, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 662, in __exit__
    self.close()
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py", line 684, in close
    self._metadata_collector.append(self.writer.metadata)
  File "pyarrow/_parquet.pyx", line 1434, in pyarrow._parquet.ParquetWriter.metadata.__get__
RuntimeError: file metadata is only available after writer close
```
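For reference, the user-level call pattern in the traceback boils down to something like the following (a minimal sketch; the DataFrame contents and output path are illustrative assumptions, not the user's actual data, and as discussed below the original write targeted S3 via s3fs):

```python
# Minimal sketch of the failing call pattern from the traceback above.
# The DataFrame contents and output path are assumptions for illustration.
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"ts": pd.to_datetime(["2021-01-01", "2021-01-02"])}),
    npartitions=2,
)
ddf.to_parquet(
    path="example-output/",  # the original write went to an S3 path via s3fs
    engine="pyarrow",
    overwrite=True,
    write_metadata_file=False,
)
```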
cc @rjzamora in case you’ve seen this before or have an idea of what might be causing this
Thanks for the reproducer! I can reproduce it with the above dask example, but if I try to extract the relevant pyarrow example, I don’t see the failure:
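(A sketch of what that extracted pyarrow example might have looked like; the table contents, options, and file name are assumptions reconstructed from the error messages mentioned just below.)

```python
# Write a nanosecond timestamp with coerce_timestamps="us", which should
# raise the casting error, then check whether the metadata_collector still
# received a FileMetaData object. Data and file name are assumptions.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(
    pd.DataFrame({"ts": pd.to_datetime(["2021-01-01 00:00:00.000000001"])})
)

metadata_collector = []
try:
    pq.write_table(
        table,
        "test.parquet",
        coerce_timestamps="us",
        metadata_collector=metadata_collector,
    )
except Exception as exc:
    print(exc)  # Casting from timestamp[ns] to timestamp[us] would lose data

print(metadata_collector)  # expected: [<pyarrow._parquet.FileMetaData ...>]
```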
(I get the expected error about "Casting from timestamp[ns] to timestamp[us] would lose data", and the `metadata_collector` actually gets filled with a `FileMetaData` object.) Would the fact that it is executed in threads when using dask influence it somehow?
So if it fixes the error for you, we can certainly apply the patch. But it would also be nice to have a reproducer for our own test suite that doesn't rely on dask.
Unfortunately not, I thought I had it, and it went away again…
I tried a small reproducer with a public S3 bucket for which I don’t have write permissions:
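(A hedged sketch of that reproducer; the bucket name below is an illustrative assumption, since any public bucket without write access should behave the same.)

```python
# Attempt a dask to_parquet write to an S3 bucket we cannot write to.
# The bucket name is a placeholder assumption.
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=1)
ddf.to_parquet(
    "s3://some-public-read-only-bucket/test.parquet",
    engine="pyarrow",
    write_metadata_file=False,
)
```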
With the latest s3fs, that gives a `PermissionError: Access Denied` / `ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied`. But trying the older s3fs 0.4.2 as you listed, I see the `RuntimeError` as well (although also an AccessDenied).
@kinghuang could you try with a more recent version of s3fs to see if that gives a more informative error message? You mentioned https://github.com/dask/dask/issues/6782 as the reason for pinning s3fs to 0.4.2, but I think those issues should be resolved in the latest versions of s3fs / fsspec / pyarrow.