dask: Parquet `_metadata` too large to parse
I encountered the following error (full traceback further down) when reading a Parquet dataset on S3 with a large `_metadata` file:

`OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached`

In this particular case the `_metadata` file was 676.5 MB. Based on the error message, it looks like this `_metadata` file is too big for pyarrow to parse.
cc @jorisvandenbossche @rjzamora
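For reference, here is a minimal sketch of one possible workaround, assuming a dask release that has the `ignore_metadata_file` keyword (added after this report was filed): skip the aggregated `_metadata` file and let dask gather metadata from the individual part files instead.

```python
import dask.dataframe as dd

# Possible workaround sketch: ignore the oversized _metadata file and
# collect metadata from the per-file Parquet footers instead. Assumes a
# dask release that supports the ignore_metadata_file keyword.
df = dd.read_parquet(
    "s3://oss-shared-scratch/timeseries.parquet",
    ignore_metadata_file=True,
)
```

This trades the single large footer read for one small footer read per part file, so it avoids the thrift size limit at the cost of more metadata requests.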
Full traceback:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<timed exec> in <module>
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
314 gather_statistics = True
315
--> 316 read_metadata_result = engine.read_metadata(
317 fs,
318 paths,
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
540 split_row_groups,
541 gather_statistics,
--> 542 ) = cls._gather_metadata(
543 paths,
544 fs,
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, dataset_kwargs)
1038 if fs.exists(meta_path):
1039 # Use _metadata file
-> 1040 ds = pa_ds.parquet_dataset(
1041 meta_path,
1042 filesystem=fs,
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/dataset.py in parquet_dataset(metadata_path, schema, filesystem, format, partitioning, partition_base_dir)
492 )
493
--> 494 factory = ParquetDatasetFactory(
495 metadata_path, filesystem, format, options=options)
496 return factory.finish(schema)
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetDatasetFactory.__init__()
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached
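As a point of reference, newer pyarrow releases expose the thrift deserialization limits that trip here. A minimal sketch of reading the oversized footer directly, assuming a pyarrow version that accepts the `thrift_string_size_limit` / `thrift_container_size_limit` keywords (the limit values below are illustrative, not recommendations):

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Open the oversized _metadata file and raise pyarrow's thrift
# deserialization limits so the footer can be parsed. Both keyword
# arguments exist only in pyarrow releases newer than the one in the
# traceback above; the limit values here are illustrative.
with fs.open("oss-shared-scratch/timeseries.parquet/_metadata", "rb") as f:
    pf = pq.ParquetFile(
        f,
        thrift_string_size_limit=1_000_000_000,  # bytes per thrift string
        thrift_container_size_limit=10_000_000,  # elements per thrift container
    )
    print(pf.metadata.num_row_groups)
```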