dask: Parquet `_metadata` too large to parse

I encountered the following error (the full traceback is further down):

OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached

when reading a Parquet dataset on S3 with a large _metadata file. In this particular case the _metadata file was 676.5 MB. Based on the error message, it looks like the _metadata file is too big for pyarrow to parse.
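
A minimal sketch of the read that triggers this (the bucket path mirrors the traceback below; the exact arguments of the original call are an assumption):

```python
import dask.dataframe as dd

# Reading a dataset whose consolidated _metadata footer is this large
# (676.5 MB here) trips thrift's default MaxMessageSize inside pyarrow.
# Path is taken from the error message; storage options elided.
df = dd.read_parquet(
    "s3://oss-shared-scratch/timeseries.parquet",
    engine="pyarrow",
)
```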

cc @jorisvandenbossche @rjzamora

Full traceback:
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<timed exec> in <module>

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
    314         gather_statistics = True
    315 
--> 316     read_metadata_result = engine.read_metadata(
    317         fs,
    318         paths,

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
    540             split_row_groups,
    541             gather_statistics,
--> 542         ) = cls._gather_metadata(
    543             paths,
    544             fs,

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, dataset_kwargs)
   1038             if fs.exists(meta_path):
   1039                 # Use _metadata file
-> 1040                 ds = pa_ds.parquet_dataset(
   1041                     meta_path,
   1042                     filesystem=fs,

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/dataset.py in parquet_dataset(metadata_path, schema, filesystem, format, partitioning, partition_base_dir)
    492     )
    493 
--> 494     factory = ParquetDatasetFactory(
    495         metadata_path, filesystem, format, options=options)
    496     return factory.finish(schema)

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetDatasetFactory.__init__()

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached

Most upvoted comments

Thank you for tracking this down

On Wed, Aug 18, 2021 at 8:31 AM Joris Van den Bossche wrote:

OK, so indeed this is related to the thrift version. In my local development environment, I still had libthrift 0.13, with which it works fine as mentioned above. But the latest releases of pyarrow on conda-forge are built with libthrift 0.14. Creating a new local environment with thrift 0.14 and building pyarrow master with that, I also get the error.

It seems that Apache Thrift introduced a MaxMessageSize configuration option (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) in version 0.14 (https://issues.apache.org/jira/browse/THRIFT-5237).

Fastparquet uses the Python bindings of thrift, and it seems those are still at version 0.13 and not yet updated to 0.14 (https://github.com/conda-forge/thrift-feedstock/). So I suppose this is why the file could be created with the fastparquet engine.
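
A sketch of what that implies as a short-term workaround, assuming the same dataset path as in the traceback: reading with the fastparquet engine should sidestep the cap, since its thrift bindings predate MaxMessageSize.

```python
import dask.dataframe as dd

# Assumed workaround sketch: fastparquet parses the footer with thrift's
# Python bindings (still 0.13), which have no MaxMessageSize cap, so the
# oversized _metadata should still deserialize. Path is hypothetical.
df = dd.read_parquet(
    "s3://oss-shared-scratch/timeseries.parquet",
    engine="fastparquet",
)
```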

If you have an environment with Arrow built against libthrift 0.13, the file can be read fine. But with conda-forge, the recent releases are built against 0.14, and you can only get pyarrow 3.0 or lower with libthrift=0.13.

I opened https://issues.apache.org/jira/browse/ARROW-13655 on the Arrow side. I suppose we could expose this configuration option there (but that's not a short-term workaround).
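
A sketch of what exposing that option could look like, assuming a later pyarrow release that adds thrift size-limit keyword arguments in the direction ARROW-13655 points at (not available in the versions discussed here):

```python
import pyarrow.parquet as pq

# Assumption: a pyarrow release exposing the thrift limits from ARROW-13655
# as keyword arguments on ParquetFile. Raising the string-size limit past
# the 676.5 MB footer would let the _metadata file deserialize.
md = pq.ParquetFile(
    "timeseries.parquet/_metadata",          # local copy of the footer file
    thrift_string_size_limit=1_000_000_000,  # bytes, > 676.5 MB
).metadata
print(md.num_row_groups)
```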
