datasets: `datasets` can't read a Parquet file in Python 3.9.13

Describe the bug

I get an error when trying to load this dataset (it's private, but I can add you to the bigcode org). `datasets` can't read one of the Parquet files in the Java subset:

from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-dedup-pjj", data_dir="data/java", split="train", revision="v1.1.a1", use_auth_token=True)

This fails with:

  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
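For context, a valid Parquet file both starts and ends with the 4-byte magic marker PAR1, and the error above means PyArrow did not find the trailing marker. A minimal sketch for checking a suspect file directly (the helper name is illustrative):

import os

def has_parquet_magic(path):
    # A valid Parquet file starts and ends with the marker b"PAR1";
    # "magic bytes not found in footer" means the trailing one is missing.
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, os.SEEK_END)
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"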

It seems to be an issue with newer Python versions, because it works in these two environments:

Environment 1:

- `datasets` version: 2.6.1
- Platform: Linux-5.4.0-131-generic-x86_64-with-glibc2.31
- Python version: 3.9.7
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Environment 2:

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

But not in this one:

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Steps to reproduce the bug

Load the dataset in Python 3.9.13.

Expected behavior

The dataset should load without the PyArrow error.

Environment info

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (7 by maintainers)

Most upvoted comments

Cool!

> But I thought that if something went wrong with a download, `datasets` would create a new cache for all the files?

We don't perform integrity verifications if we don't know the hash of the file to download in advance.
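To illustrate what such a verification involves when an expected hash is known in advance (a generic sketch, not the actual `datasets` internals):

from hashlib import sha256

def verify_sha256(path, expected_hex):
    # Stream the file in 1 MiB chunks and compare the digest
    # against a hash known ahead of time.
    h = sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex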

> At some point I even changed `datasets` versions, so why was it still using that cache?

`datasets` caches files by URL and ETag. If the content of a file changes, then the ETag changes and the file is redownloaded.
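The cache filename is derived from both values, so a new ETag maps to a new file. Roughly, as a simplified sketch of the logic in `datasets/utils/file_utils.py` (the real implementation also preserves some filename suffixes):

from hashlib import sha256

def cache_filename(url, etag=None):
    # Hash the URL, and append a hash of the ETag when one is known,
    # so a changed ETag yields a different cache entry.
    name = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        name += "." + sha256(etag.encode("utf-8")).hexdigest()
    return name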

I think you have to try them all 😕
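One way to try them all without going through `load_dataset` is to open each cached file directly and see which one raises. The cache path below is an assumption; adjust it to your setup:

import glob
import os
import pyarrow.parquet as pq

# Assumed default location of downloaded files in the datasets cache.
cache_dir = os.path.expanduser("~/.cache/huggingface/datasets/downloads")
for path in glob.glob(os.path.join(cache_dir, "*")):
    if path.endswith((".json", ".lock")):  # skip metadata sidecar files
        continue
    try:
        pq.ParquetFile(path)
    except Exception as e:
        print(f"Failed to open {path}: {e}")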

Alternatively, you can add a try/except in `parquet.py` in `datasets` to report the name of the file that fails at `parquet_file = pq.ParquetFile(f)` when you run your initial code (see the sketch after the snippet below):

load_dataset("bigcode/the-stack-dedup-pjj", data_dir="data/java", split="train", revision="v1.1.a1", use_auth_token=True)

but it will still iterate over all the files until it fails.
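A minimal sketch of that kind of patch, as a standalone wrapper (the surrounding code in the packaged parquet module may differ, and the function name is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

def open_parquet_or_report(path):
    # Wrap pq.ParquetFile so the failing file's name is surfaced
    # instead of the bare ArrowInvalid from the traceback above.
    try:
        return pq.ParquetFile(path)
    except pa.lib.ArrowInvalid as e:
        raise ValueError(f"Failed to read Parquet file: {path}") from e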