datasets: `datasets` can't read a Parquet file in Python 3.9.13
Describe the bug
I have an error when trying to load this dataset (it's private, but I can add you to the bigcode org). `datasets` can't read one of the Parquet files in the Java subset:
```python
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-dedup-pjj", data_dir="data/java", split="train", revision="v1.1.a1", use_auth_token=True)
```
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
It seems to be an issue with newer Python versions, because it works in these two environments:
- `datasets` version: 2.6.1
- Platform: Linux-5.4.0-131-generic-x86_64-with-glibc2.31
- Python version: 3.9.7
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
But not in this one:
- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
Steps to reproduce the bug
Load the dataset with the snippet above in Python 3.9.13.
Expected behavior
The dataset loads without the PyArrow error.
Environment info
- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (7 by maintainers)
Cool!
We don’t perform integrity verifications if we don’t know in advance the hash of the file to download.
`datasets` caches the files by URL and ETag. If the content of a file changes, then the ETag changes and so it redownloads the file. I think you have to try them all 😕
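Since the traceback doesn't say which shard is broken, a quick way to "try them all" is to scan the local download cache and test each file's Parquet footer. This is only a sketch under assumptions: the cache path below is the default location and may differ on your machine, and the metadata-file filtering is illustrative.

```python
import os
from pathlib import Path

import pyarrow.parquet as pq

# Assumption: downloaded shards live under the default datasets download cache;
# adjust the path if HF_HOME or HF_DATASETS_CACHE points somewhere else.
downloads = Path("~/.cache/huggingface/datasets/downloads").expanduser()

for path in sorted(p for p in downloads.rglob("*") if p.is_file()):
    if path.suffix in {".json", ".lock"}:  # skip cache metadata files
        continue
    # A complete Parquet file ends with the 4-byte magic "PAR1"; a truncated
    # download usually fails this check before pyarrow even parses the footer.
    if path.stat().st_size < 8:
        print(f"file too small to be Parquet: {path}")
        continue
    with path.open("rb") as f:
        f.seek(-4, os.SEEK_END)
        magic = f.read(4)
    if magic != b"PAR1":
        print(f"missing Parquet footer magic: {path}")
        continue
    try:
        pq.ParquetFile(path)  # raises ArrowInvalid on corrupted metadata
    except Exception as err:
        print(f"pyarrow cannot open {path}: {err}")
```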
Alternatively, you can add a try/catch in `parquet.py` in `datasets` to raise the name of the file that fails at doing `parquet_file = pq.ParquetFile(f)` when you run your initial code, but it will still iterate over all the files until it fails.
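For reference, that guard could look roughly like the sketch below. The helper name `open_parquet_shard` is hypothetical and this is not the exact code in `datasets`' `parquet.py`; it only illustrates re-raising with the offending file name.

```python
import pyarrow.parquet as pq


def open_parquet_shard(path: str) -> pq.ParquetFile:
    """Open one Parquet shard and surface its name if pyarrow rejects it."""
    try:
        return pq.ParquetFile(path)
    except Exception as err:
        # Re-raise with the shard path so the traceback names the broken file.
        raise ValueError(f"Failed to read Parquet file: {path}") from err
```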