datasets: Checksums didn't match for dataset source
Dataset viewer issue for ‘wiki_lingua*’
Link: link to the dataset viewer page
data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]")
short description of the issue
NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?export=download&id=11wMGqNVSwwk6zUnDaJEgm3qT71kAHeff']
Am I the one who added this dataset? No
About this issue
- State: closed
- Created 2 years ago
- Comments: 26 (10 by maintainers)
Hi! Installing the datasets package from master (pip install git+https://github.com/huggingface/datasets.git) and then redownloading the datasets with download_mode set to force_redownload (e.g. dataset = load_dataset("dbpedia_14", download_mode="force_redownload")) should fix the issue.

Same for the multi_news dataset.
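For completeness, a minimal sketch of that workaround (dbpedia_14 is just the example dataset from the comment above; the install-from-master step and download_mode="force_redownload" are taken from it):

```python
# Minimal sketch of the workaround above.
# Step 1 (shell): install datasets from master:
#   pip install git+https://github.com/huggingface/datasets.git
# Step 2: reload the affected dataset, bypassing the stale cache.
from datasets import load_dataset

# force_redownload ignores whatever is cached locally and fetches the source
# files again, so the recorded checksums are checked against fresh data.
dataset = load_dataset("dbpedia_14", download_mode="force_redownload")
print(dataset)
```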
This is a super-common failure mode. We really need to find a better workaround. My solution was to wait until the owner of the dataset in question did the right thing, and then I had to delete my cached versions of the datasets with the bad checksums. I don’t understand why this happens. Would it be possible to maintain a copy of the most recent version that was known to work, and roll back to that automatically if the checksums fail? And if the checksums fail, couldn’t the system automatically flush the cached versions with the bad checksums? It feels like we are blaming the provider of the dataset, when in fact, there are things that the system could do to ease the pain. Let’s take these error messages seriously. There are too many of them involving too many different datasets.
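Until something like that exists, the manual cache flush mentioned above can be scripted. A minimal sketch, assuming the default cache location (~/.cache/huggingface/datasets; override via HF_DATASETS_CACHE) and using wiki_lingua as a placeholder dataset name:

```python
import shutil
from pathlib import Path

# Default cache location used by the datasets library; adjust this if the
# HF_DATASETS_CACHE environment variable points somewhere else.
cache_root = Path.home() / ".cache" / "huggingface" / "datasets"

# "wiki_lingua" is only an example; cached datasets live in directories named
# after the dataset, with config/version subdirectories underneath.
for cached_dir in cache_root.glob("wiki_lingua*"):
    print(f"Removing stale cache: {cached_dir}")
    shutil.rmtree(cached_dir)
```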
- exams was fixed on 16 Aug by this PR:
- dart has been transferred to the Hub: https://huggingface.co/datasets/dart/discussions/1
- multi_news has been transferred to the Hub as well: https://huggingface.co/datasets/multi_news/discussions/1

For information to everybody, we are removing the checksum verifications in datasets (they were creating a bad user experience). This will be in place in the following weeks.
We have fixed the issues with the datasets listed above.
@afcruzs-ms I think your issue is a different one, because that dataset is not hosted on Google Drive. Would you mind opening another issue for that other problem, please? Thanks! 😃
Hi @rafikg, I think that is a different issue. Let me check it…
I guess you are using a different Python version than the one the dataset owner used to create the pickle file…
I can see this problem in xcopa too; unfortunately, installing the latest master (1.18.4.dev0) doesn’t work, @albertvillanova.
Throws
I think this is a side-effect of #3787. The checksums won’t match because the URLs have changed. @rafikg @Y0mingZhang, while this is being fixed, maybe you can load the datasets like this:
data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]", ignore_verifications=True)
dataset = load_dataset("dbpedia_14", ignore_verifications=True)
This will most probably skip the verifications and integrity checks listed here
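Putting the two suggestions together, here is a rough sketch that assumes a datasets release from the time of this thread, i.e. one where load_dataset still accepts ignore_verifications and raises NonMatchingChecksumError (the import path below is an assumption; newer releases replaced both with verification_mode):

```python
# Sketch only: relies on the older ignore_verifications / NonMatchingChecksumError
# API discussed in this thread, not on current datasets releases.
from datasets import load_dataset
from datasets.utils.info_utils import NonMatchingChecksumError  # import path is an assumption

def load_without_stale_checksums(path, **kwargs):
    """Try a normal load first; on a checksum mismatch, force a fresh
    download and skip the recorded checksum verification."""
    try:
        return load_dataset(path, **kwargs)
    except NonMatchingChecksumError:
        return load_dataset(
            path,
            download_mode="force_redownload",
            ignore_verifications=True,
            **kwargs,
        )

# "english" stands in for the `language` config variable used earlier in the issue.
data = load_without_stale_checksums("wiki_lingua", name="english", split="train[:2000]")
```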