datasets: Checksums didn't match for dataset source

Dataset viewer issue for ‘wiki_lingua*’

Link: link to the dataset viewer page

Short description of the issue: running

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]")

raises

NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?export=download&id=11wMGqNVSwwk6zUnDaJEgm3qT71kAHeff']

Am I the one who added this dataset? No

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 26 (10 by maintainers)

Most upvoted comments

Hi! Installing the datasets package from master (pip install git+https://github.com/huggingface/datasets.git) and then redownloading the datasets with download_mode set to force_redownload (e.g. dataset = load_dataset("dbpedia_14", download_mode="force_redownload")) should fix the issue.
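For reference, the two steps look roughly like this (a minimal sketch; dbpedia_14 is just the example dataset from the comment above, so substitute the dataset that fails for you):

# In a shell: install datasets from the master branch
#   pip install git+https://github.com/huggingface/datasets.git

# In Python: force a fresh download so the cached files with stale checksums are replaced
from datasets import load_dataset
dataset = load_dataset("dbpedia_14", download_mode="force_redownload")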

Same for the multi_news dataset.

This is a super-common failure mode. We really need to find a better workaround. My solution was to wait until the owner of the dataset in question did the right thing, and then I had to delete my cached versions of the datasets with the bad checksums. I don’t understand why this happens. Would it be possible to maintain a copy of the most recent version that was known to work, and roll back to that automatically if the checksums fail? And if the checksums fail, couldn’t the system automatically flush the cached versions with the bad checksums? It feels like we are blaming the provider of the dataset, when in fact, there are things that the system could do to ease the pain. Let’s take these error messages seriously. There are too many of them involving too many different datasets.
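If you want to flush the bad cached copies by hand in the meantime, something like the sketch below works, assuming the library's default cache location of ~/.cache/huggingface/datasets (the exact layout can differ between datasets versions, and wiki_lingua is just the example dataset here):

import shutil
from pathlib import Path

# Default cache root used by the datasets library (can be overridden via HF_DATASETS_CACHE)
cache_root = Path.home() / ".cache" / "huggingface" / "datasets"

# Remove the prepared copy of the affected dataset so it gets rebuilt on the next load
shutil.rmtree(cache_root / "wiki_lingua", ignore_errors=True)

# The raw downloads are keyed by URL hash under downloads/; clearing that folder
# (blunt, but effective) drops the files whose checksums no longer match
shutil.rmtree(cache_root / "downloads", ignore_errors=True)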

For everyone's information: we are removing the checksum verifications, which were creating a bad user experience. This will be rolled out in the coming weeks.

We have fixed the issues with the datasets:

  • wider_face: by hosting their data files on the HuggingFace Hub (CC: @HosseynGT)
  • fever: by updating to their new data URLs (CC: @MoritzLaurer)

@afcruzs-ms I think your issue is a different one, because that dataset is not hosted on Google Drive. Would you mind opening another issue for that problem, please? Thanks! 😃

Hi @rafikg, I think that is yet another issue. Let me check it…

My guess is that you are using a different Python version than the one the dataset owner used to create the pickle file…

I can see this problem in xcopa too; unfortunately, installing the latest master (1.18.4.dev0) doesn't work, @albertvillanova.

from datasets import load_dataset
dataset = load_dataset("xcopa", "it")

Throws

in verify_checksums(expected_checksums, recorded_checksums, verification_name)
     38     if len(bad_urls) > 0:
     39         error_msg = "Checksums didn't match" + for_verification_name + ":\n"
---> 40         raise NonMatchingChecksumError(error_msg + str(bad_urls))
     41     logger.info("All the checksums matched successfully" + for_verification_name)
     42 

NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://github.com/cambridgeltl/xcopa/archive/master.zip']

I think this is a side effect of #3787. The checksums won't match because the URLs have changed. @rafikg @Y0mingZhang, until this is fixed, maybe you can load the datasets as such:

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]", ignore_verifications=True)
dataset = load_dataset("dbpedia_14", ignore_verifications=True)

This will, most probably, skip the verifications and integrity checks listed here.
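If the mismatched files are already in your local cache, it may also be necessary to combine this with a forced re-download; a hedged variant of the call above (not something verified by the maintainers in this thread):

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]", ignore_verifications=True, download_mode="force_redownload")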