datasets: Checksums didn't match for dataset source

Dataset viewer issue for ‘wiki_lingua*’

Link: link to the dataset viewer page

Short description of the issue: running

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]")

raises

NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?export=download&id=11wMGqNVSwwk6zUnDaJEgm3qT71kAHeff']

Am I the one who added this dataset? No

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 26 (10 by maintainers)

Most upvoted comments

Hi! Installing the datasets package from master (pip install git+https://github.com/huggingface/datasets.git) and then redownloading the datasets with download_mode set to force_redownload (e.g. dataset = load_dataset("dbpedia_14", download_mode="force_redownload")) should fix the issue.
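For reference, the two steps look roughly like this (a minimal sketch; dbpedia_14 is just the example dataset from the comment above, so substitute the dataset that fails for you):

# In a shell: install datasets from the master branch
#   pip install git+https://github.com/huggingface/datasets.git

# In Python: force a fresh download so the cached files with stale checksums are replaced
from datasets import load_dataset
dataset = load_dataset("dbpedia_14", download_mode="force_redownload")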

Same for the multi_news dataset.

This is a super-common failure mode. We really need to find a better workaround. My solution was to wait until the owner of the dataset in question did the right thing, and then I had to delete my cached versions of the datasets with the bad checksums. I don’t understand why this happens. Would it be possible to maintain a copy of the most recent version that was known to work, and roll back to that automatically if the checksums fail? And if the checksums fail, couldn’t the system automatically flush the cached versions with the bad checksums? It feels like we are blaming the provider of the dataset, when in fact, there are things that the system could do to ease the pain. Let’s take these error messages seriously. There are too many of them involving too many different datasets.
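If you want to flush the bad cached copies by hand in the meantime, something like the sketch below works, assuming the library's default cache location of ~/.cache/huggingface/datasets (the exact layout can differ between datasets versions, and wiki_lingua is just the example dataset here):

import shutil
from pathlib import Path

# Default cache root used by the datasets library (can be overridden via HF_DATASETS_CACHE)
cache_root = Path.home() / ".cache" / "huggingface" / "datasets"

# Remove the prepared copy of the affected dataset so it gets rebuilt on the next load
shutil.rmtree(cache_root / "wiki_lingua", ignore_errors=True)

# The raw downloads are keyed by URL hash under downloads/; clearing that folder
# (blunt, but effective) drops the files whose checksums no longer match
shutil.rmtree(cache_root / "downloads", ignore_errors=True)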

For everyone's information: we are removing the checksum verifications, which were creating a bad user experience. This will be rolled out in the coming weeks.

We have fixed the issues with the datasets:

  • wider_face: by hosting their data files on the HuggingFace Hub (CC: @HosseynGT)
  • fever: by updating to their new data URLs (CC: @MoritzLaurer)

@afcruzs-ms I think your issue is a different one, because that dataset is not hosted on Google Drive. Would you mind opening another issue for that problem, please? Thanks! 😃

Hi @rafikg, I think that is yet another issue. Let me check it…

My guess is that you are using a different Python version than the one the dataset owner used to create the pickle file…

I can see this problem in xcopa too; unfortunately, installing the latest master (1.18.4.dev0) doesn't work, @albertvillanova.

from datasets import load_dataset
dataset = load_dataset("xcopa", "it")

Throws

in verify_checksums(expected_checksums, recorded_checksums, verification_name)
     38     if len(bad_urls) > 0:
     39         error_msg = "Checksums didn't match" + for_verification_name + ":\n"
---> 40         raise NonMatchingChecksumError(error_msg + str(bad_urls))
     41     logger.info("All the checksums matched successfully" + for_verification_name)
     42 

NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://github.com/cambridgeltl/xcopa/archive/master.zip']

I think this is a side effect of #3787. The checksums won't match because the URLs have changed. @rafikg @Y0mingZhang, until this is fixed, maybe you can load the datasets as such:

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]", ignore_verifications=True)
dataset = load_dataset("dbpedia_14", ignore_verifications=True)

This will, most probably, skip the verifications and integrity checks listed here.
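If the mismatched files are already in your local cache, it may also be necessary to combine this with a forced re-download; a hedged variant of the call above (not something verified by the maintainers in this thread):

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]", ignore_verifications=True, download_mode="force_redownload")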