datasets: [GSoC] Update checksums for missing urls

Thanks to https://github.com/tensorflow/datasets/commit/5a8ee376c39a885c9132cdd22a8b2669febe6366 (https://github.com/tensorflow/datasets/issues/1397), we now check that all URLs passed to dl_manager.download() are correctly registered.

However, some URLs are still not registered for some of our datasets. We should generate the checksums for the missing URLs by running the download_and_prepare script with --register_checksums, as described in https://www.tensorflow.org/datasets/add_dataset#2_run_download_and_prepare_locally.
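For anyone who wants to sanity-check a generated entry by hand, here is a minimal sketch of computing a file's size and SHA-256 digest with the standard library. It assumes a registered entry pairs a URL with the downloaded file's size and SHA-256; `size_and_sha256` is a hypothetical helper, not part of the TFDS API, and the script with --register_checksums remains the way to actually register checksums.

```python
import hashlib


def size_and_sha256(path, chunk_size=1 << 20):
    """Return (size_in_bytes, hex_sha256) for the file at `path`.

    Hypothetical helper for manually cross-checking a downloaded file;
    it reads the file in chunks so large archives are not loaded into memory.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()
```

Calling `size_and_sha256("/path/to/downloaded/file")` on a file you downloaded yourself gives values you can compare against what the script wrote.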

Note: The dataset doesn’t need to be fully generated, so you can modify the dataset implementation to only download the files.

The datasets to update are the ones with SKIP_CHECKSUMS = True. To find out which URLs are missing, run the tests; the error message should list them.
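To list the affected datasets, something like the following sketch could help. It assumes the flag appears literally as `SKIP_CHECKSUMS = True` at the start of a line (after indentation) in a dataset's .py file; `find_skip_checksums` is a hypothetical helper and the root directory is whatever your local checkout uses.

```python
import re
from pathlib import Path

# Matches an assignment such as "  SKIP_CHECKSUMS = True" at the start of a line.
SKIP_PATTERN = re.compile(r"^\s*SKIP_CHECKSUMS\s*=\s*True\b", re.MULTILINE)


def find_skip_checksums(root):
    """Return sorted paths of .py files under `root` that set SKIP_CHECKSUMS = True."""
    matches = []
    for path in Path(root).rglob("*.py"):
        # Ignore undecodable bytes so one odd file does not abort the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if SKIP_PATTERN.search(text):
            matches.append(str(path))
    return sorted(matches)
```

Running `find_skip_checksums(".")` from the repository root returns the files to look at; the test error messages remain the source of truth for which URLs are actually missing.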

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

No, it should work fine for both deleted and modified files. See link

@Eshan-Agarwal Nice catch, I’ll update this internally when merging the dataset.