Unable to download IWSLT datasets

πŸ› Bug

Describe the bug
Unable to download the IWSLT2016 or IWSLT2017 datasets.

To Reproduce
Steps to reproduce the behavior:

from torchtext.datasets import IWSLT2016

train, valid, test = IWSLT2016()
src, tgt = next(iter(train))  # iterating the datapipe triggers the download, which fails

The same error occurs when trying to use IWSLT2017.

Expected behavior
The program returns the next src, tgt pair in the training data.

Screenshots
Full error logs are in this gist.

Environment
Included in the gist above.

Additional context
No additional context.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

As a temporary fix, I’m just downloading the datasets manually via the links in the documentation.

Then put the downloaded .tgz file into the proper directory: ~/.torchtext/cache/IWSLT2016/ for 2016, and similarly for 2017.

torchtext will then recognize the files and skip the download from GDrive.
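
For anyone who wants to script this workaround, here is a minimal sketch. The archive name (2016-01.tgz) and the source location are assumptions; use whatever the documentation links actually give you:

from pathlib import Path
import shutil

# Hypothetical paths: adjust the archive name and source directory to
# match what you actually downloaded from the documentation links.
downloaded_archive = Path.home() / "Downloads" / "2016-01.tgz"
cache_dir = Path.home() / ".torchtext" / "cache" / "IWSLT2016"

cache_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(downloaded_archive, cache_dir / downloaded_archive.name)

# With the archive in place, torchtext finds it and skips the GDrive download.
from torchtext.datasets import IWSLT2016
train, valid, test = IWSLT2016()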

@Nayef211 thanks, that sounds exactly like what I’m observing with IWSLT.

But I tried what is suggested in #1735 (note where end_caching sits here compared to the original code; the original ordering is sketched below for comparison):

from functools import partial  # needed for the partial(...) calls below

def _filter_clean_cache(cache_decompressed_dp, full_filepath, uncleaned_filename):
    # _return_full_filepath, _filter_filename_fn, and _clean_files_wrapper
    # are helpers defined elsewhere in torchtext's IWSLT module.
    cache_inner_decompressed_dp = cache_decompressed_dp.on_disk_cache(
        filepath_fn=partial(_return_full_filepath, full_filepath)
    )
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.open_files(mode="b").load_from_tar()
    # end_caching moved up here, before filter() and map(), instead of running last.
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.filter(partial(_filter_filename_fn, uncleaned_filename))
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.map(partial(_clean_files_wrapper, full_filepath))
    return cache_inner_decompressed_dp

I still get the same behaviour: the inner load_from_tar() never gets iterated over.
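
For comparison, this is the ordering I changed it from (my reading of the current torchtext source, not a verbatim copy), with end_caching applied last:

def _filter_clean_cache_original(cache_decompressed_dp, full_filepath, uncleaned_filename):
    # Same pipeline, but end_caching runs after filter() and map(), so the
    # extracted files are cleaned before being committed to the cache.
    cache_inner_decompressed_dp = cache_decompressed_dp.on_disk_cache(
        filepath_fn=partial(_return_full_filepath, full_filepath)
    )
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.open_files(mode="b").load_from_tar()
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.filter(partial(_filter_filename_fn, uncleaned_filename))
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.map(partial(_clean_files_wrapper, full_filepath))
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
    return cache_inner_decompressed_dp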

Also, I’m not very clear on what the timeout here is actually doing. Per the doc ("Integer value of seconds to wait for uncached item to be written to disk"), it seems to be the time spent waiting for the file to download. But reducing it to a very small value (1 second) changes nothing for me, even though the download takes longer than that. I suspect it has something to do with file locks?
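
For context, here is how I understand the timeout being wired in, based on reading torchdata’s on-disk cache (a sketch; the signature and the .promise mechanism may differ across versions):

from torchdata.datapipes.iter import HttpReader, IterableWrapper

def _filepath_fn(url):
    # Hypothetical helper mapping a URL to its on-disk cache location.
    return "/tmp/cache/" + url.split("/")[-1]

url_dp = IterableWrapper(["https://example.com/archive.tgz"])
cache_dp = url_dp.on_disk_cache(filepath_fn=_filepath_fn)
cache_dp = HttpReader(cache_dp)
# While one worker writes the file, a "<filepath>.promise" lock file exists
# on disk; other workers that need the same entry poll for up to `timeout`
# seconds. The downloading worker itself is not bounded by it, which would
# explain why a 1-second timeout changes nothing in a single-process run.
cache_dp = cache_dp.end_caching(mode="wb", same_filepath_fn=True, timeout=300)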

Also, I wonder if this and issue #1747 are somehow linked?

cc: @VitalyFedyunin

I agree that the message is cryptic for errors that are not timeouts. I will change it to point to some sort of diagnosis URL to help users figure out whether the pipeline is bad and there are real errors.

@lolzballs the caching issue you just mentioned seems to be related to https://github.com/pytorch/text/issues/1735.

cc @parmeet @VitalyFedyunin. I wonder if this is caused by the cache inconsistency issue you mention in https://github.com/pytorch/text/issues/1735#issuecomment-1137723096 when using filters in our dataset logic.

Do you think we could align the error between the two versions of TorchText?

I think one way to achieve this would be to go through the same error tracing as provided in the torchtext download hook for Google Drive. I am not exactly sure why this error message was removed from the GDriveReader implementation here when confirm_token is None?

I was under the impression that even when confirm_token is None, the download can still be valid and work as intended, which is why #1620 was resolved. Is that incorrect? If the download really can be valid without a token, we should not add that error back into TorchData.
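
For reference, a sketch of the confirm-token handshake being discussed (the general Google Drive download pattern, not the exact torchtext or torchdata code):

import requests

def gdrive_download(file_id, dest):
    url = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(url, params={"id": file_id}, stream=True)

    # Google Drive sets a "download_warning" cookie for files too large to
    # virus-scan; its value must be echoed back as confirm=<token>.
    confirm_token = None
    for key, value in response.cookies.items():
        if key.startswith("download_warning"):
            confirm_token = value

    if confirm_token is not None:
        response = session.get(url, params={"id": file_id, "confirm": confirm_token}, stream=True)
    # When confirm_token is None, the first response may already be the file
    # (small files), so a missing token is not necessarily an error.

    with open(dest, "wb") as f:
        for chunk in response.iter_content(chunk_size=32768):
            if chunk:
                f.write(chunk)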