text: Unable to download IWSLT datasets
🐛 Bug
Describe the bug: Unable to download the IWSLT2016 or IWSLT2017 datasets.
To Reproduce: Steps to reproduce the behavior:
```python
from torchtext.datasets import IWSLT2016

train, valid, test = IWSLT2016()
src, tgt = next(iter(train))
```
The same error occurs when trying to use IWSLT2017.
Expected behavior: The program returns the next src, tgt pair in the training data.
Screenshots: Full error logs are in this gist.
Environment: Included in the gist above.
Additional context: No additional context.
About this issue
- State: open
- Created 2 years ago
- Comments: 19 (12 by maintainers)
As a temporary fix, I'm just downloading the datasets manually via the links in the documentation:
- IWSLT2016
- IWSLT2017

Then you can put the downloaded .tgz file into the proper directory: ~/.torchtext/cache/IWSLT2016/ for 2016, and similarly for 2017. torchtext will then recognize the files and not download from GDrive.
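For reference, here is a minimal sketch of that manual workaround, assuming the default torchtext cache location mentioned above; the archive file name "2016-01.tgz" is a placeholder for whatever the documentation link actually serves:

```python
import os
import shutil

from torchtext.datasets import IWSLT2016

# Default cache location used by recent torchtext releases (assumption).
cache_dir = os.path.expanduser("~/.torchtext/cache/IWSLT2016")
os.makedirs(cache_dir, exist_ok=True)

# Copy the manually downloaded archive into the cache so torchtext picks it
# up instead of attempting the failing Google Drive download.
# "2016-01.tgz" is a placeholder name; use the file you actually downloaded.
shutil.copy("2016-01.tgz", cache_dir)

train, valid, test = IWSLT2016()
src, tgt = next(iter(train))
```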
@Nayef211 thanks, it does sound like exactly what I'm observing with IWSLT. But I tried what is suggested in #1735 (note the order of end_caching in my version versus the original code), and I still get the same behaviour: the inner load_from_tar() never gets iterated over.
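For context, a rough, self-contained sketch of the on_disk_cache / end_caching pattern being discussed; this is not the actual torchtext IWSLT pipeline, and the URL and paths are placeholders:

```python
from torchdata.datapipes.iter import FileOpener, HttpReader, IterableWrapper

URL = "https://example.com/dataset.tar.gz"  # placeholder, not the IWSLT URL

# Cache the raw download on disk; the download only happens when the
# pipeline is iterated and the target file is missing.
cache_dp = IterableWrapper([URL]).on_disk_cache(
    filepath_fn=lambda url: "/tmp/dataset.tar.gz"
)
cache_dp = HttpReader(cache_dp).end_caching(mode="wb", same_filepath_fn=True)

# Read the cached archive back and expand it. If the caching steps above are
# wired up in the wrong order, this inner load_from_tar() is never iterated,
# which matches the behaviour reported here.
archive_dp = FileOpener(cache_dp, mode="b").load_from_tar()

for file_name, stream in archive_dp:
    print(file_name)
```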
I agree that the message is cryptic when the error is not a timeout. I will change it to point to some sort of diagnosis URL to help users figure out whether the pipeline is bad and there are real errors.
@lolzballs the caching issue you just mentioned seems to be related to https://github.com/pytorch/text/issues/1735.
cc @parmeet @VitalyFedyunin I wonder if this is caused by the cache inconsistency issue you mention here https://github.com/pytorch/text/issues/1735#issuecomment-1137723096 when using filters in our dataset logic.
I was under the impression that even when the confirm_token is None, the download can still be valid and work as intended, hence why #1620 was resolved. Is that incorrect? If that is true, we should not add that back into TorchData.
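For illustration, a hedged sketch of the usual Google Drive confirm-token handshake that confirm_token refers to; this is not the TorchData GDriveReader implementation, and the URL and parameter names follow the common requests-based recipe:

```python
import requests

GDRIVE_URL = "https://docs.google.com/uc?export=download"

def get_confirm_token(response: requests.Response):
    # Large files trigger a "can't scan for viruses" interstitial and set a
    # download_warning cookie; small files are served directly, so the token
    # can legitimately be None while the download is still valid.
    for key, value in response.cookies.items():
        if key.startswith("download_warning"):
            return value
    return None

def download_from_gdrive(file_id: str, destination: str) -> None:
    session = requests.Session()
    response = session.get(GDRIVE_URL, params={"id": file_id}, stream=True)
    token = get_confirm_token(response)
    if token is not None:
        # Re-request with the confirmation token to get the real payload.
        response = session.get(
            GDRIVE_URL, params={"id": file_id, "confirm": token}, stream=True
        )
    with open(destination, "wb") as f:
        for chunk in response.iter_content(chunk_size=32768):
            if chunk:
                f.write(chunk)
```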