Multi30K dataset link is broken

The link to Multi30K dataset at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz is broken: https://github.com/pytorch/text/blob/73bf4fa8cedc12d910ab76190e446bd2e47a8325/torchtext/datasets/multi30k.py#L16

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 18 (4 by maintainers)

Most upvoted comments

Found a local copy of the dataset and uploaded it to github (it’s rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k

Just in case: all rights belong to the original authors of the dataset; this is only a temporary copy for convenience.

Thanks, @Nayef211, @rrmina !

No idea what exactly is wrong with the data; the files above came from the ~/.torchtext/cache/Multi30k directory of one of my students.

I tried simply renaming the archive (to match the name in the torchtext docs) and the files in it, and changing the MD5 to the correct one, and it seems to work.

Including the approach suggested by @Nayef211, which is way more elegant, the final algorithm should be the following:

from torchtext.datasets import multi30k, Multi30k

# Update URLs to point to data stored by user
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

# Update hash since there is a discrepancy between user hosted test split and that of the test split in the original dataset 
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

data_train = Multi30k(split='train')
data_val = Multi30k(split='valid')
data_test = Multi30k(split='test')

Test data has 1000 sentences, which seems correct.

Plus, besides commenting out the previous URL, you also need to change the MD5 in torchtext/datasets/multi30k.py.

# URL = {
#     'train': r'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz',
#     'valid': r'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz',
#     'test': r'http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz',
# }
# 
# MD5 = {
#     'train': '20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e',
#     'valid': 'a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c',
#     'test': '0681be16a532912288a91ddd573594fbdd57c0fbb81486eff7c55247e35326c2',
# }

URL = {
    "train": r"https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz",
    "valid": r"https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz",
    "test": r"https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz",
}

MD5 = {
    "train": "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e",
    "valid": "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c",
    "test": "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36",
}
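Before editing the hash table, it can help to compute the digest of the archive you actually have on disk. The values above are 64 hex characters long, which is the length of a SHA-256 digest rather than an MD5 one, so this sketch assumes SHA-256 despite the dictionary's name. The helper names `sha256_of_file` and `matches_expected` are made up for illustration and are not part of torchtext.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.lower()
```

For example, calling `matches_expected` on the downloaded test archive with the `"test"` value from the table above would tell you whether the file on disk is the one that hash was computed from.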

Please refer to the next answer for an updated example.

Example code to make it work (tested on Colab):

!pip install torchdata
!mkdir -p ~/.torchtext/cache/Multi30k
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz

from torchtext.datasets import Multi30k
train_iter = Multi30k(split="train")

Thanks for the instructions. I’ve had to manually extract the mmt16_task1_test.tar.gz file, as it wasn’t automatically handled by datasets.Multi30k for some reason. The mmt16 file contains multiple files, not just the expected test.en and test.de. Might be worth a note to save others some time!
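For anyone else extracting the archive by hand, here is a minimal sketch using the standard-library `tarfile` module. It assumes the wanted files can be identified by their exact base names (for instance `test.en` and `test.de`); the function name `extract_named_members` is hypothetical.

```python
import os
import tarfile


def extract_named_members(archive_path, wanted_names, dest_dir):
    """Extract only regular files whose base name is in wanted_names.

    Returns the list of paths written; everything else in the archive is skipped.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            base = os.path.basename(member.name)
            if not member.isfile() or base not in wanted_names:
                continue
            # Copy the member's bytes out under its base name, which avoids
            # recreating any directory layout stored in the archive.
            source = tar.extractfile(member)
            target = os.path.join(dest_dir, base)
            with open(target, "wb") as out:
                out.write(source.read())
            written.append(target)
    return written
```

Something like `extract_named_members("mmt16_task1_test.tar.gz", {"test.en", "test.de"}, "Multi30k")` would then leave only those two files in the destination directory.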

It wasn’t automatically extracted because the mmt16_task1_test.tar.gz archive contains Apple metadata files (._test.de, ._test.en, and ._test.fr) that match the filter and get extracted instead. It would be good to fix the archive file, but meanwhile this patch for _filter_fn can help it pick the correct file from the archive:

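import torchtext.datasets.multi30k

# Require a leading "/" so that the "._test.*" metadata files no longer match.
def _filter_fn(split, language_pair, i, x):
    return f"/{torchtext.datasets.multi30k._PREFIX[split]}.{language_pair[i]}" in x[0]
torchtext.datasets.multi30k._filter_fn = _filter_fn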
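To see why the leading slash matters, here is a small stand-alone illustration that needs no torchtext install. Both predicates are simplified stand-ins: `loose_match` is a plain substring test (assumed to resemble the unpatched filter), and `strict_match` mirrors the patch above. The example paths are made up.

```python
def loose_match(prefix, lang, path):
    """Plain substring test (assumed to resemble the unpatched filter)."""
    return f"{prefix}.{lang}" in path


def strict_match(prefix, lang, path):
    """Substring test anchored on a leading '/', as in the patch above."""
    return f"/{prefix}.{lang}" in path


# Hypothetical member paths, as they might appear once the archive is opened.
paths = [
    "cache/mmt16_task1_test.tar.gz/._test.de",
    "cache/mmt16_task1_test.tar.gz/test.de",
    "cache/mmt16_task1_test.tar.gz/test.en",
]

loose_hits = [p for p in paths if loose_match("test", "de", p)]
strict_hits = [p for p in paths if strict_match("test", "de", p)]

print(loose_hits)   # the metadata file and the real German file
print(strict_hits)  # only the real German file
```

The loose test accepts `._test.de` because `test.de` is a substring of it, while the anchored test only accepts a file whose name starts right after a path separator.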

Example code to make it work (tested on Colab):

!pip install torchdata

!mkdir -p ~/.torchtext/cache/Multi30k

!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz
!wget -P ~/.torchtext/cache/Multi30k https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz


# now everything works as intended
from torchtext.datasets import Multi30k
train_iter = Multi30k(split="train")
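The `wget` calls above can also be done from Python, which is handy outside a notebook. This is a minimal sketch, assuming the cache directory is `~/.torchtext/cache/Multi30k` as in the commands above; `fetch_to_cache` is a hypothetical helper, not a torchtext function.

```python
import os
import urllib.request


def fetch_to_cache(urls, cache_dir):
    """Download each URL into cache_dir unless a file of that name already exists.

    Returns the local file paths, in the same order as urls.
    """
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    local_paths = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        target = os.path.join(cache_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
        local_paths.append(target)
    return local_paths
```

Passing the three archive URLs from the commands above together with `"~/.torchtext/cache/Multi30k"` should leave the same files in place as the `wget` version, and a second call skips anything already downloaded.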

Just wanted to mention another approach to get Multi30k working with the data you are hosting, @neychev. Rather than downloading the data directly using wget, we can programmatically modify the URLs that each split of the dataset is downloaded from, as follows:

from torchtext.datasets import multi30k, Multi30k

# Update URLs to point to data stored by user
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz"

# Update hash since there is a discrepancy between user hosted test split and that of the test split in the original dataset 
multi30k.MD5["test"] = "876a95a689a2a20b243666951149fd42d9bfd57cbbf8cd2c79d3465451564dd2"

dp = Multi30k(split='train')

As @rrmina mentioned earlier, this approach still doesn’t work with the test split. If I try to print the contents of the test split, I don’t get any outputs. @neychev do you happen to know what the discrepancy is for mmt16_task1_test.tar.gz between the original test split and the one you host?

As a next step, I also plan to update our Multi30k dataset implementation so we can rely on the data stored in https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k until the dataset in the original server is restored. This way we don’t need to rely on any of the above hacks to get this dataset working. 😄