vision: Unable to load CelebA dataset. File is not zip file error.

🐛 Bug

Unable to download the CelebA dataset and load it into a DataLoader.

To Reproduce

  1. Try to load the CelebA dataset with download=True; it raises an error:
import torch
from torchvision import datasets, transforms

batch_size = 25
train_loader = torch.utils.data.DataLoader(
    datasets.CelebA('../data', split="train", download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.5,), (0.5,))
                    ])),
    batch_size=batch_size, shuffle=True)

Traceback

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in __init__(self, root, split, target_type, transform, target_transform, download)
     64 
     65         if download:
---> 66             self.download()
     67 
     68         if not self._check_integrity():

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in download(self)
    118             download_file_from_google_drive(file_id, os.path.join(self.root, self.base_folder), filename, md5)
    119 
--> 120         with zipfile.ZipFile(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"), "r") as f:
    121             f.extractall(os.path.join(self.root, self.base_folder))
    122 

/usr/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1129         try:
   1130             if mode == 'r':
-> 1131                 self._RealGetContents()
   1132             elif mode in ('w', 'x'):
   1133                 # set the modified flag so central directory gets written

/usr/lib/python3.6/zipfile.py in _RealGetContents(self)
   1196             raise BadZipFile("File is not a zip file")
   1197         if not endrec:
-> 1198             raise BadZipFile("File is not a zip file")
   1199         if self.debug > 1:
   1200             print(endrec)

BadZipFile: File is not a zip file

Environment

  • PyTorch version: 1.5.0+cu101
  • Is debug build: No
  • CUDA used to build PyTorch: 10.1
  • OS: Ubuntu 18.04.3 LTS
  • GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • CMake version: version 3.12.0
  • Python version: 3.6
  • Is CUDA available: Yes
  • CUDA runtime version: 10.1.243
  • GPU models and configuration: GPU 0: Tesla T4
  • Nvidia driver version: 418.67
  • cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:

  • [pip3] numpy==1.18.4
  • [pip3] torch==1.5.0+cu101
  • [pip3] torchsummary==1.5.1
  • [pip3] torchtext==0.3.1
  • [pip3] torchvision==0.6.0+cu101

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 23
  • Comments: 24 (3 by maintainers)

Most upvoted comments

This is still an issue FYI

Problem still exists. (Jun 14)

This issue still persists. Is there a way to get the dataset and load it just as we would through torchvision.datasets?

Seems this is a known issue, but I wanted to raise it again as per @pmeier's comment. I didn't want to open another ticket for it, though.

This has nothing to do with the loader. We can get the same result with:

from torchvision import datasets

dataset = datasets.CelebA(".", split="train", download=True)

The underlying problem was reported in #1920: Google Drive has a daily maximum quota for any file, which seems to be exceeded for the CelebA files. You can see this in the response, which is written without any check to every .txt and .zip file:

<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><link href=&#47;static&#47;doclist&#47;client&#47;css&#47;1659352109&#45;untrustedcontent.css rel="stylesheet"><link rel="icon" href="https://ssl.gstatic.com/docs/doclist/images/infinite_arrow_favicon_4.ico"/><style nonce="0AwDvc7jesmreq9s3Zkdcw">#gbar,#guser{font-size:13px;padding-top:0px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}
</style><script nonce="0AwDvc7jesmreq9s3Zkdcw"></script></head><body><div id=gbar><nobr><a target=_blank class=gb1 href="https://www.google.de/webhp?tab=ow">Search</a> <a target=_blank class=gb1 href="http://www.google.de/imghp?hl=en&tab=oi">Images</a> <a target=_blank class=gb1 href="https://maps.google.de/maps?hl=en&tab=ol">Maps</a> <a target=_blank class=gb1 href="https://play.google.com/?hl=en&tab=o8">Play</a> <a target=_blank class=gb1 href="https://www.youtube.com/?gl=DE&tab=o1">YouTube</a> <a target=_blank class=gb1 href="https://mail.google.com/mail/?tab=om">Gmail</a> <b class=gb1>Drive</b> <a target=_blank class=gb1 href="https://www.google.com/calendar?tab=oc">Calendar</a> <a target=_blank class=gb1 style="text-decoration:none" href="https://www.google.de/intl/en/about/products?tab=oh"><u>More</u> &raquo;</a></nobr></div><div id=guser width=100%><nobr><span id=gbn class=gbi></span><span id=gbf class=gbf></span><span id=gbe></span><a target="_self" href="/settings?hl=en_US" class=gb4>Settings</a> | <a target=_blank  href="//support.google.com/drive/?p=web_home&hl=en_US" class=gb4>Help</a> | <a target=_top id=gb_70 href="https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://docs.google.com/uc%3Fexport%3Ddownload%26id%3D0B7EVK8r0v71pY0NSMzRuSXJEVkk&service=writely" class=gb4>Sign in</a></nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div><div class="uc-main"><div id="uc-text"><p class="uc-error-caption">Sorry, you can&#39;t view or download this file at this time.</p><p class="uc-error-subcaption">Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. 
If you still can't access a file after 24 hours, contact your domain administrator.</p></div></div><div class="uc-footer"><hr class="uc-footer-divider">&copy; 2020 Google - <a class="goog-link" href="//support.google.com/drive/?p=web_home">Help</a> - <a class="goog-link" href="//support.google.com/drive/bin/answer.py?hl=en_US&amp;answer=2450387">Privacy & Terms</a></div></body></html>
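To check whether a file already on disk is a real archive or a saved copy of the quota page above, something like the following may help. It is only a sketch using the standard library; the function name `diagnose_download` is hypothetical and not part of torchvision.

```python
import zipfile


def diagnose_download(path):
    """Classify a downloaded file as 'zip', 'html', or 'unknown'."""
    path = str(path)
    if zipfile.is_zipfile(path):
        return "zip"
    # Read only the first bytes: a saved error page starts with an
    # HTML doctype or an <html> tag rather than archive data.
    with open(path, "rb") as f:
        head = f.read(512).lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

If this returns "html" for img_align_celeba.zip, the file is most likely the quota page; deleting it and retrying later (or downloading manually) is the only fix on the user side.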

@ajayrfhp The only "solution" we can offer is to tell you to wait and try again, since we have no control over this issue. You can ask the authors of the dataset to host it on a platform that does not have daily quotas. If you do and they go through with your proposal, please let us know so that we can adapt our code.

@fmassa We should check the contents of the response before we write it to the files, and raise a descriptive error message if it is not the expected data.
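A minimal sketch of that idea, assuming the download arrives as an iterable of byte chunks. The name `write_checked` and the detection heuristic are hypothetical illustrations, not the code torchvision actually uses.

```python
def write_checked(chunks, path):
    """Write byte chunks to path, refusing an HTML error page."""
    chunks = iter(chunks)
    first = next(chunks, b"")
    head = first.lstrip().lower()
    # Heuristic (an assumption): the quota page is an HTML document,
    # while the real files are zip archives or plain-text annotations.
    if head.startswith(b"<!doctype html") or b"quota exceeded" in head:
        raise RuntimeError(
            f"Refusing to write {path}: the server returned an HTML page "
            "instead of the file (possibly the Google Drive quota page). "
            "Try again later or download the file manually."
        )
    # Only open the target file once the first chunk has passed the check.
    with open(path, "wb") as f:
        f.write(first)
        for chunk in chunks:
            f.write(chunk)
```

Inspecting the first chunk before opening the target file means a failed download leaves nothing on disk, so a later integrity check is not confused by a bogus .zip or .txt.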

The problem still exists.

Hello everyone! Based on this discussion, these steps can help (they worked perfectly for me):

  1. Create a directory named celeba and download into it all the files from the CelebA Google Drive that are listed in this file_list
  2. Unzip img_align_celeba.zip in the ./celeba directory (I'm not sure whether you should delete the zip file after unpacking)
  3. Then run the code, making sure to pass download=False:
import torchvision.datasets as dset
img_path = './celeba'
data = dset.celeba.CelebA(root=img_path, split="train", target_type='attr', transform=None, download=False)
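Before calling the loader with download=False, it can be worth checking the manual layout. The traceback above shows that torchvision joins self.root and self.base_folder; the sketch below assumes base_folder is "celeba" and assumes the annotation file names listed in `EXPECTED_TXT_FILES`. Both are assumptions on my part, and `check_celeba_layout` is a hypothetical helper, not a torchvision function.

```python
import os

# Assumed annotation file names; verify against the file_list in your
# torchvision version.
EXPECTED_TXT_FILES = (
    "list_attr_celeba.txt",
    "identity_CelebA.txt",
    "list_bbox_celeba.txt",
    "list_landmarks_align_celeba.txt",
    "list_eval_partition.txt",
)


def check_celeba_layout(root, base_folder="celeba"):
    """Return a list of problems found; an empty list means the layout looks fine."""
    folder = os.path.join(root, base_folder)
    problems = []
    for name in EXPECTED_TXT_FILES:
        fpath = os.path.join(folder, name)
        if not os.path.isfile(fpath):
            problems.append(f"missing file: {fpath}")
            continue
        # A quota page saved under a .txt name starts with an HTML doctype.
        with open(fpath, "rb") as f:
            head = f.read(64).lstrip().lower()
        if head.startswith(b"<!doctype html"):
            problems.append(f"HTML error page saved as: {fpath}")
    images = os.path.join(folder, "img_align_celeba")
    if not os.path.isdir(images):
        problems.append(f"missing image folder: {images}")
    return problems
```

If the base_folder assumption holds, root should be the parent of the celeba directory, so a "missing file" report for ./celeba/celeba/... suggests passing the parent directory as root instead.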

This tutorial worked for me!

I would just like to add that the authors' website also links to a Baidu Drive from which you can download the data. The dataset is also available on Kaggle.

This was fixed in #4109, but the commit is not yet included in a stable release. It will be in the upcoming one.

@fmassa I suggest we wait for another issue raising this problem. At least, I won't check daily whether the quota is exceeded. If there is another issue for this and I miss it, or you happen to find a day when we can fix this, feel free to tag me. I'll see what I can do.