vision: Unable to load CelebA dataset. "File is not a zip file" error.

🐛 Bug

Unable to download the CelebA dataset and load it into a DataLoader.

To Reproduce

Loading the CelebA dataset with download=True raises an error:

import torch
from torchvision import datasets, transforms

batch_size = 25
train_loader = torch.utils.data.DataLoader(
        datasets.CelebA('../data', split="train", download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           # Scale pixel values from [0, 1] to [-1, 1].
                           transforms.Normalize((0.5,), (0.5,))
                       ])),
        batch_size=batch_size, shuffle=True)

Returns

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in __init__(self, root, split, target_type, transform, target_transform, download)
     64 
     65         if download:
---> 66             self.download()
     67 
     68         if not self._check_integrity():

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in download(self)
    118             download_file_from_google_drive(file_id, os.path.join(self.root, self.base_folder), filename, md5)
    119 
--> 120         with zipfile.ZipFile(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"), "r") as f:
    121             f.extractall(os.path.join(self.root, self.base_folder))
    122 

/usr/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1129         try:
   1130             if mode == 'r':
-> 1131                 self._RealGetContents()
   1132             elif mode in ('w', 'x'):
   1133                 # set the modified flag so central directory gets written

/usr/lib/python3.6/zipfile.py in _RealGetContents(self)
   1196             raise BadZipFile("File is not a zip file")
   1197         if not endrec:
-> 1198             raise BadZipFile("File is not a zip file")
   1199         if self.debug > 1:
   1200             print(endrec)

BadZipFile: File is not a zip file
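
A file that fails this way may not be an archive at all. The following is a minimal sketch, not part of torchvision, for inspecting what a downloaded file actually contains. The function name describe_download and its return strings are hypothetical; it relies only on zipfile.is_zipfile from the standard library and a look at the first bytes of the file.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Return a short description of what a downloaded file appears to contain.

    Hypothetical helper for diagnosis only; it is not a torchvision function.
    """
    path = Path(path)
    # is_zipfile looks for the zip end-of-central-directory record.
    if zipfile.is_zipfile(path):
        return "zip archive"
    # Read only the first bytes; that is enough to recognise an HTML page.
    with open(path, "rb") as f:
        head = f.read(512)
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "HTML page"
    return "unknown content"
```

For example, describe_download("../data/celeba/img_align_celeba.zip") reporting "HTML page" would indicate that the server returned a web page instead of the archive.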

Environment

PyTorch version: 1.5.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.12.0

Python version: 3.6

Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 418.67
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:

[pip3] numpy==1.18.4
[pip3] torch==1.5.0+cu101
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.6.0+cu101

About this issue

State: closed
Created 4 years ago
Reactions: 23
Comments: 24 (3 by maintainers)

Most upvoted comments

This is still an issue FYI


import-antigravity on Mar 16, 2021

Problem still exists. (Jun 14)

Ji-Xinyou on Jun 14, 2022

This issue still persists. Is there a way to get the dataset and load it just as we would through torchvision.datasets?

marzmesas on Mar 29, 2022

This seems to be a known issue, but I wanted to raise it again as per @pmeier’s comment. I didn’t want to open another ticket for it, though.

jotterbach on Aug 7, 2020

same

FrancescoSaverioZuppichini on Apr 16, 2021

This has nothing to do with the loader. We can get the same result with

from torchvision import datasets

dataset = datasets.CelebA(".", split="train", download=True,)

The underlying problem was reported in #1920: Google Drive has a daily maximum quota for any file, which seems to be exceeded for the CelebA files. You can see this in the response, which is written unchecked to every .txt and .zip file in place of the real content:

<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><link href=&#47;static&#47;doclist&#47;client&#47;css&#47;1659352109&#45;untrustedcontent.css rel="stylesheet"><link rel="icon" href="https://ssl.gstatic.com/docs/doclist/images/infinite_arrow_favicon_4.ico"/><style nonce="0AwDvc7jesmreq9s3Zkdcw">#gbar,#guser{font-size:13px;padding-top:0px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}
</style><script nonce="0AwDvc7jesmreq9s3Zkdcw"></script></head><body><div id=gbar><nobr><a target=_blank class=gb1 href="https://www.google.de/webhp?tab=ow">Search</a> <a target=_blank class=gb1 href="http://www.google.de/imghp?hl=en&tab=oi">Images</a> <a target=_blank class=gb1 href="https://maps.google.de/maps?hl=en&tab=ol">Maps</a> <a target=_blank class=gb1 href="https://play.google.com/?hl=en&tab=o8">Play</a> <a target=_blank class=gb1 href="https://www.youtube.com/?gl=DE&tab=o1">YouTube</a> <a target=_blank class=gb1 href="https://mail.google.com/mail/?tab=om">Gmail</a> <b class=gb1>Drive</b> <a target=_blank class=gb1 href="https://www.google.com/calendar?tab=oc">Calendar</a> <a target=_blank class=gb1 style="text-decoration:none" href="https://www.google.de/intl/en/about/products?tab=oh"><u>More</u> &raquo;</a></nobr></div><div id=guser width=100%><nobr><span id=gbn class=gbi></span><span id=gbf class=gbf></span><span id=gbe></span><a target="_self" href="/settings?hl=en_US" class=gb4>Settings</a> | <a target=_blank  href="//support.google.com/drive/?p=web_home&hl=en_US" class=gb4>Help</a> | <a target=_top id=gb_70 href="https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://docs.google.com/uc%3Fexport%3Ddownload%26id%3D0B7EVK8r0v71pY0NSMzRuSXJEVkk&service=writely" class=gb4>Sign in</a></nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div><div class="uc-main"><div id="uc-text"><p class="uc-error-caption">Sorry, you can&#39;t view or download this file at this time.</p><p class="uc-error-subcaption">Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. 
If you still can't access a file after 24 hours, contact your domain administrator.</p></div></div><div class="uc-footer"><hr class="uc-footer-divider">&copy; 2020 Google - <a class="goog-link" href="//support.google.com/drive/?p=web_home">Help</a> - <a class="goog-link" href="//support.google.com/drive/bin/answer.py?hl=en_US&amp;answer=2450387">Privacy & Terms</a></div></body></html>

@ajayrfhp The only “solution” we can offer is to tell you to wait and try again, since we have no control over this issue. You can ask the authors of the dataset to host it on a platform that does not have daily quotas. If you do and they go through with your proposal, please inform us so that we can adapt our code.

@fmassa We should check the contents of the response before we write them to the files, and raise a descriptive error message if they are not the expected data.
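
As a rough illustration of that suggestion, here is a minimal sketch of such a check. It is not the code torchvision uses; the function name save_response and the constant QUOTA_MARKER are hypothetical, and the marker string is simply the page title from the response quoted above.

```python
# Title of the quota page quoted above; assumed to appear near the start of the body.
QUOTA_MARKER = b"Google Drive - Quota exceeded"


def save_response(content, dest):
    """Write downloaded bytes to dest unless they look like the quota page.

    Hypothetical helper: `content` is the raw response body as bytes.
    """
    if QUOTA_MARKER in content[:4096]:
        raise RuntimeError(
            "The daily download quota for this file is exceeded. "
            "Try again later or download the file manually."
        )
    with open(dest, "wb") as f:
        f.write(content)
```

With a check like this, the failure surfaces at download time with a readable message, rather than later as BadZipFile when the saved HTML page is opened as an archive.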

pmeier on May 27, 2020

The problem still exists.

ozturkoktay on Sep 22, 2023

Hello everyone! Based on this discussion, these steps may help (they worked perfectly for me):

1. Create a directory named celeba and download into it all the files from the CelebA Google Drive that are listed in this file_list.
2. Unzip img_align_celeba.zip in the ./celeba directory (I’m not sure whether you should delete the zip file after unpacking).
3. Run the code with the download=False parameter, which is required:

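import torchvision.datasets as dset
# download=False skips the Google Drive download and uses the files already on disk.
# Note: the traceback in the issue shows files being resolved under
# os.path.join(root, base_folder), so adjust img_path if your layout differs.
img_path = './celeba'
data = dset.celeba.CelebA(root=img_path, split="train", target_type='attr', transform=None, download=False)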

This tutorial worked for me!
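
Before running the snippet above, it may help to confirm that the manually downloaded files are where the loader will look. The sketch below is an assumption-laden helper, not part of torchvision: the name missing_celeba_files is hypothetical, the base folder "celeba" is assumed, and the file names in EXPECTED_FILES are assumptions that should be checked against the file list of your torchvision version.

```python
import os

# Assumed file names; verify them against the file list of your torchvision version.
EXPECTED_FILES = (
    "img_align_celeba.zip",
    "list_attr_celeba.txt",
    "identity_CelebA.txt",
    "list_bbox_celeba.txt",
    "list_landmarks_align_celeba.txt",
    "list_eval_partition.txt",
)


def missing_celeba_files(root, base_folder="celeba", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from root/base_folder.

    Hypothetical helper; it only checks that the files exist, not their checksums.
    """
    folder = os.path.join(root, base_folder)
    return [name for name in expected
            if not os.path.isfile(os.path.join(folder, name))]
```

An empty list means every expected file is present under root/base_folder. A non-empty list names the files that still need to be downloaded or moved.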

univanxx on Sep 2, 2022

I would just like to add that the authors also include a Baidu drive you can download the data from on their website. The dataset is also available on Kaggle.

AndrewUlmer on Mar 17, 2021

This was fixed in #4109, but the commit is not yet included in a stable release. It will be in the upcoming one.

pmeier on Sep 30, 2021

@fmassa I suggest we wait for another issue raising this problem. At least I won’t check daily whether this quota is exceeded. If there is another issue for this and I miss it, or you somehow find a day when we can fix this, feel free to tag me in. I’ll see what I can do.

pmeier on Jun 1, 2020