tensorflow: Cannot download celeb_a dataset from tensorflow_datasets :(

Running this

import tensorflow_datasets as tfds

(train_data, test_data), info = tfds.load(name = 'celeb_a', split = ['train', 'test'], as_supervised = True, shuffle_files = True, with_info = True)

Gives this

NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pZjFTYXZWM3FlRnM, downloaded to /root/tensorflow_datasets/downloads/ucexport_download_id_0B7EVK8r0v71pZjFTYXZWM3FlDDaXUAQO8EGH_a7VqGNLRtW52mva1LzDrb-V723OQN8.tmp.4ec0de7ede1541dca88a21190e298882/uc, has wrong checksum.

As far as I know, this issue occurs because the dataset is hosted on Google Drive.

About this issue

State: closed
Created 4 years ago
Reactions: 3
Comments: 17 (1 by maintainers)

Most upvoted comments

You can download from source if you’re getting the same problem.

lordtt13 on Jul 18, 2020

I think I have solved this problem (tested with TFDS v4.9.2). First, download the source code; I downloaded the whole celeb_a directory. Line 31 of the file celeb_a_dataset_builder.py reads import tensorflow_datasets.public_api as tfds. I replaced this line with import tensorflow_datasets as tfds. Next, build the dataset manually: open a terminal, cd to the celeb_a directory where you downloaded the source code, and run the command tfds build celeb_a. Surprisingly, the dataset is then downloaded automatically : )
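
The import edit described above can also be scripted. The following is a minimal sketch, assuming the builder file is named celeb_a_dataset_builder.py as in this comment; the helper name patch_public_api_import and the local path celeb_a/ are hypothetical. It rewrites only the one import line and leaves the rest of the file untouched.

```python
from pathlib import Path

OLD_IMPORT = "import tensorflow_datasets.public_api as tfds"
NEW_IMPORT = "import tensorflow_datasets as tfds"


def patch_public_api_import(builder_file):
    """Replace the public_api import line in place; return True if a line changed."""
    path = Path(builder_file)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    changed = False
    for i, line in enumerate(lines):
        # Match the whole line so unrelated imports are never touched.
        if line.strip() == OLD_IMPORT:
            lines[i] = line.replace(OLD_IMPORT, NEW_IMPORT)
            changed = True
    if changed:
        path.write_text("".join(lines), encoding="utf-8")
    return changed


if __name__ == "__main__":
    # Hypothetical location of the downloaded celeb_a directory.
    target = Path("celeb_a") / "celeb_a_dataset_builder.py"
    if target.exists():
        print("patched" if patch_public_api_import(target) else "nothing to patch")
```

After patching, running tfds build celeb_a from inside that directory is the step this comment describes.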

AmonB on Jun 23, 2023

Posting an alternative solution here since (1) automatically downloading CelebA still doesn’t work, and (2) I found the above solution too manual. This solution (tested with TFDS v3.2.1) simply overrides a method of the CelebA dataset builder to use manually downloaded data.

Step 1. Manually download data. This should get you the files {DATA_DIR}/img_align_celeba.zip, {DATA_DIR}/list_eval_partition.txt, {DATA_DIR}/list_landmarks_align_celeba.txt, and {DATA_DIR}/list_attr_celeba.txt. The following links are provided in the source code for the CelebA dataset builder.

img_align_celeba.zip: https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pZjFTYXZWM3FlRnM
list_eval_partition.txt: https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pY0NSMzRuSXJEVkk
list_landmarks_align_celeba.txt: https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pd0FJY3Blby1HUTQ
list_attr_celeba.txt: https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pblRyaVFSWGxPY0U
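
Before moving on, it may be worth checking that the downloaded files are what their names claim. A checksum mismatch like the one in the original error can happen when Google Drive serves an HTML page (for example a quota or confirmation page) instead of the file itself. The sketch below is only a sanity check under that assumption; the function name check_celeba_downloads is hypothetical, and the file names are the four listed above.

```python
import os
import zipfile

EXPECTED_FILES = [
    "img_align_celeba.zip",
    "list_eval_partition.txt",
    "list_landmarks_align_celeba.txt",
    "list_attr_celeba.txt",
]


def check_celeba_downloads(data_dir):
    """Return a dict mapping each expected file name to a short status string."""
    data_dir = os.path.expanduser(data_dir)
    report = {}
    for name in EXPECTED_FILES:
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            report[name] = "missing"
        elif name.endswith(".zip"):
            # An HTML page saved under a .zip name is not a valid archive.
            report[name] = "ok" if zipfile.is_zipfile(path) else "not a zip archive"
        else:
            # Read the first bytes to catch an HTML page saved as .txt.
            with open(path, "rb") as f:
                head = f.read(256).lstrip().lower()
            is_html = head.startswith(b"<!doctype html") or head.startswith(b"<html")
            report[name] = "looks like HTML" if is_html else "ok"
    return report


if __name__ == "__main__":
    for name, status in check_celeba_downloads("~/my_data_dir").items():
        print(f"{name}: {status}")
```

Any entry that is not "ok" suggests downloading that file again (for instance from a browser) before continuing.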

Step 2. Override the _split_generators function of the CelebA builder class. Then call download_and_prepare. Full code is provided below.

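import os

import tensorflow_datasets as tfds

# Expand "~" so the paths resolve to real files.
DATA_DIR = os.path.expanduser("~/my_data_dir")

class CelebA(tfds.image.celeba.CelebA):
  def _split_generators(self, dl_manager):
    # Point the builder at the manually downloaded files instead of downloading them.
    # os.path.join adds the path separator that plain string concatenation would omit.
    downloaded_dirs = {
      "img_align_celeba": os.path.join(DATA_DIR, "img_align_celeba.zip"),
      "list_eval_partition": os.path.join(DATA_DIR, "list_eval_partition.txt"),
      "list_attr_celeba": os.path.join(DATA_DIR, "list_attr_celeba.txt"),
      "landmarks_celeba": os.path.join(DATA_DIR, "list_landmarks_align_celeba.txt"),
    }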

    # Load all images in memory (~1 GiB)
    # Use split to convert: `img_align_celeba/000005.jpg` -> `000005.jpg`
    all_images = {
        os.path.split(k)[-1]: img
        for k, img in dl_manager.iter_archive(
            downloaded_dirs["img_align_celeba"]
        )
    }

    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "file_id": 0,
                "downloaded_dirs": downloaded_dirs,
                "downloaded_images": all_images,
            },
        ),
        tfds.core.SplitGenerator(
            name=tfds.Split.VALIDATION,
            gen_kwargs={
                "file_id": 1,
                "downloaded_dirs": downloaded_dirs,
                "downloaded_images": all_images,
            },
        ),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={
                "file_id": 2,
                "downloaded_dirs": downloaded_dirs,
                "downloaded_images": all_images,
            },
        ),
    ]

builder = CelebA()
builder.download_and_prepare()

You can now call tfds.load('celeb_a') or builder = tfds.builder('celeb_a'); builder.download_and_prepare() and reuse the prepared dataset.

berthyf96 on Apr 4, 2023

Hi there, I think there is a way, and I’ve managed to do what I wanted. Here’s my code:

import glob
import os

import pandas as pd
import tensorflow as tf

image_paths = sorted(glob.glob(os.path.join('dataset', 'img_align_celeba', 'img_align_celeba', '*.jpg')))

df = pd.read_csv('dataset/list_attr_celeba.csv')
df.replace(to_replace = -1, value = 0, inplace = True)
labels = df.iloc[:, 1:].values

print(image_paths[:2])
print(labels[:2])
# prints
'''
['dataset/img_align_celeba/img_align_celeba/000001.jpg', 'dataset/img_align_celeba/img_align_celeba/000002.jpg']
[[0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 1 0
  1 0 0 1]
 [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 1]]
'''

# here's the tensorflow part...
@tf.function
def read_image(image_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_image(image, channels = 3, dtype = tf.float32)
    return image

@tf.function
def normalize(image):
    image = (image - tf.reduce_min(image))/(tf.reduce_max(image) - tf.reduce_min(image))
    image = (2 * image) - 1
    return image

@tf.function
def augment(image):
    image = tf.image.random_crop(image, (178, 178, 3))
    image = tf.image.resize(image, (256, 256))
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_saturation(image, 0.5, 2.0)
    image = tf.image.random_brightness(image, 0.5)
    return image

@tf.function
def preprocess(image_path, label):
    image = read_image(image_path)
    image = augment(image)
    image = normalize(image)
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
dataset = dataset.map(preprocess, num_parallel_calls = tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(buffer_size = 1024)
dataset = dataset.batch(batch_size = 128)
dataset = dataset.prefetch(buffer_size = tf.data.experimental.AUTOTUNE)

for x, y in dataset:
    break

print(x.shape, y.shape)
# prints
# (128, 256, 256, 3) (128, 40)

With this code, everything is working fine for now. Please also share your opinion on reading the data this way, in terms of performance and memory, and any modifications I should make. If everything is okay, please add this example to the official TensorFlow documentation; it’d be very helpful for others as well.

Now one thing I’d still like to achieve: in the augment function, I can’t get the dimensions of the image. For example, if I run

print(min(image.shape[:-1])) # :-1 to ignore the channels

inside this augment function to get the minimum of height and width, then it gives me <unknown>. I want to replace this line

“image = tf.image.random_crop(image, (178, 178, 3))” with this

min_dim = min(image.shape[:-1])
image = tf.image.random_crop(image, (min_dim, min_dim, 3))

I tried removing the @tf.function decorator, but it still doesn’t work.

So is there a way I can get this last part done?

Thanks for your time.

braindotai on Jul 20, 2020
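
On the last question above, one possible approach is sketched here. Inside dataset.map the function is traced as a graph, so image.shape can be unknown at trace time (tf.image.decode_image in particular does not set a static shape), whereas tf.shape(image) returns the actual dimensions as a tensor when the pipeline runs. The helper name random_square_crop is hypothetical, the images are synthetic stand-ins, and this is a sketch under those assumptions, not a definitive fix.

```python
import tensorflow as tf


def random_square_crop(image):
    """Randomly crop a square whose side is the smaller of height and width."""
    shape = tf.shape(image)  # dynamic shape, usable even when static dims are None
    min_dim = tf.minimum(shape[0], shape[1])
    # The crop size may itself be a tensor, so it can differ per image.
    return tf.image.random_crop(image, tf.stack([min_dim, min_dim, 3]))


if __name__ == "__main__":
    # Synthetic images of different sizes stand in for decoded CelebA files.
    def make_images():
        yield tf.zeros((218, 178, 3), tf.float32)
        yield tf.zeros((100, 160, 3), tf.float32)

    ds = tf.data.Dataset.from_generator(
        make_images,
        output_signature=tf.TensorSpec(shape=(None, None, 3), dtype=tf.float32),
    )
    ds = ds.map(random_square_crop)
    for cropped in ds:
        print(cropped.shape)
```

In the augment function above, this would take the place of the fixed (178, 178, 3) crop; the resize to (256, 256) that follows then gives every image the same shape for batching.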