tensorflow: Cannot download celeb_a dataset from tensorflow_datasets :(

Running this

(train_data, test_data), info = tfds.load(name = 'celeb_a', split = ['train', 'test'], as_supervised = True, shuffle_files = True, with_info = True)

Gives this

NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0B7EVK8r0v71pZjFTYXZWM3FlRnM, downloaded to /root/tensorflow_datasets/downloads/ucexport_download_id_0B7EVK8r0v71pZjFTYXZWM3FlDDaXUAQO8EGH_a7VqGNLRtW52mva1LzDrb-V723OQN8.tmp.4ec0de7ede1541dca88a21190e298882/uc, has wrong checksum.

As far as I know, this happens because the dataset is hosted on Google Drive.
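To see what the checksum check is reacting to, one can hash the downloaded file and look at its first bytes. This is only a rough sketch: it assumes (not confirmed in this thread) that Google Drive sometimes serves an HTML page in place of the archive, and `inspect_download` is a hypothetical helper, not part of TFDS.

```python
import hashlib

def inspect_download(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, looks_like_html) for the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Keep the first bytes around so we can guess the content type.
        head = f.read(512)
        digest.update(head)
        # Hash the rest in chunks to avoid loading a large archive into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    stripped = head.lstrip().lower()
    # Assumption: an HTML response starts with a doctype or <html> tag.
    looks_like_html = stripped.startswith((b"<!doctype html", b"<html"))
    return digest.hexdigest(), looks_like_html
```

If the second value is True, the file in the downloads directory is probably a web page rather than the zip archive, which would explain a hash mismatch.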

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 17 (1 by maintainers)

Most upvoted comments

You can download from source if you’re getting the same problem.

I think I have solved this problem (tested with TFDS v4.9.2). First, download the source code; I downloaded the whole celeb_a directory. Line 31 of celeb_a_dataset_builder.py is import tensorflow_datasets.public_api as tfds; I changed it to import tensorflow_datasets as tfds. Next, build the dataset manually: open a terminal, cd to the celeb_a directory where you downloaded the source code, and run the command tfds build celeb_a. Surprisingly, the dataset is then downloaded automatically : )

Posting an alternative solution here since (1) automatically downloading CelebA still doesn’t work, and (2) I found the above solution too manual. This solution (tested with TFDS v3.2.1) simply overrides a method of the CelebA dataset builder to use manually downloaded data.

Step 1. Manually download data. This should get you the files {DATA_DIR}/img_align_celeba.zip, {DATA_DIR}/list_eval_partition.txt, {DATA_DIR}/list_landmarks_align_celeba.txt, and {DATA_DIR}/list_attr_celeba.txt. The download links are listed in the source code for the CelebA dataset builder.
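Before overriding the builder, it may help to confirm that the four files named in Step 1 are actually in place. A minimal sketch, where `missing_celeba_files` is a hypothetical helper (not part of TFDS) and the file names are the ones listed in Step 1:

```python
import os

# File names taken from Step 1.
EXPECTED_FILES = (
    "img_align_celeba.zip",
    "list_eval_partition.txt",
    "list_landmarks_align_celeba.txt",
    "list_attr_celeba.txt",
)

def missing_celeba_files(data_dir):
    """Return the expected file names that are not present in data_dir."""
    # Expand "~" so a path like "~/my_data_dir" resolves to the home directory.
    data_dir = os.path.expanduser(data_dir)
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
```

An empty list from `missing_celeba_files("~/my_data_dir")` would mean all four files were found.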

Step 2. Override the _split_generators function of the CelebA builder class. Then call download_and_prepare. Full code is provided below.

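import os

import tensorflow_datasets as tfds

# Expand "~" explicitly: plain string concatenation neither expands it
# nor inserts a path separator between the directory and the file name.
DATA_DIR = os.path.expanduser("~/my_data_dir")

class CelebA(tfds.image.celeba.CelebA):
  def _split_generators(self, dl_manager):
    # Point at the manually downloaded files instead of downloading them.
    downloaded_dirs = {
      "img_align_celeba": os.path.join(DATA_DIR, "img_align_celeba.zip"),
      "list_eval_partition": os.path.join(DATA_DIR, "list_eval_partition.txt"),
      "list_attr_celeba": os.path.join(DATA_DIR, "list_attr_celeba.txt"),
      "landmarks_celeba": os.path.join(DATA_DIR, "list_landmarks_align_celeba.txt"),
    }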

    # Load all images in memory (~1 GiB)
    # Use split to convert: `img_align_celeba/000005.jpg` -> `000005.jpg`
    all_images = {
        os.path.split(k)[-1]: img
        for k, img in dl_manager.iter_archive(
            downloaded_dirs["img_align_celeba"]
        )
    }

    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                "file_id": 0,
                "downloaded_dirs": downloaded_dirs,
                "downloaded_images": all_images,
            },
        ),
        tfds.core.SplitGenerator(
            name=tfds.Split.VALIDATION,
            gen_kwargs={
                "file_id": 1,
                "downloaded_dirs": downloaded_dirs,
                "downloaded_images": all_images,
            },
        ),
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={
                "file_id": 2,
                "downloaded_dirs": downloaded_dirs,
                "downloaded_images": all_images,
            },
        ),
    ]

builder = CelebA()
builder.download_and_prepare()

You can now call tfds.load('celeb_a') or builder = tfds.builder('celeb_a'); builder.download_and_prepare() and reuse the prepared dataset.

Hi there, I think there is a way and I’ve managed to do what I wanted. Here’s my code:-

import glob
import os

import pandas as pd
import tensorflow as tf

image_paths = sorted(glob.glob(os.path.join('dataset', 'img_align_celeba', 'img_align_celeba', '*.jpg')))

df = pd.read_csv('dataset/list_attr_celeba.csv')
df.replace(to_replace = -1, value = 0, inplace = True)
labels = df.iloc[:, 1:].values

print(image_paths[:2])
print(labels[:2])
# prints
'''
['dataset/img_align_celeba/img_align_celeba/000001.jpg', 'dataset/img_align_celeba/img_align_celeba/000002.jpg']
[[0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 1 0
  1 0 0 1]
 [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 1]]
'''

# here's the tensorflow part...
@tf.function
def read_image(image_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_image(image, channels = 3, dtype = tf.float32)
    return image

@tf.function
def normalize(image):
    image = (image - tf.reduce_min(image))/(tf.reduce_max(image) - tf.reduce_min(image))
    image = (2 * image) - 1
    return image

@tf.function
def augment(image):
    image = tf.image.random_crop(image, (178, 178, 3))
    image = tf.image.resize(image, (256, 256))
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_saturation(image, 0.5, 2.0)
    image = tf.image.random_brightness(image, 0.5)
    return image

@tf.function
def preprocess(image_path, label):
    image = read_image(image_path)
    image = augment(image)
    image = normalize(image)
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
dataset = dataset.map(preprocess, num_parallel_calls = tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(buffer_size = 1024)
dataset = dataset.batch(batch_size = 128)
dataset = dataset.prefetch(buffer_size = tf.data.experimental.AUTOTUNE)

for x, y in dataset:
    break

print(x.shape, y.shape)
# prints
# (128, 256, 256, 3) (128, 40)

With this code, everything is working fine for now. Please do share your opinion on reading the data this way, in terms of performance and memory, and any modifications I should make. And if everything is okay, please add this example to the official TensorFlow documentation; it’d be very helpful for others as well.

There is one thing I’d still like to achieve: inside the augment function, I can’t get the dimensions of the image. If I run

print(min(image.shape[:-1])) # :-1 to ignore the channels

inside the augment function to get the minimum of height and width, it gives me <unknown>. I want to replace this line

image = tf.image.random_crop(image, (178, 178, 3))

with this

min_dim = min(image.shape[:-1])
image = tf.image.random_crop(image, (min_dim, min_dim, 3))

I tried removing the @tf.function decorator, but it still doesn’t work.

So is there a way I can get this last part done?

Thanks for your time.
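One possible approach to the question above, offered as a sketch rather than a confirmed answer from this thread: image.shape is the static shape, which is unknown here because tf.image.decode_image does not set one, and dataset.map traces the function as a graph whether or not @tf.function is present. tf.shape returns the runtime shape as a tensor instead. `random_square_crop` is a hypothetical helper name, and the code assumes 3-channel images as requested via channels = 3.

```python
import tensorflow as tf

def random_square_crop(image):
    """Randomly crop an HxWx3 image to a square with side min(H, W)."""
    # tf.shape gives the runtime shape as a tensor, so it works even when
    # the static shape (image.shape) is unknown inside dataset.map.
    shape = tf.shape(image)
    min_dim = tf.minimum(shape[0], shape[1])
    # Assumes 3 channels, matching channels = 3 used when decoding.
    crop_size = tf.stack([min_dim, min_dim, 3])
    return tf.image.random_crop(image, crop_size)
```

In the augment function, the fixed-size crop line could then be swapped for image = random_square_crop(image). The later tf.image.resize call should still work, since the cropped tensor has a known rank even though its height and width are only known at run time.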