datasets: Corrupt files in the `cats_vs_dogs` dataset

Short description

I encountered this bug during my TensorFlow certification exam. When working with images from the dataset, the message "Corrupt JPEG data: 228 extraneous bytes before marker 0xd9" is printed again and again, and a single pass over the data takes forever. I couldn't complete my exam because of it.

Environment information

  • Operating System: Windows 10
  • Python version: 3.7.4
  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets version 3.1.0
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow-gpu version 2.2.0

Reproduction instructions

A very simple way to reproduce the bug:

import tensorflow_datasets as tfds

dataset_name = 'cats_vs_dogs'
dataset, info = tfds.load(name=dataset_name, 
                          split=tfds.Split.TRAIN,
                          with_info=True)

for i in dataset:
    print(i)

Expected behavior

I expect to be able to iterate over all the images without errors and without a single pass over the data taking forever.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 24 (6 by maintainers)

Most upvoted comments

I have seen a similar error in the JPEG reading functions of several libraries, not just TensorFlow, so I think this is an error in the underlying image decoding library. You can get around this issue by decoding the JPEG images, re-encoding them, and writing them back to disk. It’s an expensive operation, but you should only need to do it once.

I adapted the image removal function provided for the dataset. On my machine, this fixed the Corrupt JPEG error. Note also that my directory name is “data/cats_dogs”, which is different from the default directory name.

import os
import tensorflow as tf
from tensorflow.io import read_file, write_file
from tensorflow.image import decode_image

should_rewrite_image = True  # set to True if you are getting the Corrupt JPEG data error
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join('data/cats_dogs', folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        is_jfif = True
        should_remove = False
        
        with open(fpath, "rb") as fobj:
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
            
        try:
            img = read_file(fpath)
            if not tf.io.is_jpeg(img):
                should_remove = True
                
            img = decode_image(img)

            if img.ndim != 3:
                should_remove = True

        except Exception as e:
            should_remove = True
        
        if (not is_jfif) or should_remove:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
        elif should_rewrite_image:
            tmp = tf.io.encode_jpeg(img)
            write_file(fpath, tmp)

print("Deleted %d images" % num_skipped)

Hope this helps others as a workaround.
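The `is_jfif` check in the script above can be tried on its own, without TensorFlow. The following is a minimal sketch using only the standard library; the function name `has_jfif_marker` and the `n_bytes` parameter are made up for illustration and are not part of the dataset's tooling. It uses `read` rather than `peek`, because `peek` on a buffered reader may return more bytes than requested, so this version looks at exactly the first `n_bytes` bytes.

```python
def has_jfif_marker(path, n_bytes=10):
    """Return True if b"JFIF" appears within the first n_bytes of the file.

    Hypothetical helper: in a JFIF file the identifier normally sits
    right after the SOI marker and the APP0 segment header, so it falls
    inside the first 10 bytes.
    """
    with open(path, "rb") as fobj:
        return b"JFIF" in fobj.read(n_bytes)


# Example usage (the path is an assumption, adjust it to your layout):
# print(has_jfif_marker("data/cats_dogs/Cat/0.jpg"))
```

Files that fail this check are not necessarily corrupt (for example, JPEGs that only carry an Exif header), so it is best treated as a cheap pre-filter rather than a full validity test.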

I came across the same problem too. I had downloaded the dataset from Kaggle and tried running it on my local machine, but when I called model.fit() the training stopped with an error.

My solution was to write code that tries to open each file and removes it if any error occurs. If the decoded image does not have 3 dimensions (height, width, channels), I remove the file as well. After running this code on the dataset, I was able to train the model without any issues.

My code:

from pathlib import Path
from tensorflow.io import read_file
from tensorflow.image import decode_image

# data_dir is a Path pointing to the parent directory,
# which contains the directories 'Dog' and 'Cat'
for folder in ('Cat', 'Dog'):
    for image in sorted((data_dir / folder).glob('*')):
        try:
            img = decode_image(read_file(str(image)))
        except Exception as e:
            # The file could not be read or decoded
            print(f"[ERR] {image.name}: {e} DELETED")
            image.unlink()
            continue

        # Keep only images with 3 dimensions (height, width, channels)
        if img.ndim != 3:
            print(f"[FILE_CORRUPT] {image.name} DELETED")
            image.unlink()
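Since both scripts delete files in place, it can be useful to list the candidates before removing anything. Below is a rough sketch of that idea, not something from the comments above: `find_bad_files`, `delete_files`, and the `is_valid` callback are hypothetical names, and the callback is where a TensorFlow decode check like the one above could be plugged in.

```python
from pathlib import Path


def find_bad_files(folder, is_valid):
    """Return the files in `folder` for which is_valid(path) is False or raises."""
    bad = []
    for path in sorted(Path(folder).glob("*")):
        if not path.is_file():
            continue
        try:
            ok = is_valid(path)
        except Exception:
            # Treat any read/decode error as a failed check
            ok = False
        if not ok:
            bad.append(path)
    return bad


def delete_files(paths, dry_run=True):
    """Print what would be deleted; only delete when dry_run is False."""
    for path in paths:
        if dry_run:
            print(f"[DRY RUN] would delete {path.name}")
        else:
            path.unlink()
            print(f"[DELETED] {path.name}")
```

A first pass with `dry_run=True` shows how many files would be lost, which is a cheap safeguard if the check turns out to be too strict for your copy of the data.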