datasets: Corrupt files in the `cats_vs_dogs` dataset
Short description
I encountered this bug during my TensorFlow certification exam. When working with images from the dataset, the message `Corrupt JPEG data: 228 extraneous bytes before marker 0xd9` is printed again and again, and a single pass over the data takes forever. I couldn’t complete my exam because of it.
Environment information
- Operating System: Windows 10
- Python version: 3.7.4
- `tensorflow-datasets`/`tfds-nightly` version: `tensorflow-datasets` version 3.1.0
- `tensorflow`/`tensorflow-gpu`/`tf-nightly`/`tf-nightly-gpu` version: `tensorflow-gpu` version 2.2.0
Reproduction instructions
A very simple way to reproduce the bug:
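import tensorflow_datasets as tfds

dataset_name = 'cats_vs_dogs'
dataset, info = tfds.load(name=dataset_name,
                          split=tfds.Split.TRAIN,
                          with_info=True)
# A single pass over the dataset is enough to trigger the messages.
for i in dataset:
    print(i)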
Expected behavior
I expect to be able to iterate over all the images without getting errors and without a single pass taking forever to complete.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 24 (6 by maintainers)
I have seen a similar error in the JPEG reading functions of several libraries, not just TensorFlow, so I think it comes from the underlying image decoding library. You can get around the issue by decoding each JPEG image and writing it back out re-encoded. It’s an expensive operation, but you should only need to do it once.
I adapted the image removal function provided for the dataset. On my machine, this fixed the Corrupt JPEG error. Note also that my directory name is “data/cats_dogs”, which is different from the default directory name.
Hope this helps others as a workaround.
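The snippet from this comment is not reproduced here. As a rough illustration of the re-encoding idea only, here is a minimal sketch that assumes Pillow is installed and that the images are `.jpg` files on disk. The function name `reencode_jpegs` and the `quality` setting are made up for this example and are not taken from the comment.

```python
from pathlib import Path

from PIL import Image  # Pillow: an assumption, the comment does not name a library


def reencode_jpegs(root, quality=95):
    """Decode every .jpg under `root` and write it back as a fresh JPEG.

    Returns two lists of paths: files that were rewritten and files that
    could not be decoded at all (those are left untouched).
    """
    rewritten, failed = [], []
    for path in sorted(Path(root).rglob("*.jpg")):
        try:
            with Image.open(path) as img:
                # convert() forces a full decode and normalizes to 3 channels.
                rgb = img.convert("RGB")
        except Exception:  # deliberately broad: any decode failure is just reported
            failed.append(path)
            continue
        # The source file is closed by now, so overwriting it in place is safe.
        rgb.save(path, format="JPEG", quality=quality)
        rewritten.append(path)
    return rewritten, failed
```

With the directory name mentioned above, a call would look like `rewritten, failed = reencode_jpegs("data/cats_dogs")`. Re-encoding is lossy and overwrites the originals, so it is worth running on a copy of the dataset first.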
I came across the same problem too. I had downloaded the dataset from Kaggle and tried running it on my local machine, but when I called `model.fit()` the training stopped with an error.
My solution was to write code that tries to open each file and removes the file if any error occurs. I also remove the file if the number of channels (or dimensions) in the image is not 3 (red, green, and blue). After running this code on the dataset, I was able to train the model without any issues.
My code:
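The commenter's original snippet is not reproduced here. A minimal sketch of the approach described above, assuming Pillow is installed, could look like the following. The function name `remove_bad_images` is hypothetical, and this is not the commenter's actual code.

```python
import os

from PIL import Image  # Pillow: an assumption, the comment does not name a library


def remove_bad_images(root):
    """Delete files under `root` that fail to decode or lack exactly 3 channels.

    Returns the list of deleted paths.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with Image.open(path) as img:
                    img.load()  # open() only reads the header; load() decodes the pixels
                    channels = len(img.getbands())
            except Exception:  # deliberately broad: any failure marks the file as bad
                channels = None
            if channels != 3:
                os.remove(path)
                removed.append(path)
    return removed
```

Note that this sketch deletes every file it cannot decode, including non-image files that happen to sit in the same folders, so it should only be pointed at a directory that is meant to contain images and nothing else.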