vision: ImageFolder(root) raises error if root contains empty subfolders
🐛 Describe the bug
When using `torchvision.datasets.ImageFolder(root)` on a root that contains a subfolder without images, a `FileNotFoundError` is raised.
Example code (works in Google Colab):

```python
!git clone --depth 1 https://github.com/alexeygrigorev/clothing-dataset-small clothing_dataset_small
!mkdir clothing_dataset_small/empty_subfolder_XYZ
print("############################### Found the following subfolders:")
!ls clothing_dataset_small
print("############################### Trying to create an ImageFolder...")
import torchvision
torchvision.datasets.ImageFolder('clothing_dataset_small')
```
Output of example code:

```
Cloning into 'clothing_dataset_small'...
remote: Enumerating objects: 3818, done.
remote: Counting objects: 100% (3818/3818), done.
remote: Compressing objects: 100% (3818/3818), done.
remote: Total 3818 (delta 0), reused 3815 (delta 0), pack-reused 0
Receiving objects: 100% (3818/3818), 100.57 MiB | 36.22 MiB/s, done.
############################### Found the following subfolders:
empty_subfolder_XYZ  LICENSE  README.md  test  train  validation
############################### Trying to create an ImageFolder...
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-d8d3727bcc2a> in <module>()
      5 print("############################### Trying to create an ImageFolder...")
      6 import torchvision
----> 7 torchvision.datasets.ImageFolder('clothing_dataset_small')

3 frames
/usr/local/lib/python3.7/dist-packages/torchvision/datasets/folder.py in make_dataset(directory, class_to_idx, extensions, is_valid_file)
    100         if extensions is not None:
    101             msg += f"Supported extensions are: {', '.join(extensions)}"
--> 102         raise FileNotFoundError(msg)
    103
    104     return instances

FileNotFoundError: Found no valid file for the classes .git, empty_subfolder_XYZ. Supported extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp
```
Source of error
https://github.com/pytorch/vision/blob/22ff44fd14139f2a056ad52b9bd109bd958089f3/torchvision/datasets/folder.py#L97-L102

As this check was introduced by @pmeier, maybe he can describe the intuition behind it.
Versions
```
Collecting environment information...
PyTorch version: 1.10.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.0+cu111
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.11.0
[pip3] torchvision==0.11.1+cu111
[conda] Could not collect
```
cc @pmeier
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 30 (12 by maintainers)
You have convinced me that I should think of an `ImageFolder` differently: it should not be any folder containing some subdirectories with images, but rather a curated folder with exactly one subdirectory for each class and at least one valid file for each class. Thanks for making this clearer to me.

Hey @MalteEbner. IIRC, the motivation behind this is to avoid subtle errors. Otherwise you might get a "label gap", because a directory is recognized as a category but has no samples. Take this setup:
Removing the check and running
now prints
Although you have only two categories, the model handling the data now needs to handle three categories.
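The "label gap" above can be illustrated with a minimal sketch of the class-discovery step (a simplified stand-in for torchvision's `find_classes`, not the actual implementation): if every subdirectory becomes a class, an empty directory still claims a label index.

```python
import os
import tempfile

def naive_find_classes(directory):
    """Map every subdirectory to a class index, like a check-free find_classes."""
    classes = sorted(e.name for e in os.scandir(directory) if e.is_dir())
    return {cls: idx for idx, cls in enumerate(classes)}

root = tempfile.mkdtemp()
for name in ("cat", "dog", "empty_subfolder"):
    os.makedirs(os.path.join(root, name))
# Only "cat" and "dog" get sample files; "empty_subfolder" stays empty.
for name in ("cat", "dog"):
    open(os.path.join(root, name, "img1.jpg"), "w").close()

class_to_idx = naive_find_classes(root)
print(class_to_idx)  # {'cat': 0, 'dog': 1, 'empty_subfolder': 2}
# Three labels exist even though only two classes have samples:
# a model sized from len(class_to_idx) would carry a dead output unit.
```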
Hello, I am wondering if you would consider reopening this issue, but for a different use case.
Basically, if I understand the discussion correctly, there was a request to allow empty subdirectories inside an image folder, because they may contain something else (e.g. some metadata). In the end it was decided that this is not helpful, because a directory of image data should be carefully curated and not have those kinds of folders.
However, I have a different use case. I am working with the iWildcam dataset (https://wilds.stanford.edu/), but repackaged in the ImageFolder format. This dataset has the peculiarity that not all classes are present in all splits. So, if I load the test data with ImageFolder and do the same with validation data, the class folder-to-class-id mappings won’t line up.
A simple solution here would be to simply have empty class folders for classes not present in some of the splits. But this fails because of the exact safeguards discussed in this thread. It seems to me that an `--allow-empty` argument would be a (generally) helpful and elegant solution for use cases like mine.

(I understand that the `find_classes` function can be overridden for this use case, but I agree with the previous commenter who said that this renders the whole `ImageFolder` convenience rather moot, plus it makes it harder to plug datasets like iWildcam into existing pipelines.)
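One way to line up labels across splits without empty folders (a hedged sketch, not torchvision API; the helper name is made up) is to derive a single mapping from the union of classes seen in all splits and reuse it everywhere:

```python
import os
import tempfile

def shared_class_to_idx(*split_dirs):
    """Build one class->index mapping from the union of subdirectories
    across all splits, so indices agree even when a split is missing
    some classes."""
    classes = set()
    for split in split_dirs:
        classes.update(e.name for e in os.scandir(split) if e.is_dir())
    return {cls: idx for idx, cls in enumerate(sorted(classes))}

root = tempfile.mkdtemp()
# "lion" appears only in train, not in validation.
for split, names in {"train": ("deer", "lion"), "validation": ("deer",)}.items():
    for name in names:
        os.makedirs(os.path.join(root, split, name))

mapping = shared_class_to_idx(os.path.join(root, "train"),
                              os.path.join(root, "validation"))
print(mapping)  # {'deer': 0, 'lion': 1} for both splits
```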
Thank you @AdeelH. You're right that if we assumed that every single folder has to be a class folder, then we could just get rid of the `raise` and avoid introducing a new parameter.

One thing that is fairly clear from this thread, however, is that there are as many different use cases as there are users (I'm exaggerating a bit). In other words, if we were to make that assumption, we'd be leaving some users on the side and they'd come and open an issue for their use cases to be supported.
That’s why I prefer going for the option to introduce a new parameter here: it may not provide the absolute best UX across all identified use-cases, but it does support them all in a not too terrible way. And at this point, it seems like the best trade-off to make.
This thread is quite long and to try to make things a bit clearer I’ll close it and I’ve opened https://github.com/pytorch/vision/issues/8297 to keep track of the progress on that task. We’re planning to support that for the next 0.18 release. Thanks all for your input.
By a strange coincidence, I came here to make the same request as @ohaijen on this years-old issue today.
Assuming that all classes will have training data at all times seems too strong an assumption to me. Other than the case that @ohaijen mentioned above, it might also be the case that a model is being fine-tuned on only a subset of the classes, without changing its architecture.
I'm currently using the workaround of detecting empty folders beforehand and overriding `DatasetFolder.find_classes()`, but this behavior is not very intuitive to me. In general, it should be okay for there to be 0 samples from a class.

Though I understand that there are cases where a dataset might contain empty and non-image folders (Python environments, version control, OS-specific meta-files/folders, Python notebook checkpoints, etc.), having these folders in there violates the basic assumption of the API, which is that we can scan the existing folders to fetch the labels. Yes, it's possible to create a complex solution such as the ones described here to handle corner cases and ensure we don't introduce a bug. Nevertheless, it might also be worth considering that the end user might just have to clean up the dataset directory prior to loading it, OR subclass our class to apply custom filtering logic.
After giving it a little more thought, it seems to me that there are 3 kinds of subdirectories: A) empty, B) with files but without valid files (e.g. a `.git` folder), C) with valid files (e.g. with `img1.jpg`, `img2.jpg`, …).

Solution 4 proposes to:
- raise a `FileNotFoundError` if any subdirectory of type A exists;
- build `class_to_idx` only from subdirectories of type C;
- as @pmeier pointed out, if `class_to_idx` is passed explicitly, check that all `class_to_idx.keys()` match a valid subdirectory of type C.

This could be implemented as follows (partly pseudocode, should convey the main idea):
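The original snippet is not preserved in this excerpt; a self-contained sketch along the lines of the proposal (my reconstruction, not the actual torchvision code, and non-recursive for brevity) might look like:

```python
import os
import tempfile

VALID_EXTENSIONS = (".jpg", ".jpeg", ".png", ".ppm", ".bmp",
                    ".pgm", ".tif", ".tiff", ".webp")

def classify_subdirectory(path):
    """Return 'A' (empty), 'B' (entries but no valid files),
    or 'C' (at least one valid file). Non-recursive for brevity;
    torchvision's real make_dataset walks subdirectories too."""
    names = os.listdir(path)
    if not names:
        return "A"
    if any(n.lower().endswith(VALID_EXTENSIONS) for n in names):
        return "C"
    return "B"

def find_classes(directory, class_to_idx=None):
    """Build (or validate) a class mapping from type-C directories only."""
    kinds = {e.name: classify_subdirectory(e.path)
             for e in os.scandir(directory) if e.is_dir()}
    empty = sorted(n for n, k in kinds.items() if k == "A")
    if empty:
        raise FileNotFoundError(f"Found empty class directories: {empty}")
    valid = sorted(n for n, k in kinds.items() if k == "C")
    if class_to_idx is not None:
        missing = sorted(set(class_to_idx) - set(valid))
        if missing:
            raise FileNotFoundError(f"Found no valid file for the classes {missing}")
        return class_to_idx
    return {cls: i for i, cls in enumerate(valid)}

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "train", "cat"))
os.makedirs(os.path.join(root, "train", ".git"))
open(os.path.join(root, "train", "cat", "img1.jpg"), "w").close()
open(os.path.join(root, "train", ".git", "config"), "w").close()

# ".git" is type B and is silently skipped; "cat" is type C.
print(find_classes(os.path.join(root, "train")))  # {'cat': 0}
```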
What do you think about this kind of proposed implementation, @NicolasHug @pmeier? If you give your OK, I will start the implementation and create a PR.

I think this will fulfil the following requirements:
1. `class_to_idx` still has continuous ids starting from 0 and at least one valid file per class.
2. `class_to_idx` is respected if passed explicitly.

Apart from 1, I would be OK with
This allows more flexibility for the user while maintaining BC, avoids silent bugs, and does not add any new API surface.
Instead of adding a new parameter, wouldn't it be possible to simply override the `find_classes()` method to ignore empty dirs?

We made this public precisely to handle such custom use cases and avoid bloating the API with lots of specific parameters.
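A minimal version of that override, written as a standalone function for illustration (the function name is made up; in practice you would return this from a `DatasetFolder` subclass's `find_classes` method, whose contract is to return `(classes, class_to_idx)`):

```python
import os
import tempfile

def find_classes_skip_empty(directory):
    """Like find_classes, but silently ignore subdirectories that
    contain no entries at all. Returns (classes, class_to_idx)."""
    classes = sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_dir() and any(os.scandir(entry.path))  # skip empty dirs
    )
    return classes, {cls: idx for idx, cls in enumerate(classes)}

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "shirts"))
os.makedirs(os.path.join(root, "empty_subfolder_XYZ"))
open(os.path.join(root, "shirts", "img1.jpg"), "w").close()

print(find_classes_skip_empty(root))  # (['shirts'], {'shirts': 0})
```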
Yes, I will create a PR with `ignore_empty`. Can you assign the issue to me, @pmeier?