vision: ImageFolder(root) raises error if root contains empty subfolders

🐛 Describe the bug

When using torchvision.datasets.ImageFolder(root) on a root with a subfolder not containing images, an error is thrown.

Example code (works in google colab):

!git clone --depth 1  https://github.com/alexeygrigorev/clothing-dataset-small clothing_dataset_small
!mkdir clothing_dataset_small/empty_subfolder_XYZ
print("############################### Found the following subfolders:")
!ls clothing_dataset_small
print("############################### Trying to create an ImageFolder...")
import torchvision
torchvision.datasets.ImageFolder('clothing_dataset_small')

Output of example code

Cloning into 'clothing_dataset_small'...
remote: Enumerating objects: 3818, done.
remote: Counting objects: 100% (3818/3818), done.
remote: Compressing objects: 100% (3818/3818), done.
remote: Total 3818 (delta 0), reused 3815 (delta 0), pack-reused 0
Receiving objects: 100% (3818/3818), 100.57 MiB | 36.22 MiB/s, done.
############################### Found the following subfolders:
empty_subfolder_XYZ  LICENSE  README.md  test  train  validation
############################### Trying to create an ImageFolder...
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-d8d3727bcc2a> in <module>()
      5 print("############################### Trying to create an ImageFolder...")
      6 import torchvision
----> 7 torchvision.datasets.ImageFolder('clothing_dataset_small')

3 frames
/usr/local/lib/python3.7/dist-packages/torchvision/datasets/folder.py in make_dataset(directory, class_to_idx, extensions, is_valid_file)
    100         if extensions is not None:
    101             msg += f"Supported extensions are: {', '.join(extensions)}"
--> 102         raise FileNotFoundError(msg)
    103 
    104     return instances

FileNotFoundError: Found no valid file for the classes .git, empty_subfolder_XYZ. Supported extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp

Source of error

https://github.com/pytorch/vision/blob/22ff44fd14139f2a056ad52b9bd109bd958089f3/torchvision/datasets/folder.py#L97-L102 As this check was introduced by @pmeier maybe he can describe the intuitions for this.

Versions

Collecting environment information… PyTorch version: 1.10.0+cu111 Is debug build: False CUDA used to build PyTorch: 11.1 ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64) GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final) CMake version: version 3.12.0 Libc version: glibc-2.26

Python version: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0] (64-bit runtime) Python platform: Linux-5.4.104±x86_64-with-Ubuntu-18.04-bionic Is CUDA available: False CUDA runtime version: 11.1.105 GPU models and configuration: Could not collect Nvidia driver version: Could not collect cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5 /usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5 HIP runtime version: N/A MIOpen runtime version: N/A

Versions of relevant libraries: [pip3] numpy==1.19.5 [pip3] torch==1.10.0+cu111 [pip3] torchsummary==1.5.1 [pip3] torchtext==0.11.0 [pip3] torchvision==0.11.1+cu111 [conda] Could not collect

cc @pmeier

About this issue

Original URL
State: closed
Created 3 years ago
Comments: 30 (12 by maintainers)

Most upvoted comments

You have convinced me that I should think of an ImageFolder differently: It should not be any folder containing some subdirectories with images, but rather a curated folder with exactly one subdirectory for each class and at least one valid file for each class. Thanks for making this more clear to me.

MalteEbner on Nov 16, 2021

Hey @MalteEbner. IIRC, the motivation behind this is to avoid subtle errors. Otherwise you might get a “label gap”, because a directory is recognized as category, but has no samples. Take this setup:

dataset
├── a
│   └── a.png
├── b
└── c
    └── c.png

Removing the check and running

from torchvision.datasets.folder import ImageFolder
import pathlib

dataset = ImageFolder("dataset", loader=lambda path: pathlib.Path(path).name)

for path, category in dataset:
    print(path, category)

now prints

a.png 0
c.png 2

Although you have only two categories, the model handling the data now needs to handle three categories.

pmeier on Nov 12, 2021

Hello, I am wondering if you would consider reopening this issue, but for a different use case.

Basically, if I understand the discussion correctly, there was a request to allow subdirectories of a folder containing images that are empty, because they may contain something else (e.g. some metadata). In the end it was decided that this is not helpful, because a directory of image data should be carefully curated and not have those kinds of folders.

However, I have a different use case. I am working with the iWildcam dataset (https://wilds.stanford.edu/), but repackaged in the ImageFolder format. This dataset has the peculiarity that not all classes are present in all splits. So, if I load the test data with ImageFolder and do the same with validation data, the class folder-to-class-id mappings won’t line up.

A simple solution here would be to simply have empty class folders for classes not present in some of the splits. But this fails because of the exact safeguards discussed in this thread. It seems to me that an --allow-empty argument would be a (generally) helpful and elegant solution for use cases like mine.

(I understand that the find_classes function can be overridden for this use case, but I agree with the previous commenter who said that this renders the whole ImageFolder convenience rather moot, plus it makes it harder to plug datasets like iWildcam into existing pipelines.)

ohaijen on Feb 12, 2024

Thank you @AdeelH . You’re right that if we assumed that every single folder has to be a class folder, then we could just get rid of the raise and avoid introducing a new parameter.

One thing that is fairly clear from this thread however, is that there are as many different use-cases as there are users (I’m exaggerating a bit). In other words, if we were to make that assumption, we’d be leaving some users on the side and they’d come and open an issue for their use-cases to be supported.

That’s why I prefer going for the option to introduce a new parameter here: it may not provide the absolute best UX across all identified use-cases, but it does support them all in a not too terrible way. And at this point, it seems like the best trade-off to make.

This thread is quite long and to try to make things a bit clearer I’ll close it and I’ve opened https://github.com/pytorch/vision/issues/8297 to keep track of the progress on that task. We’re planning to support that for the next 0.18 release. Thanks all for your input.

NicolasHug on Mar 5, 2024

By a strange coincidence, I came here to make the same request as @ohaijen on this years-old issue today.

Assuming that all classes will have training data at all times seems too strong an assumption to me. Other than the case that @ohaijen mentioned above, it might also be the case that a model is being fine-tuned on only a subset of the classes, without changing its architecture.

I’m currently using the workaround of detecting empty folders beforehand and overriding DatasetFolder.find_classes(), but this behavior is not very intuitive to me. In general, it should be okay for there to be 0 samples from a class.

AdeelH on Feb 12, 2024

Though I understand that there are cases where a dataset might contain empty and non-image folders (Python Environments, Version control, OS specific meta-files/folders, Python notebook checkpoints etc), having these folders in there violates the basic assumption of the API which is that we can scan the existing folders to fetch the labels. Yes it’s possible to create a complex solution such as the ones described here to handle corner-cases and ensure we don’t introduce a bug. Nevertheless it might also be worth considering that the end user might just have to clean up the dataset directory prior loading it OR subclassing our class to apply custom filtering logic.

datumbox on Nov 16, 2021

After giving it a little more though it seems to me that there are 3 kinds of subdirectories: A) empty B) with files, but without valid files (e.g. a .git folder) C) with valid files (e.g. with img1.jpg, img2.jpg …)

Solution 4 proposes to

raise the FileNotFoundError if any subdirectory of type a) exists.
ignore all subdirectories of type B
create the dataset with the class_to_idx only from subdirectories of type C

As @pmeier pointed out, if class_to_idx is passed explicitly, it should be checked if all class_to_idx.keys() match a valid subdirectory of type C.

This could be implemented as follows (partly pseudocode, should convey the main idea):

empty_directores = []
non_empty_directories = []
valid_directories = []
for entry in os.scandir(directory):
  if entry.is_dir():
   class_name = entry.name
    subdirectory = os.path.join(directory, class_name)
    directory_type = "empty" # assume the subdirectory is empty, type A
    for root, _, fnames in sorted(os.walk(, followlinks=True)):
        for fname in sorted(fnames):
            directory_type = "non_empty" # the subdirectory has at least one file, type B
            path = os.path.join(root, fname)
            if is_valid_file(path):
                item = path, class_index
                instances.append(item)
                directory_type = "with_valid_files" # the subdirectory has valid files, type C
   if directory_type = "empty":
     empty_directories.append(class_name)
  elif directory_type = "non_empty":
     non_empty_directories.append(class_name)
   else:
     valid_directories.append(class_name)
     
if len(empty_directories) >  0:
  # raise error and name those directories, just as before

if class_to_idx is None:
  class_to_idx = {class: id for id, class in enumerate(valid_directories)}
else:
  classes_not_found = set(class_to_idx.keys()) - set(valid_directories)
   if classes_not_found:
       # raise ValueError, tell which classes are missing

What do you think about this kind of proposed implementation, @NicolasHug @pmeier ? If you give your ok, I will start the implementation and create a PR.

I think this will fulfil the following requirements:

Ensure that empty directories still cause an error.
Ensure that non-empty directories will be ignored (which is new)
Ensure that class_to_idx still has a continuous id starting from 0 and having at least one valid file per class.
Ensure that there is a valid file for every class in class_to_idx if passed explicitly.

MalteEbner on Nov 16, 2021

Apart from 1, I would be OK with

Allow sub-directories that do not contain images, without warning, and properly filter-out these folders to avoid the “label gap” issue that @pmeier mentioned https://github.com/pytorch/vision/issues/4925#issuecomment-966994668

This allows more flexibility for the user while maintaining BC, avoids silent bugs, and does not add any new API surface.

NicolasHug on Nov 15, 2021

Instead of adding a new parameter, wouldn’t it be possible to simply override the find_classes() method to ignore empty dirs?

We made this public precisely to handle such custom use-cases and avoid bloating the API with lots of specific parameter.

NicolasHug on Nov 15, 2021

Sure, that would be a good addition. I would prefer something like ignore_empty because it is a little shorter. Do you want to send a PR?

Yes, I will create a PR with ignore_empty. Can you assign the issue to me @pmeier ?

MalteEbner on Nov 12, 2021