cvat: [GSoC2024] Filenames with same name but different extensions cause error

Actions before raising this issue

  • I searched the existing issues and did not find anything similar.
  • I read/searched the docs

Steps to Reproduce

  1. Create a new task in CVAT.
  2. Upload images named image1.png and image1.jpg.
  3. Upload COCO annotations.
  4. The upload fails when the task contains both image1.jpg and image1.png, even if the two images are completely different:
Could not upload annotation for the [task 4](http://localhost:8080/tasks/4)

Item ('image1', 'default') is repeated in the source sequence..

COCO annotation json:

{
    "info": {
        "description": "my-project-name"
    },
    "images": [
        {
            "id": 1,
            "width": 1200,
            "height": 1600,
            "file_name": "image1.jpg"
        },
        {
            "id": 2,
            "width": 2592,
            "height": 1944,
            "file_name": "image1.png"
        }
    ],
    "annotations": [
        {
            "id": 0,
            "iscrowd": 0,
            "image_id": 1,
            "category_id": 1,
            "segmentation": [
                [
                    787.1904355251921,
                    419.47053800170795,
                    850.0426985482494,
                    710.5038428693424,
                    500.25619128949614,
                    639.4534585824082,
                    778.9923142613151,
                    420.8368915456874
                ]
            ],
            "bbox": [
                500.25619128949614,
                419.47053800170795,
                349.78650725875326,
                291.03330486763446
            ],
            "area": 49372.61940096601
        },
        {
            "id": 1,
            "iscrowd": 0,
            "image_id": 2,
            "category_id": 1,
            "segmentation": [
                [
                    1086.8249359521776,
                    424.99060631938517,
                    1204.6934244235697,
                    848.3210930828352,
                    885.9504696840307,
                    800.1776259607174
                ]
            ],
            "bbox": [
                885.9504696840307,
                424.99060631938517,
                318.74295473953896,
                423.33048676345004
            ],
            "area": 64629.50624142664
        }
    ],
    "categories": [
        {
            "id": 1,
            "name": "object"
        }
    ]
}

Removing an entry or trying other workarounds produces other errors. This really should not happen, because COCO supports file names that include the extension.
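To check an annotation file for this problem before uploading, a small script can group the COCO file names by their name without the extension. This is a minimal sketch: it assumes the importer keys each item by the file name minus its suffix, which is what the `('image1', 'default')` error message suggests. The helper name `find_stem_collisions` is hypothetical and not part of CVAT or Datumaro.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_stem_collisions(coco: dict) -> dict[str, list[str]]:
    """Group COCO file names by their name without the extension."""
    by_stem = defaultdict(list)
    for image in coco["images"]:
        name = image["file_name"]
        # Assumption: the importer identifies an item by the path minus its suffix.
        stem = str(PurePosixPath(name).with_suffix(""))
        by_stem[stem].append(name)
    # Keep only the stems shared by more than one file.
    return {stem: names for stem, names in by_stem.items() if len(names) > 1}


coco = {
    "images": [
        {"id": 1, "file_name": "image1.jpg"},
        {"id": 2, "file_name": "image1.png"},
        {"id": 3, "file_name": "image2.jpg"},
    ]
}
collisions = find_stem_collisions(coco)
print(collisions)
```

An empty result means no two files share a name once the extension is dropped.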

Expected Behavior

It should allow a person to upload files and COCO annotations even if the file name is the same, as long as the file extension is different.

Possible Solution

It should allow a person to upload files and COCO annotations even if file name is same, as long as the file extension is different

Context

I want to add that, in general, CVAT tends to abort an entire upload operation the moment it finds a single error. This results in a huge time loss just trying to debug CVAT errors (which technically shouldn’t even be errors, since this is a perfectly valid use of the COCO file format). If the COCO file annotates a few extra images, CVAT throws an error and aborts the operation instead of just annotating the images present in the job. If a single annotation has an error, it aborts the operation again. Instead, it should warn you and try to import as many annotations as possible, like Roboflow does.
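The tolerant behavior requested here could look roughly like the following. This is only a sketch of the idea, not how CVAT imports annotations: the function `keep_known_images` and the `task_files` set are hypothetical, and the code works on the plain COCO dictionary.

```python
import warnings


def keep_known_images(coco: dict, task_files: set[str]) -> dict:
    """Return a copy of the COCO dict limited to images present in the task."""
    kept_images = []
    for image in coco["images"]:
        if image["file_name"] in task_files:
            kept_images.append(image)
        else:
            # Warn and continue instead of aborting the whole import.
            warnings.warn(f"Skipping {image['file_name']}: not in the task")
    kept_ids = {image["id"] for image in kept_images}
    kept_annotations = [a for a in coco["annotations"] if a["image_id"] in kept_ids]
    return {**coco, "images": kept_images, "annotations": kept_annotations}


sample = {
    "images": [
        {"id": 1, "file_name": "image1.jpg"},
        {"id": 2, "file_name": "extra.jpg"},
    ],
    "annotations": [
        {"id": 0, "image_id": 1, "category_id": 1},
        {"id": 1, "image_id": 2, "category_id": 1},
    ],
}
filtered = keep_known_images(sample, {"image1.jpg"})
```

Annotations that point at a skipped image are dropped together with it, so the remaining file stays consistent.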

Environment

The CVAT website and a locally installed CVAT both show this issue.

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 26 (23 by maintainers)

Most upvoted comments

@adkbbx, answered in the PR.

@adkbbx, well, there are 2 more steps to implement, as I wrote in https://github.com/opencv/cvat/issues/7523#issuecomment-1988060365.

@adkbbx,

So I wanted to know is there a way I could iterate through the source file to update the image names with a uuid.

Basically, in the comment above https://github.com/opencv/cvat/issues/7523#issuecomment-1988634376 you already did this (the function update_annotation_file). I think it’s enough for updating the input file.

@adkbbx,

Each individual file should ideally have a unique ID, am I correct? Please let me know if my understanding is correct.

Yes, this is how it should be.

I was considering updating the image names using a UID, but I’m concerned that this might lead to errors during the mapping of annotations on the actual images if we don’t also update the image IDs. It’s possible there could be a logic error in the export function of the COCO dataset as well. I would greatly appreciate your insights on this matter.

I’m not sure I understand what you meant here. Speaking about exporting, I think it should work similarly - a mapping is created in what’s returned by this function, then the dataset is exported as usual, then files are mapped in the output jsons.
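For the export direction, the last step (mapping files in the output JSON) might be sketched as below. It assumes a mapping from the temporary names to the original names is available. `restore_file_names` and `new_to_original` are hypothetical names, and the sketch does not touch CVAT's actual export code.

```python
def restore_file_names(coco: dict, new_to_original: dict[str, str]) -> dict:
    """Put the original file names back into an exported COCO dict, in place."""
    for image in coco["images"]:
        current = image["file_name"]
        # Names missing from the mapping are left untouched.
        image["file_name"] = new_to_original.get(current, current)
    return coco


exported = {
    "images": [
        {"id": 1, "file_name": "0001.jpg"},
        {"id": 2, "file_name": "0002.png"},
    ]
}
mapping = {"0001.jpg": "image1.jpg", "0002.png": "image1.png"}
restore_file_names(exported, mapping)
```

Image IDs are not changed, so annotations that refer to `image_id` stay attached to the right image.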

@adkbbx, probably, it will be more comfortable to review code in a PR, please create one.

where this script should be placed for optimal functionality without disrupting existing processes

I expect it to be some code in this and this file.

  1. Update the names of datasets in case duplicate names with different extensions are found: rename them and save the renames in a rename_mapping variable.

Probably, you can just replace all the file names; you don’t need to resolve only the repeated ones. Simply iterating over the JSON’s images list and updating the names in place, while remembering the new names, should be enough. The code you attached iterates over images in a directory, but when annotations are imported, you don’t have images in the input files.

Using such a pattern to resolve conflicts:

rename_mapping = {'image1.gif': 'image1_0.gif', 'image1.jpg': 'image1_1.jpg', 'image1.png': 'image1_2.png', 'image3.jpg': 'image3_0.jpg', 'image3.png': 'image3_1.png'}

can lead to new conflicts; that’s why just replacing all the names with something new and unique is better.
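A small example shows how the suffix pattern can collide with a name that is already in the dataset. The function `suffix_rename` is a hypothetical reconstruction of the pattern above (it renames only the names that repeat), not the code from the thread.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def suffix_rename(names: list[str]) -> dict[str, str]:
    """Rename repeated stems to stem_0, stem_1, ... (the pattern under discussion)."""
    groups = defaultdict(list)
    for name in names:
        groups[PurePosixPath(name).stem].append(name)
    mapping = {}
    for stem, group in groups.items():
        if len(group) < 2:
            continue  # unique stems are left alone
        for index, name in enumerate(sorted(group)):
            mapping[name] = f"{stem}_{index}{PurePosixPath(name).suffix}"
    return mapping


# "image1_0.gif" already exists, so renaming "image1.jpg" to "image1_0.jpg"
# produces a second item whose stem is "image1_0".
names = ["image1.jpg", "image1.png", "image1_0.gif"]
mapping = suffix_rename(names)
renamed = [mapping.get(name, name) for name in names]
stems = [PurePosixPath(name).stem for name in renamed]
print(renamed, len(set(stems)) < len(stems))
```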

Also, consider using Path and uuid.uuid4, plain numbers, or hashes in the implementation.
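A minimal sketch of the rename-everything approach, using `uuid.uuid4` and `pathlib`. The function name `rename_all_images` and the returned `new_to_original` mapping are assumptions made for illustration. Keeping the original extension on the new name is also a choice made here, not something the thread requires.

```python
import uuid
from pathlib import PurePosixPath


def rename_all_images(coco: dict) -> dict[str, str]:
    """Replace every file_name in place and return {new_name: original_name}."""
    new_to_original = {}
    for image in coco["images"]:
        original = image["file_name"]
        # A fresh UUID per image makes every name (and stem) unique.
        new_name = uuid.uuid4().hex + PurePosixPath(original).suffix
        new_to_original[new_name] = original
        image["file_name"] = new_name
    return new_to_original


coco = {
    "images": [
        {"id": 1, "file_name": "image1.jpg"},
        {"id": 2, "file_name": "image1.png"},
    ]
}
new_to_original = rename_all_images(coco)
```

Because only `file_name` changes and the image IDs stay as they are, the annotations keep pointing at the same images.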

In the annotation files you attached all the images have id 1, this is not correct. Consider creating one just by exporting from CVAT or make sure the file is correct.
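A quick way to make sure a file is correct in this respect is to look for image IDs that occur more than once. The helper `duplicate_image_ids` is a hypothetical name for this sketch.

```python
from collections import Counter


def duplicate_image_ids(coco: dict) -> list[int]:
    """Return the image IDs that appear more than once in the images list."""
    counts = Counter(image["id"] for image in coco["images"])
    return sorted(image_id for image_id, n in counts.items() if n > 1)


broken = {
    "images": [
        {"id": 1, "file_name": "image1.jpg"},
        {"id": 1, "file_name": "image1.png"},
    ]
}
print(duplicate_image_ids(broken))
```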

Hi @adkbbx,

It was an architecture decision in Datumaro, so changing it there looks like quite a hard way to fix the problem. As the problem is a part of the Datumaro design, it affects several formats in CVAT. Actually, I think we can implement the 1st variant from what you suggested. Probably, the implementation should look like this:

  1. rename images in json to something unique each (uuid or simple numbers), save the mapping
  2. load the dataset as usual with Datumaro
  3. add filename mapping option in import_dm_annotations / match_dm_item, supply the mapping
  4. map ids back to original names in match_dm_item, maybe add a new matching case for this

This solution can be reused for different formats, if needed.
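Steps 3 and 4 of the list above could be sketched roughly as follows. This does not use the real signatures of import_dm_annotations or match_dm_item. `build_id_mapping`, `match_item_to_frame` and `frame_by_name` are hypothetical stand-ins. It is also an assumption that the loaded item ID equals the temporary file name without its extension.

```python
from pathlib import PurePosixPath
from typing import Optional


def build_id_mapping(new_to_original: dict[str, str]) -> dict[str, str]:
    """Key the rename mapping by item ID (assumed: new name without its suffix)."""
    return {
        str(PurePosixPath(new).with_suffix("")): original
        for new, original in new_to_original.items()
    }


def match_item_to_frame(
    item_id: str,
    id_to_original: dict[str, str],
    frame_by_name: dict[str, int],
) -> Optional[int]:
    """Map a loaded item ID back to its original file name, then to a frame."""
    original = id_to_original.get(item_id)
    if original is None:
        return None  # the item was not part of the rename step
    return frame_by_name.get(original)


# Mapping saved in step 1, and the task's frames keyed by original file name.
new_to_original = {"a1.jpg": "image1.jpg", "b2.png": "image1.png"}
frame_by_name = {"image1.jpg": 0, "image1.png": 1}
id_to_original = build_id_mapping(new_to_original)
frames = [match_item_to_frame(i, id_to_original, frame_by_name) for i in ("a1", "b2", "zz")]
```

Matching on the full original name, extension included, is what keeps image1.jpg and image1.png apart.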

@adkbbx, I have assigned the issue to you. Please try to reproduce the issue first. After you reproduce it, please propose a solution here. My team will help you to polish the proposal.