mmocr: Getting AssertionError: UniformConcatDataset: OCRDataset: AnnFileLoader: when building the dataset

When I build the dataset, Python raises an AssertionError.

The complete traceback is shown below.

AssertionError: 


During handling of the above exception, another exception occurred:

AssertionError                            Traceback (most recent call last)

AssertionError: AnnFileLoader: 


During handling of the above exception, another exception occurred:

AssertionError                            Traceback (most recent call last)

AssertionError: OCRDataset: AnnFileLoader: 


During handling of the above exception, another exception occurred:

AssertionError                            Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/mmcv/utils/registry.py in build_from_cfg(cfg, registry, default_args)
     67     except Exception as e:
     68         # Normal TypeError does not print class name.
---> 69         raise type(e)(f'{obj_cls.__name__}: {e}')
     70 
     71 

AssertionError: UniformConcatDataset: OCRDataset: AnnFileLoader:
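
The chain of prefixes in the final message comes from the `raise type(e)(f'{obj_cls.__name__}: {e}')` line shown in the trace: each nested build step re-raises the same exception type with its class name prepended. The empty text after the last colon is consistent with an assertion that failed without a message. A minimal sketch of that behaviour follows; `build`, `failing_loader`, and `build_all` are hypothetical stand-ins, not the real mmcv or mmocr code.

```python
def build(name, inner):
    """Hypothetical stand-in for one registry build step.

    Runs `inner`; on failure, re-raises the same exception type with
    `name` prepended, mirroring the line shown in the traceback.
    """
    try:
        return inner()
    except Exception as e:
        raise type(e)(f'{name}: {e}')


def failing_loader():
    # A bare assert with no message yields an empty exception string.
    assert False


def build_all():
    # Nest the builds in the same order as the class names in the error.
    return build(
        'UniformConcatDataset',
        lambda: build(
            'OCRDataset',
            lambda: build('AnnFileLoader', failing_loader)))


try:
    build_all()
except AssertionError as e:
    message = str(e)
    print(repr(message))
```

Read this way, the innermost name in the message (AnnFileLoader) is where the assertion actually failed; the outer names only show which objects were being built around it.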

In this case, my annotations are in jsonl format. Hence, the parser under both the training and the testing loader is set to LineJsonParser.

loader_dt_train = dict(type='AnnFileLoader',
                       repeat=1,
                       file_format='jsonl',
                       file_storage_backend='disk',
                       parser=dict(type='LineJsonParser',
                                   keys=['filename', 'text']))

loader_dt_test = dict(type='AnnFileLoader',
                      repeat=1,
                      file_format='jsonl',
                      file_storage_backend='disk',
                      parser=dict(type='LineJsonParser',
                                  keys=['filename', 'text']))

train_datasets1 = dict(type='OCRDataset',
                       img_prefix=img_prefix,
                       ann_file=train_anno_file1,
                       loader=loader_dt_train,
                       pipeline=None,           
                       test_mode=False)



val_dataset = dict(type='OCRDataset',
                   img_prefix=img_prefix,
                   ann_file=train_anno_file1,
                   loader=loader_dt_test,
                   pipeline=None,               
                   test_mode=True)

I think the types (e.g., AnnFileLoader and OCRDataset) have been assigned properly.

May I know what the issue is?

The full code and issue can be reproduced via this Notebook.

About this issue

State: closed
Created 2 years ago
Comments: 21 (10 by maintainers)

Most upvoted comments

Hi @balandongiv, the problem is in your loader config. Changing file_format from jsonl to txt solves this issue:

loader_dt_train = dict(type='AnnFileLoader',
                       repeat=1,
                       file_format='txt',  # only txt and lmdb
                       file_storage_backend='disk',
                       parser=dict(type='LineJsonParser',
                                   keys=['filename', 'text']))

loader_dt_test = dict(type='AnnFileLoader',
                      repeat=1,
                      file_format='txt',  # only txt and lmdb
                      file_storage_backend='disk',
                      parser=dict(type='LineJsonParser',
                                  keys=['filename', 'text']))

The loader only accepts txt or lmdb as file_format. Essentially, jsonl files are stored as raw text like any txt file and are only parsed differently, which is why file_format='txt' combined with LineJsonParser works. Sorry for the confusion. I do think this design is inconsistent with the data converters and should be fixed soon.

gaotongxiao on May 21, 2022
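
To illustrate the point made in the answer above, that a jsonl file is plain text where each line is decoded separately, here is a minimal sketch using only the Python standard library. `parse_json_line` and the sample records are hypothetical and for illustration only; this is not MMOCR's actual LineJsonParser implementation.

```python
import json


def parse_json_line(line, keys):
    """Hypothetical helper: decode one JSON line and keep only `keys`."""
    record = json.loads(line)
    missing = [k for k in keys if k not in record]
    if missing:
        raise KeyError(f'missing keys: {missing}')
    return {k: record[k] for k in keys}


# Stand-in for the contents of an annotation file read as raw text.
raw_text = (
    '{"filename": "img_001.jpg", "text": "hello"}\n'
    '{"filename": "img_002.jpg", "text": "world"}\n'
)

# Step 1: treat the file as plain text and split it into lines.
# Step 2: let the parser turn each non-empty line into a dict.
annotations = [
    parse_json_line(line, keys=['filename', 'text'])
    for line in raw_text.splitlines()
    if line.strip()
]
print(annotations)
```

The storage step (reading lines of text) and the parsing step (decoding each line as JSON) are independent here, which matches the split in the config between file_format and parser.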