vision: Problems training Faster-RCNN from pretrained backbone

Is there any recommendation for training Faster-RCNN starting from a pretrained backbone? I’m using the VOC 2007 dataset, and I’m able to do transfer learning starting from:

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes=21)

Using the COCO-pretrained ‘fasterrcnn_resnet50_fpn’ I’m able to obtain an mAP of 79% on the VOC 2007 test set. Problems arise when I try to train from scratch using only the pretrained backbone:

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes=21)
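
As far as I know (worth double-checking for the torchvision version you have installed), pretrained=False still loads ImageNet weights for the backbone, because pretrained_backbone defaults to True, so the snippet above should already match the “pretrained backbone only” setup. Making that explicit:

# pretrained_backbone exists in older torchvision releases; newer ones use weights_backbone instead
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False,          # no COCO detection weights
    pretrained_backbone=True)  # keep the ImageNet-pretrained ResNet-50 backbone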

I have been trying to train this model for weeks, but the highest mAP I got was 63% (again on the test set).

Now, I know that training from scratch is harder, but I would really like to know how to set the training parameters to obtain decent accuracy. In the future I may want to change the backbone, and chances are I will not be able to find a pretrained Faster-RCNN on which I can do transfer learning.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 44 (16 by maintainers)

Most upvoted comments

@hktxt FYI, I can easily get 72% mAP using the example provided in the FasterRCNN source code with mobilenet_v2 as the backbone:

    import torchvision
    from torchvision.models.detection.rpn import AnchorGenerator

    # ImageNet-pretrained MobileNetV2 feature extractor as the backbone
    backbone = torchvision.models.mobilenet_v2(pretrained=True).features
    backbone.out_channels = 1280
    anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                       aspect_ratios=((0.5, 1.0, 2.0),))
    # note: newer torchvision releases expect featmap_names=['0'] (strings)
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                    output_size=7,
                                                    sampling_ratio=2)
    model = torchvision.models.detection.faster_rcnn.FasterRCNN(backbone,
                                                                num_classes=21,
                                                                rpn_anchor_generator=anchor_generator,
                                                                box_roi_pool=roi_pooler)

No need to modify the BoxHead.

@fmassa I found out what my main problem was: I was using the val set for validation only, whereas to get good results on PASCAL VOC 2007 you are supposed to train on trainval altogether. Also, thanks to @hktxt’s comment I got 66% mAP training from scratch (just 3% less than expected). If anyone is interested, here are the highlights:

Backbone

    import torch
    from torch import nn
    import torchvision
    from torchvision.models.detection.rpn import AnchorGenerator

    # ImageNet-pretrained VGG16; drop the last max-pool so the feature map stride stays 16
    vgg = torchvision.models.vgg16(pretrained=True)
    backbone = vgg.features[:-1]
    # freeze the early convolutional layers
    for layer in backbone[:10]:
        for p in layer.parameters():
            p.requires_grad = False
    backbone.out_channels = 512
    anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                       aspect_ratios=((0.5, 1.0, 2.0),))
    # note: newer torchvision releases expect featmap_names=['0'] (strings)
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                    output_size=7,
                                                    sampling_ratio=2)

    class BoxHead(nn.Module):
        # reuse the VGG16 fully connected layers (minus the final classifier) as the box head
        def __init__(self, vgg):
            super(BoxHead, self).__init__()
            self.classifier = nn.Sequential(*list(vgg.classifier._modules.values())[:-1])

        def forward(self, x):
            x = x.flatten(start_dim=1)
            x = self.classifier(x)
            return x

    box_head = BoxHead(vgg)

Model

    model = torchvision.models.detection.faster_rcnn.FasterRCNN(
        backbone,  # num_classes is omitted because a box_predictor is passed explicitly
        rpn_anchor_generator=anchor_generator,
        box_roi_pool=roi_pooler,
        box_head=box_head,
        box_predictor=torchvision.models.detection.faster_rcnn.FastRCNNPredictor(4096, num_classes=21))
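
To make sure the pieces fit together before a full training run, a quick smoke test looks like this (this part is just a sketch, not something from the recipe above):

    # dummy batch: a list of images and a list of target dicts
    images = [torch.rand(3, 375, 500), torch.rand(3, 333, 500)]
    targets = [{"boxes": torch.tensor([[50., 60., 200., 220.]]), "labels": torch.tensor([12])},
               {"boxes": torch.tensor([[10., 10., 120., 300.]]), "labels": torch.tensor([3])}]

    model.train()
    loss_dict = model(images, targets)   # dict with loss_classifier, loss_box_reg and the RPN losses
    loss = sum(loss_dict.values())

    model.eval()
    detections = model(images)           # list of dicts with 'boxes', 'labels', 'scores'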

Dataset

    dataset = VOCDetection(img_folder=root, year='2007', image_set='trainval', transforms=transforms)

The only augmentation I used was RandomHorizontalFlip.

Parameters

--epochs 40
--lr-steps 30
--momentum 0.9
--lr-gamma 0.1
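
If it helps, this is roughly how those flags map onto the optimizer and scheduler (sketch only: the base learning rate is not listed above, so the value below is a placeholder, and data_loader is a detection data loader like the one sketched further down in the thread):

    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9)     # --momentum 0.9; lr is a placeholder
    lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                        milestones=[30],  # --lr-steps 30
                                                        gamma=0.1)        # --lr-gamma 0.1

    model.train()
    for epoch in range(40):                                         # --epochs 40
        for images, targets in data_loader:
            loss_dict = model(images, targets)
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        lr_scheduler.step()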

Thanks @lpuglia for the PR!

I’ll have a closer look at the PR (and get it merged) once I’m back from holidays.

@fmassa here is the pull request: https://github.com/pytorch/vision/pull/1216. It should work out of the box.

@lpuglia I think we should add one example with Pascal VOC somewhere. If you could send an initial PR, I could look into improving it and merging it into torchvision.

@fmassa I tried them both: the first actually decreases the accuracy for some reason, and the second makes no difference. I will train from scratch on COCO and then use transfer learning to see if I can get 70% on Pascal. Thanks for the help!

@hktxt my advice is to make sure the visibility checks are enabled and to use the following class for the conversion:

import torch

class ConvertVOCtoCOCO(object):
    CLASSES = (
        "__background__", "aeroplane", "bicycle",
        "bird", "boat", "bottle", "bus", "car",
        "cat", "chair", "cow", "diningtable", "dog",
        "horse", "motorbike", "person", "pottedplant",
        "sheep", "sofa", "train", "tvmonitor",
    )
    def __call__(self, image, target):
        anno = target['annotations']
        filename = anno["filename"].split('.')[0]
        h, w = anno['size']['height'], anno['size']['width']
        boxes = []
        classes = []
        objects = anno['object']
        if not isinstance(objects, list):
            objects = [objects]
        for obj in objects:
            bbox = obj['bndbox']
            bbox = [int(bbox[n]) - 1 for n in ['xmin', 'ymin', 'xmax', 'ymax']]
            boxes.append(bbox)
            classes.append(self.CLASSES.index(obj['name']))

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        classes = torch.as_tensor(classes)

        image_id = anno['filename'][:-4]
        image_id = torch.as_tensor([int(image_id)])

        target = {}
        target["boxes"] = boxes
        target["labels"] = classes
        target['name'] = image_id  # numeric part of the filename, stored as an integer id

        return image, target
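
In case it’s useful, this is how I would plug the conversion into the dataset’s _transforms pipeline. The Compose and ToTensor helpers below are small stand-ins for the ones in the torchvision references/detection scripts (they take and return (image, target) pairs), written out here so the snippet is self-contained:

import torchvision.transforms.functional as TF

class Compose(object):
    # chain transforms that operate on (image, target) pairs
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, image, target):
        for t in self.transforms:
            image, target = t(image, target)
        return image, target

class ToTensor(object):
    # convert the PIL image to a float tensor, leave the target untouched
    def __call__(self, image, target):
        return TF.to_tensor(image), target

transforms = Compose([ConvertVOCtoCOCO(), ToTensor()])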

Also (I don’t know if this is useful yet), but make sure to have a 10,022-image dataset by flipping all the images. This is different from random flipping because you make sure that every image is shown to the network twice per epoch, once in each orientation. If you use this strategy you will need just 15 epochs to train the network. Here is my code:

import torchvision

class VOCDetection_flip(torchvision.datasets.VOCDetection):
    # wraps VOCDetection so every image is seen twice per epoch: once as-is, once flipped
    def __init__(self, img_folder, year, image_set, transforms):
        super(VOCDetection_flip, self).__init__(img_folder, year, image_set)
        self._transforms = transforms

    def __getitem__(self, idx):
        real_idx = idx//2
        img, target = super(VOCDetection_flip, self).__getitem__(real_idx)
        target = dict(image_id=real_idx, annotations=target['annotation'])
        if self._transforms is not None:
            img, target = self._transforms(img, target)
            # img = img[[2, 1, 0],:]

            if (idx % 2) == 0:
                # even indices return the horizontally flipped copy (image and boxes)
                height, width = img.shape[-2:]
                img = img.flip(-1)
                bbox = target["boxes"]
                bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
                target["boxes"] = bbox

        return img, target

    def __len__(self):
        return 2*len(self.images)
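
And, in case it helps, a sketch of how I would feed it to a DataLoader; the collate_fn keeps images and targets as lists, because detection batches contain differently sized images (root is the VOC folder and transforms is the pipeline from the conversion snippet above):

import torch

def collate_fn(batch):
    # keep images and targets as lists instead of stacking them into one tensor
    return tuple(zip(*batch))

dataset = VOCDetection_flip(img_folder=root, year='2007',
                            image_set='trainval', transforms=transforms)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True,
                                          num_workers=4, collate_fn=collate_fn)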

@fmassa removing the visibility check decreases the accuracy from 66% to 64%.

@fmassa It was enabled the whole time. I don’t know how much it influenced the training; I’m going to repeat the test with it commented out and let you know (my guess is that it doesn’t change much).