PSPNet-Keras-tensorflow: Bad results (Not bad now)

Although the converted weights produce plausible predictions, they are not yet up to the published results of the PSPNet paper.

Current results on the Cityscapes validation set:

classes          IoU      nIoU
--------------------------------
road          : 0.969      nan
sidewalk      : 0.776      nan
building      : 0.871      nan
wall          : 0.532      nan
fence         : 0.464      nan
pole          : 0.302      nan
traffic light : 0.375      nan
traffic sign  : 0.567      nan
vegetation    : 0.872      nan
terrain       : 0.591      nan
sky           : 0.905      nan
person        : 0.585    0.352
rider         : 0.253    0.147
car           : 0.897    0.698
truck         : 0.606    0.284
bus           : 0.721    0.375
train         : 0.652    0.388
motorcycle    : 0.344    0.147
bicycle       : 0.618    0.348
--------------------------------
Score Average : 0.626    0.342
--------------------------------


categories       IoU      nIoU
--------------------------------
flat          : 0.974      nan
nature        : 0.876      nan
object        : 0.397      nan
sky           : 0.905      nan
construction  : 0.872      nan
human         : 0.603    0.376
vehicle       : 0.879    0.676
--------------------------------
Score Average : 0.787    0.526
--------------------------------

Accuracy of the published code on several validation/testing sets according to the author:

PSPNet50 on ADE20K valset (mIoU/pAcc): 41.68/80.04 
PSPNet101 on VOC2012 testset (mIoU): 85.41 (multiscale evaluation!)
PSPNet101 on Cityscapes valset (mIoU/pAcc): 79.70/96.38

So on Cityscapes we are still missing 79.70 - 62.60 = 17.10 percentage points of mean IoU.
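As a sanity check on that arithmetic, the class average and the gap can be recomputed from the per-class IoU column of the table above. This is only a small sketch: the variable names are made up for illustration, and the 79.70 figure is the author's reported number, not something measured here.

```python
# Per-class IoU values copied from the Cityscapes "classes" table above.
class_iou = {
    "road": 0.969, "sidewalk": 0.776, "building": 0.871, "wall": 0.532,
    "fence": 0.464, "pole": 0.302, "traffic light": 0.375,
    "traffic sign": 0.567, "vegetation": 0.872, "terrain": 0.591,
    "sky": 0.905, "person": 0.585, "rider": 0.253, "car": 0.897,
    "truck": 0.606, "bus": 0.721, "train": 0.652, "motorcycle": 0.344,
    "bicycle": 0.618,
}

# Mean IoU is the unweighted average over the 19 evaluated classes.
mean_iou = sum(class_iou.values()) / len(class_iou)

# Reported PSPNet101 mIoU on the Cityscapes validation set (from the author).
published_miou = 0.7970

gap_points = (published_miou - mean_iou) * 100

print(f"mean IoU: {mean_iou:.3f}")                # mean IoU: 0.626
print(f"gap: {gap_points:.1f} percentage points")  # gap: 17.1 percentage points
```

The recomputed average agrees with the "Score Average" row, so the gap is not an artifact of how the table was averaged.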

Does anyone have an idea where we lose that accuracy?

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 24 (15 by maintainers)

Most upvoted comments

@wtliao Unfortunately, I have not (yet) trained these weights myself. Sliced/sliding prediction yields much more detail because the weights were trained on 714x714 crops of the full-resolution image, and prediction then runs on 714x714 crops as well, instead of on a downsampled 512x256 image whose output is upsampled back to full resolution. Flipped evaluation means predicting on both the image and a horizontally mirrored copy of it (flipped about the vertical axis), mirroring the second prediction back, and averaging the two results.
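To make the two evaluation tricks in this comment concrete, here is a minimal NumPy sketch of sliding-window prediction combined with flipped evaluation. It is not the repository's actual code. `predict_fn` is a hypothetical callable standing in for the network, `dummy_predict` is a toy replacement so the example runs without weights, the one-third tile overlap is an arbitrary choice, and the flip is assumed to be a left-right mirror (a flip about the vertical axis), which is the usual choice for street scenes. Only the 714 crop size comes from the comment.

```python
import numpy as np


def predict_sliding(image, predict_fn, num_classes, crop=714, overlap=1 / 3):
    """Average class probabilities over overlapping crop x crop tiles.

    predict_fn is a hypothetical callable mapping an (h, w, 3) array to
    (h, w, num_classes) probabilities; it stands in for the network.
    """
    height, width = image.shape[:2]
    stride = max(1, int(crop * (1 - overlap)))
    probs = np.zeros((height, width, num_classes), dtype=np.float64)
    counts = np.zeros((height, width, 1), dtype=np.float64)

    def starts(size):
        # Tile origins along one axis; the last tile is shifted back so
        # that it ends exactly at the image border.
        if size <= crop:
            return [0]
        positions = list(range(0, size - crop, stride))
        positions.append(size - crop)
        return positions

    for y in starts(height):
        for x in starts(width):
            tile = image[y:y + crop, x:x + crop]
            probs[y:y + crop, x:x + crop] += predict_fn(tile)
            counts[y:y + crop, x:x + crop] += 1

    # Every pixel is covered by at least one tile, so counts is never zero.
    return probs / counts


def predict_with_flip(image, predict_fn, num_classes, **kwargs):
    """Average the predictions for the image and its left-right mirror."""
    plain = predict_sliding(image, predict_fn, num_classes, **kwargs)
    mirrored = predict_sliding(image[:, ::-1], predict_fn, num_classes, **kwargs)
    # Mirror the second prediction back before averaging.
    return (plain + mirrored[:, ::-1]) / 2


def dummy_predict(tile):
    # Toy stand-in for the network: two per-pixel "probabilities".
    red = tile[..., 0]
    return np.stack([red, 1.0 - red], axis=-1)


image = np.random.default_rng(0).random((20, 30, 3))
result = predict_with_flip(image, dummy_predict, num_classes=2, crop=8)
print(result.shape)  # (20, 30, 2)
```

A real network usually needs fixed-size inputs, so tiles from images smaller than the crop would have to be padded first; the sketch skips that step. The final label map would be `result.argmax(axis=-1)`.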