mace: Caffe model validation fails on MACE v0.12.0 due to low similarity

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu16.04 (MACE Docker image)
NDK version(e.g., 15c): 18b
GCC version(if compiling for host, e.g., 5.4.0): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
MACE version (Use the command: git describe --long --tags): 0.12.0
Python version(2.7): 3.6
Bazel version (e.g., 0.13.0): [0.16.0]
CMake version: 3.16.0

Model deploy file (*.yml)

# The name of library
library_name: libretinanet
target_abis: [arm64-v8a]
model_graph_format: code
model_data_format: code
models:
  retinanet:
    platform: caffe
    model_file_path: /models/retinanet/retinanet3.prototxt
    weight_file_path: /models/retinanet/retinanet3.caffemodel
    model_sha256_checksum: 638e05fc466737c3b8fc36261adaaff40cbd4de5a8c72a46b37f2b00f01180e1
    weight_sha256_checksum: 6222910a773c693c23b4765baba4ed8427e9f3c11781060918e6282a297a7437
    subgraphs:
      - input_tensors:
          - data
        input_shapes:
          - 1,3,320,320
        input_data_formats:
          - NCHW
        output_tensors:
          - face_rpn_cls_prob_reshape_stride32
          - face_rpn_bbox_pred_stride32
          - face_rpn_landmark_pred_stride32
          - face_rpn_cls_prob_reshape_stride16
          - face_rpn_bbox_pred_stride16
          - face_rpn_landmark_pred_stride16
          - face_rpn_cls_prob_reshape_stride8
          - face_rpn_bbox_pred_stride8
          - face_rpn_landmark_pred_stride8
        output_shapes:
          - 1,4,10,10
          - 1,8,10,10
          - 1,20,10,10
          - 1,4,20,20
          - 1,8,20,20
          - 1,20,20,20
          - 1,4,40,40
          - 1,8,40,40
          - 1,20,40,40
        output_data_formats:
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
          - NCHW
    obfuscate: 0
    runtime: gpu
    winograd: 4

Describe the problem

With MACE v0.12.0, the output of the converted model differs significantly from the Caffe output, so the run fails in validation mode. I also ran the unit tests in Android Studio and found the same output difference there. With MACE v0.11.0-rc1, however, the output is fine and validation runs successfully.

========================================================
     capability(CPU)        init      warmup     run_avg
========================================================
time           7.484     877.331    1323.142      16.371
I mace/libmace/mace.cc:636] Destroying MaceEngine
Running finished!

* Validate with caffe
Pull /data/local/tmp/mace_run/model_out_face_rpn_cls_prob_reshape_stride32 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_bbox_pred_stride32 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_landmark_pred_stride32 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_cls_prob_reshape_stride16 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_bbox_pred_stride16 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_landmark_pred_stride16 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_cls_prob_reshape_stride8 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_bbox_pred_stride8 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Pull /data/local/tmp/mace_run/model_out_face_rpn_landmark_pred_stride8 to build/libretinanet/_tmp/retinanet/14a062bf9f488e2a38c1fb60b18a80de/MI9_msmnile/arm64-v8a
Traceback (most recent call last):
  File "/mace/validate.py", line 459, in <module>
face_rpn_cls_prob_reshape_stride32 MACE VS CAFFE similarity: 0.7051022200187764 , sqnr: 1.988659806883121 , pixel_accuracy: 0.4
    FLAGS.log_file)
  File "/mace/validate.py", line 371, in validate
    validation_threshold, log_file)
  File "/mace/validate.py", line 262, in validate_caffe_model
    value, validation_threshold, log_file)
  File "/mace/validate.py", line 113, in compare_output
    "", common.StringFormatter.block("Similarity Test Failed"))
TypeError: summary() takes exactly 1 argument (2 given)
Traceback (most recent call last):
  File "tools/converter.py", line 1151, in <module>
    flags.func(flags)
  File "tools/converter.py", line 938, in run_mace
    device.run_specify_abi(flags, configs, target_abi)
  File "/mace/tools/device.py", line 782, in run_specify_abi
    log_file=log_file,
  File "/mace/tools/sh_commands.py", line 756, in validate_model
    _fg=True)
  File "/root/.pyenv/versions/3.6.3/lib/python3.6/site-packages/sh.py", line 1413, in __call__
    raise exc
sh.ErrorReturnCode_1: 

  RAN: /usr/bin/docker exec mace_caffe_lastest_validator python -u /mace/validate.py --platform=caffe --model_file=/mace/retinanet3.prototxt --weight_file=/mace/retinanet3.caffemodel --input_file=/mace/model_input --mace_out_file=/mace/model_out --device_type=GPU --input_node=data --output_node=face_rpn_cls_prob_reshape_stride32,face_rpn_bbox_pred_stride32,face_rpn_landmark_pred_stride32,face_rpn_cls_prob_reshape_stride16,face_rpn_bbox_pred_stride16,face_rpn_landmark_pred_stride16,face_rpn_cls_prob_reshape_stride8,face_rpn_bbox_pred_stride8,face_rpn_landmark_pred_stride8 --input_shape=1,3,320,320 --output_shape=1,4,10,10:1,8,10,10:1,20,10,10:1,4,20,20:1,8,20,20:1,20,20,20:1,4,40,40:1,8,40,40:1,20,40,40 --input_data_format=NCHW --output_data_format=NCHW,NCHW,NCHW,NCHW,NCHW,NCHW,NCHW,NCHW,NCHW --validation_threshold=0.995000 --input_data_type=float32 --backend=tensorflow --validation_outputs_data= --log_file=

  STDOUT:


  STDERR:
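
The log above reports a similarity of about 0.705 for face_rpn_cls_prob_reshape_stride32, while the command line passes --validation_threshold=0.995000. The following is a minimal sketch of that kind of check, assuming the similarity is a plain cosine similarity over the flattened outputs. The function names and the `>=` comparison are illustrative assumptions, not code copied from validate.py.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally sized flat sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Degenerate case: an all-zero output has no direction to compare.
        return 0.0
    return dot / (norm_u * norm_v)


def passes_validation(mace_out, reference_out, threshold=0.995):
    """Hypothetical helper: True when the outputs are similar enough."""
    return cosine_similarity(mace_out, reference_out) >= threshold


# Identical outputs pass; outputs pointing in different directions do not.
same = passes_validation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = passes_validation([1.0, 0.0], [0.0, 1.0])
```

Under this assumption, a score of 0.705 means the MACE output vector points in a clearly different direction from the Caffe one, which is consistent with a layout or ordering bug rather than small numeric drift.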

To Reproduce

Steps to reproduce the problem:

1. cd /path/to/mace
2. python tools/converter.py convert --config_file=/models/retinanet/retinanet3.yml
3. python tools/converter.py run --validate --config_file=/models/retinanet/retinanet3.yml

Error information / logs

Please refer to this gist link for the full conversion and validation log: MACE v0.12.0 Error log - validation failed · GitHub

Additional context

For MACE v0.12.0, I applied the workaround from the last answer in this issue: https://github.com/XiaoMi/mace/issues/560

diff --git a/tools/python/transform/transformer.py b/tools/python/transform/transformer.py
index bb9154f..bbf14b4 100644
--- a/tools/python/transform/transformer.py
+++ b/tools/python/transform/transformer.py
@@ -1353,7 +1353,7 @@ class Transformer(base_converter.ConverterInterface):
         visited = set()
         sorted_nodes = []
 
-        output_nodes = self._option.check_nodes.keys()
+        output_nodes = list(self._option.check_nodes.keys())
         if not self._quantize_activation_info:
             output_nodes.extend(self._option.output_nodes)
         for output_node in output_nodes:
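
The patch above is needed because, under Python 3, `dict.keys()` returns a view object that has no `extend` method, whereas Python 2 returned a plain list. A small sketch of the difference, using made-up dictionary contents in place of the converter's real `check_nodes` and `output_nodes`:

```python
# Stand-ins for self._option.check_nodes and self._option.output_nodes;
# the contents here are illustrative only.
check_nodes = {"face_rpn_bbox_pred_stride32": None}
extra_output_nodes = ["face_rpn_cls_prob_reshape_stride32"]

# Python 3: keys() is a dict view, so extend() does not exist on it.
keys_view = check_nodes.keys()
try:
    keys_view.extend(extra_output_nodes)
    view_supports_extend = True
except AttributeError:
    view_supports_extend = False

# Wrapping the view in list() gives a mutable copy that can be extended.
output_nodes = list(check_nodes.keys())
output_nodes.extend(extra_output_nodes)
```

Copying into a list also leaves the original dictionary untouched, so extending the list has no side effect on `check_nodes`.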

About this issue

State: closed
Created 4 years ago
Comments: 21 (8 by maintainers)

Most upvoted comments

@gasgallo Sorry, perhaps tomorrow.

lu229 on Mar 13, 2020

We can try to fix this issue at https://github.com/XiaoMi/mace/pull/611

gasgallo on Mar 20, 2020

@gasgallo Thank you. Both v0.11.0-rc4 and v0.11.0-rc1 support your model correctly. The file that contains the bug is “/mace/mace/ops/opencl/image/reshape.cc”. When the model comes from Caffe, we should transform the data from NHWC to NCHW format before the Resize invocation, and after the Resize invocation we should transform the data from NCHW back to NHWC format.

lu229 on Mar 16, 2020
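
The layout issue described in the comment above can be illustrated with a small pure-Python sketch. It assumes a flat buffer and a reshape that only reinterprets the buffer under a new shape. The helper names and the tiny 1x2x2x2 tensor are illustrative and are not taken from reshape.cc.

```python
def nchw_to_nhwc(flat, n, c, h, w):
    """Reorder a flat NCHW buffer into NHWC order."""
    out = [None] * len(flat)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = ((ni * h + hi) * w + wi) * c + ci
                    out[dst] = flat[src]
    return out


def nhwc_to_nchw(flat, n, c, h, w):
    """Reorder a flat NHWC buffer into NCHW order."""
    out = [None] * len(flat)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * h + hi) * w + wi) * c + ci
                    dst = ((ni * c + ci) * h + hi) * w + wi
                    out[dst] = flat[src]
    return out


# A Caffe-style tensor of shape (N=1, C=2, H=2, W=2), stored as NCHW.
caffe_nchw = list(range(8))
# The same tensor as a runtime that stores data in NHWC would hold it.
device_nhwc = nchw_to_nhwc(caffe_nchw, 1, 2, 2, 2)

# Naive reshape to (N=1, C=4, H=1, W=2): reinterpret the NHWC buffer as is.
naive = list(device_nhwc)

# Layout-aware reshape: go back to NCHW, reinterpret, then return to NHWC.
as_nchw = nhwc_to_nchw(device_nhwc, 1, 2, 2, 2)
correct = nchw_to_nhwc(as_nchw, 1, 4, 1, 2)
```

In this example `naive` and `correct` hold the same eight values in a different order, which is the kind of mismatch that would lower the similarity score without producing an outright error.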

@mexeniz It seems that MACE has a bug in its support for the Split layer from Caffe. I’m busy these days, so you can debug it yourself (reference this) or I’ll check it out in a few days.

lu229 on Mar 3, 2020

@mexeniz Sorry for the late reply. Please use v0.12.0 and apply this patch. With v0.12.0, I have checked your model and found no errors.

lu229 on Feb 26, 2020