onnx2tf: [YOLOX-TI] ERROR: onnx_op_name: /head/ScatterND
Issue Type
Others
onnx2tf version number
1.8.1
onnx version number
1.13.1
tensorflow version number
2.12.0
Download URL for ONNX
yolox_nano_ti_lite_26p1_41p8.zip
Parameter Replacement JSON
{
  "format_version": 1,
  "operations": [
    {
      "op_name": "/head/ScatterND",
      "param_target": "inputs",
      "param_name": "/head/Concat_1_output_0",
      "values": [1, 85, 52, 52]
    }
  ]
}
Description
Hi @PINTO0309. After our lengthy discussion regarding INT8 YOLOX export I decided to try out TI's version of these models (https://github.com/TexasInstruments/edgeai-yolox/tree/main/pretrained_models). It looked to me that you managed to INT8-export those, so maybe you could provide some hints 😄. I just downloaded the ONNX version of YOLOX-nano. For this model, the following fails:
onnx2tf -i ./yolox_nano.onnx -o yolox_nano_saved_model
The error I get:
ERROR: input_onnx_file_path: /datadrive/mikel/edgeai-yolox/yolox_nano.onnx
ERROR: onnx_op_name: /head/ScatterND
ERROR: Read this and deal with it. https://github.com/PINTO0309/onnx2tf#parameter-replacement
ERROR: Alternatively, if the input OP has a dynamic dimension, use the -b or -ois option to rewrite it to a static shape and try again.
ERROR: If the input OP of ONNX before conversion is NHWC or an irregular channel arrangement other than NCHW, use the -kt or -kat option.
ERROR: Also, for models that include NonMaxSuppression in the post-processing, try the -onwdt option.
- Research
- Export error
- I tried to overwrite the parameter values with the replacement JSON provided above, with no luck
- Project need
- Operation that fails can be found in the image below:
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 70 (70 by maintainers)
Agree, let’s close this. Enough experimentation on this topic 😄 . Again, thank you both @motokimura, @PINTO0309 for time and guidance during this quantization journey. I learnt a lot, hopefully you got something out of the experiment results posted here as well 🙏
@mikel-brostrom As for the accuracy degradation of YOLOX integer quantization, I think it may be due to the distribution mismatch of xywh and score values.
Just before the last Concat, xywh seems to have a distribution of (min, max)~(0.0, 416.0). On the other hand, scores have a much narrower distribution of (min, max) = (0.0, 1.0) because of sigmoid.
In TFLite quantization, activations are quantized in a per-tensor manner. That is, the combined (OR) distribution of xywh and scores, (min, max) = (0.0, 416.0), is mapped to integer values of (min, max) = (0, 255) after the Concat. As a result, even if the score is 1.0, after quantization it is mapped to int(1.0 / 416 * 255) = int(0.61) = 0, resulting in all scores being zero!
A possible solution is to divide xywh tensors by the image size (416) to keep it in the range (min, max) ~ (0.0, 1.0) and then concat with the score tensor so that scores are not “collapsed” due to the per-tensor quantization.
The same workaround is done in YOLOv5: https://github.com/ultralytics/yolov5/blob/b96f35ce75effc96f1a20efddd836fa17501b4f5/models/tf.py#L307-L310
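To make the idea concrete, here is a minimal numpy sketch (my own illustration, not code from either repo; the (cx, cy, w, h, scores) layout and the 416 input size are assumptions based on this thread):

```python
import numpy as np

IMG_SIZE = 416.0  # assumed YOLOX-nano input size

def concat_naive(xywh, scores):
    # xywh spans roughly (0, 416) while scores span (0, 1); per-tensor
    # quantization of the concatenated tensor maps (0, 416) onto 0..255,
    # so even a perfect score of 1.0 quantizes to int(1.0 / 416 * 255) = 0.
    return np.concatenate([xywh, scores], axis=-1)

def concat_quant_friendly(xywh, scores):
    # Normalize the boxes to (0, 1) first so both halves share a similar
    # range; multiply by the image size again outside the model, after
    # dequantization (e.g. outputs[:, :, 0:4] *= 416).
    return np.concatenate([xywh / IMG_SIZE, scores], axis=-1)
```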
Great that we get this into YOLOv8 as well @motokimura! Thank you both for this joint effort ❤️
(results table with mAP@0.5:0.95 and mAP@0.5 columns; values not captured)
Going for a full COCO eval now 🚀
@PINTO0309 🚀 ! I just implemented what you explained here: https://github.com/PINTO0309/onnx2tf/issues/269#issuecomment-1488349530. What is the rationale behind this?
(results table with mAP@0.5:0.95 and mAP@0.5 columns; values not captured)
Just a hunch on my part, but if you do not `Concat` at the end, maybe there will be no accuracy degradation. I will have to try it out to find out. In the first place, I feel that the difference in value ranges is too large. Then `Concat` may not be relevant. Ref: https://github.com/PINTO0309/onnx2tf/issues/269#issuecomment-1483090981
By the way, `_int16_act` seems to be an experimental implementation of TFLite, so there are still many bugs and unsupported OPs. https://www.tensorflow.org/lite/performance/post_training_integer_quant_16x8

@mikel-brostrom Thanks for sharing your results! https://github.com/PINTO0309/onnx2tf/issues/269#issuecomment-1486969872 The accuracy degradation because of the decoder is interesting…
You may find something if you compare the fp32/int8 TFLite final outputs. Even without onnx2tf's new feature, you can do it by saving the output arrays into npy files and then comparing them.
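For example, something along these lines (a rough sketch; the file names are placeholders and I am assuming both TFLite models keep float32 input/output tensors):

```python
import numpy as np
import tensorflow as tf

def run_tflite(model_path, image):
    # Run a single image through a TFLite model and return its first output.
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp['index'], image.astype(inp['dtype']))
    interpreter.invoke()
    return interpreter.get_tensor(out['index'])

image = np.load('preprocessed_input.npy')  # e.g. shape (1, 416, 416, 3), float32
fp32 = run_tflite('yolox_nano_float32.tflite', image)
int8 = run_tflite('yolox_nano_full_integer_quant.tflite', image)
np.save('fp32_out.npy', fp32)
np.save('int8_out.npy', int8)
print('max abs diff:', np.abs(fp32 - int8).max())
```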
The figure below is the one from when I quantized YOLOv3. Left shows the distribution of the `x` channel, and right shows the distribution of the `w` channel. Orange is float, and blue is quantized. In the YOLOv3 case above, the `w` channel has a large quantization error. If you can visualize the output distribution like this, we may find which channel (`x`, `y`, `w`, `h`, and/or `class`) causes this accuracy degradation.
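If it helps, here is a sketch of how such per-channel histograms could be produced from the npy files saved above (the 85-channel, channels-last (x, y, w, h, obj, classes…) layout is my assumption about this model's decoded output):

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumes outputs shaped like (1, num_boxes, 85); adjust the reshape if not.
fp32 = np.load('fp32_out.npy').reshape(-1, 85)
int8 = np.load('int8_out.npy').reshape(-1, 85)

for ch, name in enumerate(['x', 'y', 'w', 'h', 'obj']):
    plt.figure()
    plt.hist(fp32[:, ch], bins=100, alpha=0.5, label='float')
    plt.hist(int8[:, ch], bins=100, alpha=0.5, label='quantized')
    plt.title(f'{name} channel distribution')
    plt.legend()
plt.show()
```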
A workaround has been implemented to avoid `ScatterND` shape mismatch errors as much as possible. In v1.8.3, the conversion succeeds as is even if `ScatterND` is included, and the accuracy check has been improved so that it no longer reports a problem. However, since NMS is included in the post-processing, accuracy verification with random data does not produce very good results. For an accurate accuracy check, it is better to use a still image of the kind the model is assumed to see at inference time. This is because accuracy checks using random data may result in a final output data count of zero.
https://github.com/PINTO0309/onnx2tf/releases/tag/1.8.3
In any case, `ScatterND` converts to a very verbose OP, so it is still better to create a model that replaces it with `Slice` as much as possible.
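As a hedged illustration of where the `ScatterND` can come from and how it can be avoided (variable names are hypothetical; the actual TI decode code may differ): in PyTorch, in-place slice assignment during the box decode exports to ONNX as `ScatterND`, while rebuilding the tensor with `cat` keeps the graph to `Slice`/`Concat`-style ops.

```python
import torch

def decode_inplace(outputs, grids, strides):
    # In-place slice assignment like this is exported to ONNX as ScatterND.
    outputs[..., :2] = (outputs[..., :2] + grids) * strides
    outputs[..., 2:4] = torch.exp(outputs[..., 2:4]) * strides
    return outputs

def decode_concat(outputs, grids, strides):
    # Rebuilding the tensor with cat avoids the in-place write, so the
    # exported graph uses Slice/Concat instead of ScatterND.
    xy = (outputs[..., :2] + grids) * strides
    wh = torch.exp(outputs[..., 2:4]) * strides
    return torch.cat([xy, wh, outputs[..., 4:]], dim=-1)
```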
as much as possible.I won’t have time to check this out today @motokimura. But will report back tomorrow with my findings 😄. Thanks again for your time and guidance
Sorry for my late reply. I spent most of the day creating the benchmark result plot for YOLOX on the specific hardware I am using. I added delegate results as well. `hexagon` is skipped as the target device has no Qualcomm chip. INT8 models don't get a boost on this chip due to the lack of an INT8 ISA. GPU boosts make sense as the Exynos9810 contains a Mali-G72MP18 GPU, but inference speed is quite similar to using XNNPACK with 4 threads.

Any idea why the memory footprint for the GPU delegate is so big compared to the others? Especially for the quantized one?
I am very interested. Probably other engineers besides myself as well.
Today and tomorrow will involve travel to distant places for work, which will slow down research and work.
Incidentally, Motoki seems to have succeeded in maintaining accuracy with INT8 quantization.
The model performance did not decrease after the changes, and for the first time I got results on one of the quantized models (`dynamic_range_quant`).

(results table with mAP@0.5:0.95 and mAP@0.5 columns; values not captured)

But still nothing for the `INT` ones though…

Ok. As I didn't see ScatterND in the original model, I checked what the differences were. I found out that this
gives:
While this:
gives:
This, as well as some other minor fixes, makes it possible to get rid of ScatterND completely.
At this point I have no idea beyond this comment about the quantization of Concat and what kind of quantization errors are actually happening inside… This Concat is not necessary by nature and has no benefit for model quantization, so I think we don't need to go any deeper into this.
All I can say at this point is that tensors with very different value ranges should not be concatenated, especially in post-processing of the model.
Thank you for doing the experiment and sharing your results!
Interesting. It actually made it worse…
(results table with mAP@0.5:0.95 and mAP@0.5 columns; values not captured)
Yup, sorry @motokimura, that's a typo. It is `outputs[:, :, 0:4] = outputs[:, :, 0:4] * 416`
There is no part of the model left to explain in more detail beyond Motoki's explanation, but again, take a good look at the quantization parameters around the final output of the model. I think you can see why `Concat` is a bad idea. All of them: `1.7974882125854492 * (q + 128)`. The values diverge when inverse quantization (`Dequantize`) is performed.
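One way to see this for yourself (a sketch; the model path is a placeholder) is to read the quantization parameters of the final output tensor with the TFLite Python API. With scale ≈ 1.797 and zero_point = -128, every channel of that output is dequantized as scale * (q - zero_point):

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='yolox_nano_full_integer_quant.tflite')
interpreter.allocate_tensors()
out = interpreter.get_output_details()[0]
scale, zero_point = out['quantization']  # one (scale, zero_point) pair for the whole tensor
print(scale, zero_point)
print(scale * (127 - zero_point))  # largest representable value, ~458 for the values above
```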
Perhaps that is why TI used `ScatterND`.

I will close this issue once the original problem has been solved and the INT8 quantization problem seems to have been resolved.
congratulations! 👍
It looks fine to me.
In/out quantization from top-left to bottom-right of the operations you pointed at:
Output looks like this now:
Get it!
Good to know 😄
Will double check everything tomorrow just to make sure there are no errors on my side
I explained it in a very simplified manner because it would be very complicated to explain in detail. You need to understand how onnx2tf checks the final and intermediate outputs.
Once you understand the principles of the accuracy checker, you will realize that minor errors can always occur, even if the model transformation is perfectly normal.
The `Matches`/`Unmatched` that appears is the result of the precision check in Float32, regardless of whether the model was quantized to INT8 or not.

However, I am very concerned about the zero mAP in the last benchmark result. 👀
Errors of less than 1e-3 hardly make any difference to the accuracy of the model. Errors introduced by Mul can be caused by slight differences in fraction handling between ONNX and TensorFlow. Ignoring it will only cause a difference that is not noticeable to the human eye.
I tried a complete model export (including `--export-det`) following @motokimura's instructions. I am aware of the fact that the post-processing step induces large errors on INT quantized models, as shown here: https://github.com/PINTO0309/onnx2tf/issues/269#issuecomment-1484182307. Despite all this I decided to proceed and check what performance I would get, as I want to do as little post-processing outside of the model as possible. These are my results:

(results table with mAP@0.5:0.95 and mAP@0.5 columns; values not captured)

Sorry for all the experiment results I am dropping here. I hope they can help somebody going through a similar kind of process. Without `--export-det` I get the same results as @motokimura 😄

I'm going to share how I quantized the nano model tonight. I've not yet done qualitative evaluation of the quantized model, but the detection result looks OK.
Apparently the benchmark binary can be run with the NNAPI delegate by `--use_nnapi=true` and with the GPU delegate by `--use_gpu=true` (source). This will give a better understanding of how this model actually performs with hardware accelerators. If anybody is interested I can upload those results as well 😄

Here is a video of me running an INT8 quantized SSD on a RaspberryPi4 CPU (Debian 64bit) alone in 2020. https://www.youtube.com/watch?v=bd3lTBAYIq4
RaspberryPi4 (CPU only) + Python3.7 + Tensorflow Lite + MobileNetV2-SSDLite + Sync + MP4 640x360
15FPS (about 66ms/pred)
Cortex-A55 may be a bit of an old architecture. I am not very familiar with the details of the CPU architecture, but I think Cortex-A7x may have faster inference because of the implementation of faster operations with Neon instructions. Performance seems to vary considerably depending on whether Arm NN can be called from TFLite.
I compiled the benchmark binary for android_arm64. The device has an Exynos9810, which is ARM 64-bit. It contains a Mali-G72MP18 GPU. However, I am running the model without GPU accelerators, so the INT8 model must be running on the CPU. The CPU was released in 2018, so that may explain why the quantized model is that slow…
I just cut the model at the point you suggested by:
But I get the following error:
I couldn’t find a similar issue and I had the same problem when I tried to cut YOLOX in our previous discussion. I probably misinterpreted how the tool is supposed to be used…
First, let me tell you that your results will vary greatly depending on the architecture of the CPU you are using for your verification. If you are using an Intel x64 (x86) or AMD x64 (x86) architecture CPU, the Float32 model should run inference about 10 times faster than the INT8 model; INT8 models are very slow on the x64 architecture. Perhaps on the RaspberryPi's ARM64 CPU with 4 threads it would be 10 times faster. The keyword XNNPACK is a good way to search for information. In the case of Intel's x64 architecture, CPUs of the 10th generation or later differ from CPUs of the 9th generation or earlier in the presence or absence of an optimization mechanism for Integer processing. If you are using a 10th generation or later CPU, it should run about 20% faster.
Therefore, when benchmarking using benchmarking tools, it is recommended to try to do so on ARM64 devices.
The benchmarking in the discussion on the ultralytics thread is not appropriate.
Next, let's look at dynamic range quantization. My tool does `per-channel` quantization by default. This is due to the TFLiteConverter specification. `per-channel` quantization calculates the quantization range for each channel of the tensor, which reduces the accuracy degradation and, at the same time, increases the cost of calculating the quantization range, which slows down inference a little. Also, most of the current edge devices in the world are not optimized for `per-channel` quantization. For example, the EdgeTPU only supports `per-tensor` quantization. Therefore, if quantization is to be performed with the assumption that the model will be put to practical use in the future, it is recommended that `per-tensor` quantization be performed during the transformation as follows:

- per-channel quant: (example command not captured)
- per-tensor quant: (example command not captured)
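As a rough illustration only (these are not the missing onnx2tf commands from this comment; the `_experimental_disable_per_channel` attribute is an experimental TFLiteConverter knob and may change between TF versions), full-integer quantization with a representative dataset looks roughly like this, with per-channel weights being the converter default:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    for _ in range(100):
        # Replace with real preprocessed calibration images.
        yield [np.random.rand(1, 416, 416, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model('yolox_nano_saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter._experimental_disable_per_channel = True  # assumption: forces per-tensor weights
tflite_model = converter.convert()
open('yolox_nano_per_tensor_int8.tflite', 'wb').write(tflite_model)
```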
Next, we discuss post-quantization accuracy degradation. I think Motoki's point is mostly correct. I think you should first try to split the model at the red line and see how the accuracy changes.
If the `Sigmoid` in this position does not affect the accuracy, it should work. It is better to think about complex problems by breaking them down into smaller problems without being too hasty.

hmm… As PINTO pointed out, it may be better to compare int8 and float model activations before the decoder part.
https://github.com/PINTO0309/onnx2tf/issues/269#issuecomment-1482738822
It may be helpful to export onnx without the `--export-det` option and compare the int8 and float outputs.
Feel free to play around with it: yolox_nano_no_scatternd.zip 😄