tensorflow: Model converted to TFLite always returns NaN as output.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS 10.13.6
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: –
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v1.10.0-12-g4dcfddc5d1 1.10.1
- Python version: 3.6.5
- Bazel version (if compiling from source): –
- GCC/Compiler version (if compiling from source): –
- CUDA/cuDNN version: –
- GPU model and memory: Intel Iris Plus Graphics 650 1536 MB
- Exact command to reproduce:
python3 test.py
Describe the problem
I have been trying to convert a frozen graph, trained using this repo, for use on Android with TFLite. The trained model uses MobileNetV2 as the frontend and Mobile UNet for Semantic Segmentation as the model. The problem I am facing is that the frozen pb graph segments the image correctly, but the TFLite-converted model returns all NaN as output. To reproduce the problem I wrote the following script. The model converts without any errors or warnings, but the output is not correct. Do you have any idea what might be causing this?
Note: the converted model also returns NaNs on an Android device.
Frozen graph: output_graph.pb
Source code / logs
test.py
import tensorflow as tf
import numpy as np
import cv2
from tensorflow.python.platform import gfile
from tensorflow.contrib.lite.python.convert_saved_model import set_tensor_shapes
sess = tf.Session()
# load graph
with gfile.FastGFile('output_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
sess.graph.as_default()
tf.import_graph_def(graph_def, name='')
# get tensors
input_tensor = sess.graph.get_tensor_by_name('Placeholder:0')
output_tensor = sess.graph.get_tensor_by_name('logits/Conv2D:0')
# generate random image
input_image = np.array(np.random.random_sample(
    [1, 128, 128, 3]), dtype=np.float32)
# run the model with tf
output_image = sess.run(output_tensor, feed_dict={input_tensor: input_image})
# print tf output
print('--- Tensorflow output ---')
print(output_image)
print('-------------------------')
# set shapes
input_tensor.set_shape([1, 128, 128, 3])
output_tensor.set_shape([1, 128, 128, 32])
# convert model
converter = tf.contrib.lite.TocoConverter.from_session(
    sess, [input_tensor], [output_tensor])
tflite_model = converter.convert()
# Prepare interpreter
interpreter = tf.contrib.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# set input data
interpreter.set_tensor(input_details[0]['index'], input_image)
# run model on interpreter
interpreter.invoke()
# retrieve output
output_data = interpreter.get_tensor(output_details[0]['index'])
# print tflite output
print('--- TFLite output ---')
print(output_data)
print('---------------------')
output
--- Tensorflow output ---
[[[[-14.484754 -14.454916 -3.9344878 ... -10.294399
-2.837898 -8.190185 ]
[-14.120294 -10.590508 -4.032942 ... -6.7745924
-0.4497184 -9.78646 ]
[-14.561665 -10.49988 -8.065053 ... -7.422716
-0.7991432 -10.160792 ]
...
[-13.12197 -7.3976164 -7.1669674 ... -9.533363
-2.0361094 -10.951963 ]
[-15.041047 -7.3879066 -6.724542 ... -11.897878
-2.1202648 -13.670592 ]
[-14.483544 -10.037312 -6.356632 ... -12.075281
-2.2860763 -10.284541 ]]
[[-10.372202 -13.09114 -3.6517806 ... -7.623592
-1.8009435 -6.817739 ]
[-10.72727 -10.886565 -5.621975 ... -7.8185344
-1.4768337 -10.389865 ]
[-11.611484 -10.158413 -7.931344 ... -4.938987
-0.23626254 -8.830031 ]
...
[-12.590868 -6.102834 -10.619679 ... -9.990441
-1.0927511 -10.764243 ]
[-12.30341 -4.7649236 -6.600345 ... -9.458132
-0.8608778 -12.198781 ]
[-11.649162 -6.2056537 -5.922945 ... -10.207803
-1.5887291 -9.819743 ]]
[[-11.40545 -13.755798 -6.9160714 ... -11.7735195
-3.3357754 -11.139454 ]
[-11.398698 -11.785369 -6.5561953 ... -9.794318
-2.8272014 -11.654141 ]
[ -9.548821 -7.3276024 -8.640192 ... -4.349879
0.14261375 -7.0007625 ]
...
[-12.497658 -5.8748426 -9.083981 ... -9.841493
-1.4732579 -11.357761 ]
[-14.517144 -5.2391934 -8.496638 ... -10.834668
-2.6033173 -13.944796 ]
[-14.292226 -7.0837607 -6.3621516 ... -10.551426
-3.6190045 -12.224428 ]]
...
[[ -6.1242228 -14.730902 -6.034355 ... -5.2220926
-1.1160429 -2.2097938 ]
[ -5.003286 -16.216772 -5.28262 ... -5.2270694
-1.7447093 -4.245701 ]
[ -5.595118 -15.978978 -4.214302 ... -5.4203877
-1.8398296 -4.396698 ]
...
[-13.178917 -13.012176 -10.450902 ... -15.064126
-1.9914117 -9.5184765 ]
[-10.992667 -8.671063 -6.456934 ... -14.054223
-1.4051182 -9.887496 ]
[ -9.728466 -10.335494 -7.3331285 ... -10.754501
-1.7173084 -4.671226 ]]
[[ -5.4983754 -15.449182 -5.7204423 ... -4.4113154
-1.0589103 -2.6990566 ]
[ -5.384841 -16.741693 -5.5674496 ... -5.684756
-1.8891927 -4.65452 ]
[ -5.7909193 -16.244637 -4.5293765 ... -6.4048567
-2.3706574 -4.982708 ]
...
[-10.004818 -11.296059 -7.158481 ... -10.9329
-2.0753372 -8.129092 ]
[ -7.942011 -8.787835 -2.8869028 ... -10.7461605
-1.7351687 -7.8243003 ]
[ -9.368582 -11.195904 -5.3443894 ... -8.967132
-1.5083878 -5.205722 ]]
[[ -7.6940765 -15.492795 -4.6488175 ... -5.7006836
-1.3711176 -3.7699785 ]
[ -5.243174 -15.9268875 -5.07713 ... -3.642994
-1.4748344 -4.1258245 ]
[ -4.8627806 -13.911514 -4.372596 ... -2.4015875
-1.4164882 -3.6560988 ]
...
[ -9.049875 -12.410313 -5.53057 ... -8.292001
-2.442209 -4.6609883 ]
[ -7.18582 -11.061987 -3.3339026 ... -7.413499
-2.0413182 -5.4470387 ]
[ -9.58725 -13.576278 -5.9882216 ... -8.204617
-2.0788593 -5.216848 ]]]]
-------------------------
--- TFLite output ---
[[[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
...
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]]]
---------------------
About this issue
- State: closed
- Created 6 years ago
- Reactions: 9
- Comments: 24 (6 by maintainers)
Finally I solved the problem by quantizing the deep model… probably this problem appears on devices with low processing power (my device is a Samsung A50). This is the code:
converter = tf.lite.TFLiteConverter.from_keras_model(self.deep_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS  # enable TensorFlow Lite ops.
]
tflite_model = converter.convert()
open("converted_model.tflite", "wb").write(tflite_model)
I faced the same problem too!
After days of investigation, I found the problem is caused by batch norm. The values of the feature maps increase significantly (around ×100, and sometimes ×1e+35) each time they pass through a batch norm layer (either slim.batch_norm or tf.nn.fused_batch_norm), eventually causing the values to become inf or nan (and only nan shows up in the final output).
I’m not sure whether this is specific to a particular TensorFlow version; it happens for me on both TF 1.11.0 and TF-nightly.
@sercant When freezing a model for inference, the “is_training” attribute of the BN layers should be set to false. In your frozen model, “is_training” is true, which makes the means/variances of the BN layers all 0s. Maybe you should regenerate the frozen model with “is_training=false” and then convert it to a tflite model.
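To illustrate the failure mode described above, here is a small NumPy sketch (not code from the repo; the function name and the layer count are illustrative) of inference-mode batch norm. With the moving variance stuck at its initial value of 0, every BN layer divides activations by √ε, so a deep stack of them inflates the values layer after layer until they leave the representable float32 range:

```python
import numpy as np

def bn_inference(x, moving_mean, moving_var, gamma=1.0, beta=0.0, eps=1e-3):
    """Inference-mode batch norm: normalize with the *stored* moving statistics."""
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

x = np.random.randn(4, 8).astype(np.float32)

# Healthy frozen graph: moving statistics were actually accumulated.
good = bn_inference(x, moving_mean=x.mean(), moving_var=x.var())

# Graph frozen with is_training=True: moving mean/var never leave their
# initial value of 0, so each BN layer scales activations by ~1/sqrt(eps) ≈ 31x.
bad = x.copy()
for _ in range(30):  # a stack of 30 BN layers
    bad = bn_inference(bad, moving_mean=0.0, moving_var=0.0)

print(np.abs(good).max())  # stays in a normal range
print(np.abs(bad).max())   # has exploded far beyond the float32 range
```

Once activations overflow like this, downstream ops (subtractions, multiplications by zero weights, softmax) readily turn inf into nan, which would match the all-nan TFLite output.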
@hubert0527 Thank you for pointing at batch normalization: when I removed it from the network my outputs became normal (not nan). But of course the quality of my network fell dramatically without batch normalization. I tried to replace tf.layers.batch_normalization with tf.keras.layers.BatchNormalization and tf.contrib.layers.batch_norm, but to no effect. Finally I solved the problem by implementing my own batch normalization like this:
Note that this is not a literal implementation of batch norm (the moving average is not used here), because only train mode was required for my project. Also note that we cannot use tf.nn.moments to calculate the mean and variance, because it is not supported by TFLite (so we need to implement our own function for the moments). After replacing batch normalization with the provided functions I was able to train my network, export it to tflite, and use it correctly during tflite inference.
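The custom batch-norm code itself is omitted from the comment above; a minimal NumPy sketch of the idea it describes (my own naming, training mode only, no moving averages, moments computed with plain reductions instead of tf.nn.moments) might look like:

```python
import numpy as np

def moments(x, axes):
    """Stand-in for tf.nn.moments built only from mean/subtract/square,
    ops that TFLite can handle."""
    mean = x.mean(axis=axes, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=axes, keepdims=True)
    return mean, var

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch norm: normalize with the *current batch*
    statistics (no moving averages, as in the comment above)."""
    mean, var = moments(x, axes=(0, 1, 2))  # NHWC: per-channel statistics
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(2, 8, 8, 16).astype(np.float32)
y = batch_norm_train(x,
                     gamma=np.ones(16, np.float32),
                     beta=np.zeros(16, np.float32))
# Each channel of y now has mean ~0 and variance ~1.
```

Reducing over axes (0, 1, 2) normalizes each channel of an NHWC tensor independently, which is the usual convention for batch norm after a convolution.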