keras: BatchNormalization Layer gives inconsistent output for model.predict()
I am currently on Keras 2.2.4 and Tensorflow 1.12.0. This issue was also observed on Keras 2.1.6 with TF 1.8.0.
So I have a UNet with batchnorm trained on my dataset. Once training is done, I use the model to predict segmentation output from unseen images.
soft_predictions = self.inference_model.predict(np.vstack(images))
Sometimes I pass multiple images at a time, but sometimes it could be just one image. I notice that the segmentation output of image A differs between two cases: (1) if I pass image A with other images; and (2) if I pass only image A.
With other images:
On its own:
It might not be too obvious here, but the pixel values are different. Also, please excuse the performance of the network: it was trained on only a few images for a few iterations, but that is not the problem here. The inconsistency is observed on well-trained networks too.
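For concreteness, the comparison looks roughly like the sketch below (a minimal sketch, not my exact code; inference_model is the trained UNet, and image_a and other_images are placeholder names for preprocessed arrays of shape (1, H, W, C)):

import numpy as np

# Case (1): image A predicted together with other images.
pred_in_batch = inference_model.predict(np.vstack([image_a] + other_images))[0]
# Case (2): image A predicted on its own.
pred_alone = inference_model.predict(image_a)[0]

# In my case this prints a non-zero difference.
print(np.abs(pred_in_batch - pred_alone).max())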
Here are some other things that I have experimented on:
(1) Removing batch normalization from the network remedies this issue: the segmentation output is consistent in both scenarios. So I think I can safely say that the source of the issue is the BatchNorm layer. However, not using BatchNorm is not an option.
(2) I have also tried setting layer.trainable = False for all layers in my inference_model, to no avail.
(3) I also tried setting layer._per_input_updates = {} on all BatchNorm layers in inference_model, still to no avail.
(4) Setting training=False when calling the BatchNorm layers in inference_model makes the network output all 1.0 or 0.0 values.
A rough sketch of attempts (2)-(4) is shown below.
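This is a minimal sketch of those attempts, assuming the Keras 2.2.x API and a trained functional model inference_model; the rebuild in (4) only applies cleanly to a linear stack of layers (a UNet with skip connections would need its graph rebuilt explicitly):

from keras.layers import BatchNormalization
from keras.models import Model

# (2) Freeze every layer.
for layer in inference_model.layers:
    layer.trainable = False

# (3) Drop the moving-statistics update ops from the BatchNorm layers
# (relies on the private attribute _per_input_updates).
for layer in inference_model.layers:
    if isinstance(layer, BatchNormalization):
        layer._per_input_updates = {}

# (4) Rebuild the graph, calling each BatchNorm layer with training=False
# so it uses the stored moving statistics.
inp = inference_model.input
out = inp
for layer in inference_model.layers[1:]:
    if isinstance(layer, BatchNormalization):
        out = layer(out, training=False)
    else:
        out = layer(out)
frozen_bn_model = Model(inp, out)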
If anybody could give me an idea of how to solve this problem, it would be very much appreciated. This issue is really annoying because it makes evaluation and putting the model into production very difficult.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 16
- Comments: 30
I believe I’ve figured it out, but no one will like the verdict, myself included: the problem is numeric precision, and BatchNormalization has nothing to do with it. It’s a work in progress, but I’ve narrowed down the exact operation responsible: np.dot - you can follow the work here. Will update later with all relevant testing code, and a solution. I’ve also made OP’s images into a gif for direct comparison.
(P.S. removing BatchNormalization in your tests may “solve” the problem, but it’s not BN that’s problematic, but how it transforms input tensor dimensionalities - will clarify later)
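Here is a standalone sketch of the kind of float32 effect I mean (not the code from the linked investigation; whether a non-zero difference shows up depends on the BLAS build):

import numpy as np

# The same row dotted with the same weights can give slightly different
# results depending on whether it is processed alone or inside a larger
# batch, because the float32 accumulation path differs.
rng = np.random.RandomState(0)
W = rng.randn(512, 512).astype('float32')
x = rng.randn(1, 512).astype('float32')
batch = np.vstack([x, rng.randn(31, 512).astype('float32')])

alone = np.dot(x, W)                # row on its own
in_batch = np.dot(batch, W)[:1]     # same row inside a batch of 32

print('float32 max diff:', np.abs(alone - in_batch).max())

# Repeating the computation in float64 typically shrinks the discrepancy
# by orders of magnitude.
W64, x64, batch64 = W.astype('float64'), x.astype('float64'), batch.astype('float64')
print('float64 max diff:', np.abs(np.dot(x64, W64) - np.dot(batch64, W64)[:1]).max())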
I’ve run into this issue as well. Re-using an identical sample for training and validation, the model gets 100% accuracy while training, but validation accuracy oscillates around 60%. Setting momentum to 0.0001, so that the moving mean/variance equals the last batch’s mean/variance, did not fix the validation accuracy, so BatchNorm must be doing something else that modifies the data at validation time.
I’m using a TimeDistributed layer which might make my case unique, but here’s the issue anyway with a reproducible example just in case: https://github.com/tensorflow/tensorflow/issues/30109
I’m using tensorflow-gpu 1.13, Keras 2.2.4
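For illustration, a toy model along the lines of my setup (layer sizes and shapes here are made up, not taken from the linked reproduction):

from keras.layers import Input, TimeDistributed, Dense, BatchNormalization
from keras.models import Model

inp = Input(shape=(10, 64))                            # (timesteps, features)
x = TimeDistributed(Dense(32, activation='relu'))(inp)
x = BatchNormalization(momentum=0.0001)(x)             # moving stats ~= last batch stats
out = TimeDistributed(Dense(1, activation='sigmoid'))(x)

model = Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])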
Yes, I can reproduce the bug with this code. The rows are not equal. I have also commented on your original thread on TF. Thank you.
I trained the model with a batch_size of 20.
I wrote a simple example to replicate the issue (batch_size of 40); the full code is attached in BN2.zip (I am using Keras 2.2.0).
I tested with fcnn, a UNET-like architecture with BatchNorm, and fcnn_no_batch_normalization, which is the same network without BatchNorm.
The output with fcnn was
and the output with fcnn_no_batch_normalization was:
In this code the differences are small, but I have a more complex network where the differences are larger (0.1 - 0.5 in accuracy on similar dummy data).
If anyone could help me out that would be great!
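For reference, a rough helper in the spirit of that check, where model and x are placeholders for a trained network (fcnn or fcnn_no_batch_normalization) and its input array; this is a sketch, not the exact code in BN2.zip:

import numpy as np

# Run the same inputs through model.predict with different batch sizes and
# report the largest disagreement between the two sets of predictions.
def max_batch_size_discrepancy(model, x, sizes=(1, 40)):
    preds = [model.predict(x, batch_size=s) for s in sizes]
    diff = np.abs(preds[0] - preds[1]).max()
    print('max |pred(bs=%d) - pred(bs=%d)| = %g' % (sizes[0], sizes[1], diff))
    return diff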
@twinanda You’re welcome. On cats - it sounds like overfitting, contrary to intuition; the test set isn’t a universal benchmark of your neural net. Part of the ability to generalize includes robustness to noise, which is often omitted from explicit testing. This is what underlies “adversarial attacks” - in the extreme example, the “one-pixel attack”, a single pixel in an image is (intelligently) manipulated to trick a well-trained NN into thinking that a cat is a motorboat.
… or it’s a bug. Can’t tell much without model code and at least dataset info (shapes, quality, noise, etc). If you’d like, I can have a look if you post a minimally-reproducible example on StackOverflow.
In the meantime, I have a request specific to you, which does sort of “leak” the crux of my investigation: right after importing the backend as K, run K.set_floatx('float64'), and rerun the exact code used to generate your original greyscale image. Let me know if the ultimate difference is nearly as dramatic.
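Concretely, something along these lines (a minimal sketch for the Keras 2.2.x backend API; the call must happen before the model is built so that weights and activations are created in float64):

from keras import backend as K

K.set_floatx('float64')

# ...then rebuild/reload the model and rerun the exact prediction code that
# produced the original greyscale comparison images.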
All, I have created a very small test fixture which reproduces the error. Can you verify that you get the bug? @twinanda, @LukeBolly
Still no updates from my side. This issue is very pronounced when you are doing segmentation on an image in which the structure of interest is only a small part of the image. It seems as if the network is trying to normalize the output, which results in a high false-positive rate.
I’m having the same issue, along with https://github.com/tensorflow/tensorboard/issues/1514, which seems to be a Keras bug rather than a TF one. Is there any Keras version where this works? I need it ASAP.