keras: BatchNormalization Layer gives inconsistent output for model.predict()

I am currently on Keras 2.2.4 and Tensorflow 1.12.0. This issue was also observed on Keras 2.1.6 with TF 1.8.0.

So I have a UNet with batchnorm trained on my dataset. After training, I use the model to predict segmentation output for unseen images.

    soft_predictions = self.inference_model.predict(np.vstack(images))

Sometimes I pass multiple images at a time, but sometimes it could be just one image. I notice that the segmentation output of image A differs between two cases: (1) if I pass image A with other images; and (2) if I pass only image A.

With other images: test_image_generate_output

On its own: test_image_analyse

It might not be too obvious here, but the values of the pixels are different. Also, please excuse the performance of the network. It was trained only on a few images with a few iterations, but this is not a problem. The inconsistency is observed on well-trained networks too.
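Since the difference is hard to see by eye, one option is to compare the two prediction arrays numerically. The following is a minimal sketch, assuming both outputs for image A are available as NumPy arrays; `compare_predictions` and its `atol` tolerance are hypothetical names chosen for illustration, and the arrays are synthetic stand-ins rather than real model output.

```python
import numpy as np

def compare_predictions(pred_alone, pred_in_batch, atol=1e-6):
    """Summarize how far apart two predictions for the same image are.

    Hypothetical helper: `atol` is an arbitrary illustrative tolerance,
    not a value taken from the thread.
    """
    a = np.asarray(pred_alone, dtype=np.float64)
    b = np.asarray(pred_in_batch, dtype=np.float64)
    abs_diff = np.abs(a - b)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "num_pixels_differing": int((abs_diff > atol).sum()),
        "all_close": bool(np.allclose(a, b, rtol=0.0, atol=atol)),
    }

# Synthetic stand-ins for the two segmentation outputs of image A.
pred_alone = np.array([[0.10, 0.90], [0.40, 0.60]], dtype=np.float32)
pred_in_batch = pred_alone.copy()
pred_in_batch[0, 0] += 0.05  # pretend a single pixel changed

report = compare_predictions(pred_alone, pred_in_batch)
print(report)
```

In practice the two inputs would be the `predict()` result for image A alone and the matching slice of the batched result.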

Here are some other things that I have tried:

(1) Removing batch normalization from the network remedies this issue: the segmentation output is consistent between the two scenarios. So I think I can safely say that the source of the issue is the BatchNorm layer. However, not using BatchNorm is not an option.

(2) Setting layer.trainable = False for all layers in my inference_model, to no avail.

(3) Setting layer._per_input_updates = {} on all BatchNorm layers in inference_model, still to no avail.

(4) Setting training=False when calling the BatchNorm layers in inference_model makes the network give all 1.0 or 0.0 output.
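For context, the standard batch normalization formula behaves differently in its two modes, which is why a batch-dependent output points at BatchNorm first. This is a NumPy sketch of the textbook formula, not the Keras implementation; the function names, the `eps` value and the sample numbers are assumptions for illustration.

```python
import numpy as np

def bn_inference(x, moving_mean, moving_var, gamma=1.0, beta=0.0, eps=1e-3):
    # Inference mode: uses stored statistics, so each sample is
    # normalized independently of whatever else is in the batch.
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

def bn_training(x, gamma=1.0, beta=0.0, eps=1e-3):
    # Training mode: uses the statistics of the current batch, so the
    # output for one sample depends on the other samples.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

image_a = np.array([[1.0, 2.0, 3.0]])
others = np.array([[10.0, 20.0, 30.0], [-5.0, 0.0, 5.0]])
batch = np.vstack([image_a, others])

# Inference mode: image A alone vs. image A inside a batch.
alone_inf = bn_inference(image_a, moving_mean=0.5, moving_var=2.0)
batch_inf = bn_inference(batch, moving_mean=0.5, moving_var=2.0)[:1]
print(np.array_equal(alone_inf, batch_inf))      # True

# Training mode: the same comparison now depends on the batch.
alone_train = bn_training(image_a)
batch_train = bn_training(batch)[:1]
print(np.allclose(alone_train, batch_train))     # False
```

If the layer really runs in inference mode during `predict()`, the formula itself should not make image A's output depend on its batch mates, so any remaining difference would have to come from somewhere else.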

If anybody could give me an idea of how to solve this problem, it would be very much appreciated. This issue is really annoying because it makes evaluating the model and putting it into production very difficult.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 16
  • Comments: 30

Most upvoted comments

I believe I’ve figured it out, but no one will like the verdict, myself included: the problem is numeric precision, and BatchNormalization has nothing to do with it. It’s a work in progress, but I’ve narrowed down the exact operation responsible: np.dot - you can follow the work here. Will update later with all relevant testing code, and a solution. I’ve also made OP’s images into a gif for direct comparison:

(P.S. removing BatchNormalization in your tests may “solve” the problem, but it’s not BN that’s problematic, but how it transforms input tensor dimensionalities - will clarify later)
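To illustrate the kind of effect a numeric precision explanation relies on, here is a minimal sketch showing that float32 addition is not associative. It does not reproduce the `np.dot` behavior described above; it only shows that the order of additions can change a float32 result, which is one way a reduction could give different answers if a backend sums in a different order for a different input shape. The values are chosen purely for illustration.

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# Same three numbers, two different summation orders.
left_to_right = (a + b) + c   # (1e8 - 1e8) + 1
right_to_left = a + (b + c)   # the 1 is lost when added to -1e8 in float32

print(left_to_right, right_to_left)      # 1.0 0.0
print(left_to_right == right_to_left)    # False
```

Near 1e8 the spacing between adjacent float32 values is 8, so adding 1 to -1e8 rounds back to -1e8.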

I’ve run into this issue as well.

Re-using an identical sample for training and validation, the model gets 100% accuracy while training but validation accuracy oscillates around 60%. Setting momentum to 0.0001 so that the “moving mean/variance” == “last batch mean/variance” did not fix the validation accuracy, so BatchNorm must be doing something else which is modifying the data at Validation time.
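The reasoning about momentum can be sketched with the usual exponential moving average update for BatchNorm statistics. This is a plain-Python illustration under the assumption that the update has the form `moving = momentum * moving + (1 - momentum) * batch_stat`; `update_moving_stat` is a hypothetical helper, and 0.99 is used as the Keras default momentum.

```python
def update_moving_stat(moving, batch_stat, momentum):
    # Exponential moving average: a high momentum keeps the old value,
    # a momentum near zero tracks the most recent batch statistic.
    return momentum * moving + (1.0 - momentum) * batch_stat

moving_mean = 0.0
batch_mean = 5.0

default_like = update_moving_stat(moving_mean, batch_mean, momentum=0.99)
near_zero = update_moving_stat(moving_mean, batch_mean, momentum=0.0001)

print(default_like)  # close to 0.05: barely moved toward the batch mean
print(near_zero)     # close to 4.9995: almost equal to the batch mean
```

With momentum at 0.0001 a single update already puts the moving mean within 0.01% of the last batch mean, which matches the expectation stated above.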

I’m using a TimeDistributed layer which might make my case unique, but here’s the issue anyway with a reproducible example just in case: https://github.com/tensorflow/tensorflow/issues/30109

I’m using tensorflow-gpu 1.13, Keras 2.2.4

All,

I have created a very small test fixture which reproduces the error. Can you verify you get the bug? @twinanda, @LukeBolly

    import numpy as np
    import tensorflow as tf

    # Minimal model: BatchNormalization -> Flatten -> Dense(softmax)
    inputs = tf.keras.layers.Input(shape=(128, 128, 1))
    h = tf.keras.layers.BatchNormalization()(inputs)
    h = tf.keras.layers.Flatten()(h)
    outputs = tf.keras.layers.Dense(5, activation='softmax')(h)

    model = tf.keras.models.Model(inputs, outputs)

    batchSize = 32
    data = np.random.rand(batchSize, 128, 128, 1)

    # Predict on the first k samples and print the output for sample 0 each time
    print('The following rows should all be equal...')
    for k in range(1, batchSize):
        y = model.predict(data[0:k, :, :, :])
        print(y[0, :])

Yes, I can reproduce the bug with this code. The rows are not equal. I have also commented on your original thread on TF. Thank you.

I trained the model with a batch_size of 20.

I have written a simple example to replicate the issue (batch_size of 40); the full code is attached in BN2.zip (I am using Keras 2.2.0).

I tested with fcnn, a UNet-like architecture with BatchNorm, and with fcnn_no_batch_normalization, which is the same network without BatchNorm.

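    import numpy as np
    from numpy.random import randint
    from keras.optimizers import SGD

    # fcnn, fcnn_no_batch_normalization and accu are not defined in this
    # excerpt (the full code is in the attached BN2.zip)
    model = fcnn(47,47,47,2)
    #model = fcnn_no_batch_normalization(47, 47, 47, 2)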

    model.summary(line_length=113)

    sgd = SGD(lr=0.01, decay=0, momentum=0.85, nesterov=False)
    model.compile(optimizer=sgd,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'],
                  sample_weight_mode='temporal')

    # dummy data fcnn
    n_samples = 1000
    out_size = 43

    # randomly generated inputs
    imgs_train = np.float32(randint(0, 1000, (n_samples, 47, 47, 47, 1)))

    # get output as being the pixels with intensity above 800 in a noisy version of imgs_train
    msks_train  = np.zeros((n_samples, out_size**3, 2))
    imgs_train2 = imgs_train + randint(0, 500, (n_samples, 47, 47, 47, 1)) # imgs_train + noise

    crop = (2,2)
    imgs_train2_crop  = (imgs_train2[:,crop[0]:-crop[1],crop[0]:-crop[1],crop[0]:-crop[1],0] > 800)
    msks_train[...,1] = imgs_train2_crop.reshape((n_samples, out_size**3))
    msks_train[...,0] = 1-imgs_train2_crop.reshape((n_samples, out_size**3))

    model.fit(imgs_train,
              msks_train,
              epochs=5,
              batch_size=40,
              verbose=True,
              shuffle=True)

    # predict accuracy for different batch sizes
    batchSizes = [1,5,32,53,98]
    for i in batchSizes:
        print('batch size :', i, 'accuracy :', accu(msks_train, model.predict(imgs_train, batch_size=i)))

The output with fcnn was

batch size : 1 accuracy : 0.8674336976618411
batch size : 5 accuracy : 0.86743371023935
batch size : 32 accuracy : 0.8674336976618411
batch size : 53 accuracy : 0.86743371023935
batch size : 98 accuracy : 0.86743371023935

and the output with fcnn_no_batch_normalization was:

batch size : 1 accuracy : 0.4484741343529501
batch size : 5 accuracy : 0.4484741343529501
batch size : 32 accuracy : 0.4484741343529501
batch size : 53 accuracy : 0.4484741343529501
batch size : 98 accuracy : 0.4484741343529501

In this code the differences are small, but I have a more complex network where the differences are larger (0.1 - 0.5 in accuracy on similar dummy data).

If anyone could help me out that would be great!

@twinanda You’re welcome. On cats - sounds like overfitting, contrary to intuition; the test set isn’t a universal benchmark of your neural net. Part of the ability to generalize is robustness to noise - which is often omitted from explicit testing. It’s what underlies “adversarial attacks” - in the extreme example, the “one-pixel attack”, where a single pixel in an image is (intelligently) manipulated to trick the (well-trained) NN into thinking that a cat is a motorboat.

… or it’s a bug. Can’t tell much without model code and at least dataset info (shapes, quality, noise, etc). If you’d like, I can have a look if you post a minimally-reproducible example on StackOverflow.

In the meantime, I have a request specific to you, which does sort of “leak” the crux of my investigation: right after importing backend as K, run K.set_floatx('float64'), and rerun the exact code used to generate your original greyscale image. Let me know if the ultimate difference is nearly as dramatic.
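As background for why switching to float64 might shrink the difference, here is a small NumPy sketch that accumulates the same value in float32 and in float64. It does not call Keras or `K.set_floatx`; `accumulate` and the chosen step and count are assumptions for illustration only.

```python
import numpy as np

def accumulate(dtype, n=1_000_000, step=0.1):
    # Add `step` to a running total n times, keeping the total in `dtype`.
    total = dtype(0.0)
    inc = dtype(step)
    for _ in range(n):
        total = total + inc
    return float(total)

exact = 100_000.0
err32 = abs(accumulate(np.float32) - exact)
err64 = abs(accumulate(np.float64) - exact)

print("float32 error:", err32)
print("float64 error:", err64)
```

The float32 total drifts far from 100000 while the float64 total stays very close to it, so a discrepancy that largely disappears under float64 would be consistent with a precision cause.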

Still no updates from my side. This issue is very pronounced when you are doing segmentation on an image in which the interesting structure is only a small part of the image. It seems like it is trying to normalize the output, which results in a high false positive rate.

I’m having the same issue, along with https://github.com/tensorflow/tensorboard/issues/1514, which seems to be a Keras bug rather than a TF one. Is there any Keras version where this works? I need it ASAP.