keras: Evaluate_generator produces wrong accuracy scores?

Hello, I am running a slightly modified version of the Keras fine-tuning example which only fine-tunes the top layers (Keras 2.0.3/TensorFlow on Ubuntu with a GPU). It looks like the following:

from keras import applications, optimizers
from keras.layers import Dense, Flatten
from keras.models import Model, Sequential
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 150, 150
train_data_dir = 'data/train_s'
validation_data_dir = 'data/val_s'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 10
batch_size = 16

base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(img_width, img_height, 3))

top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dense(1, activation='sigmoid'))

model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
model.compile(loss='binary_crossentropy', optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    verbose=2, workers=12)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

correct = 0
for i, n in enumerate(validation_generator.filenames):
    if n.startswith("cats") and scores[i][0] <= 0.5:
        correct += 1
    if n.startswith("dogs") and scores[i][0] > 0.5:
        correct += 1

print("Correct:", correct, " Total: ", len(validation_generator.filenames))
print("Loss: ", score[0], "Accuracy: ", score[1])

With this, I get unreliable validation accuracy results. For example, predict_generator predicts 640 out of 800 (80%) classes correctly, whereas evaluate_generator reports an accuracy of 95%. Someone in #3477 suggests removing the rescale=1. / 255 parameter from the validation generator; with that, I get 365/800 = 45% from the manual count and 89% from evaluate_generator.

Is there something wrong with my evaluation, or is this due to a bug? There are many similar issues (e.g. #3849, #6245) where the reported accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced shed some light on this problem? Thanks.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 19
  • Comments: 34 (3 by maintainers)

Most upvoted comments

In validation_generator = test_datagen.flow_from_directory(...), shuffle=True is set by default. I found that predict_generator(validation_generator) then returns predictions in shuffled order, while validation_generator.classes and validation_generator.filenames are not shuffled accordingly, so an accuracy computed from the predict_generator output can be wrong.
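
For illustration, here is a minimal sketch of that fix, reusing the names from the question (test_datagen, validation_data_dir, img_height/img_width, batch_size, model); the accuracy computation at the end is an assumption about how one would check the predictions against .classes:

import numpy as np

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # keep file order fixed so it matches .classes/.filenames

steps = int(np.ceil(validation_generator.n / float(batch_size)))
probs = model.predict_generator(validation_generator, steps)

# with shuffle=False, .classes is in the same order as the predictions
preds = (probs[:, 0] > 0.5).astype(int)
print("Manual accuracy:", np.mean(preds == validation_generator.classes))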

I encountered this problem today, and I found a solution:

  1. You must set shuffle=False in your generator.
  2. You need to reset your generator before calling predict_generator(). For example:
valid_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory="../images/train/",
    x_col="id",
    y_col="label_2",
    subset="validation",
    batch_size=batch_size,
    seed=42,
    shuffle=False,
    class_mode="categorical",
    classes=classes,
    target_size=(input_shape, input_shape))
step_size_valid = int(np.ceil(valid_generator.n / valid_generator.batch_size))
model.evaluate_generator(generator=valid_generator, steps=step_size_valid)
...
valid_generator.reset()
model.predict_generator(valid_generator, step_size_valid)

This issue is reproduced regularly when using fit_generator / evaluate_generator, and it seems pretty critical, since it makes the fit_generator output during training completely useless.

@skoch9 try setting pickle_safe=True. As @joeyearsley mentions, for me it had to do with workers > 1. I am actually running a custom version of Keras where I make the evaluate_generator call inside fit_generator use workers=1. That way I can train with multiple workers but predict/evaluate with a single worker.

@fchollet Please make evaluate_generator and predict_generator always use workers=1, or remove the parameter until it is fixed.

Make sure:

  1. shuffle=False
  2. pickle_safe=True
  3. workers=1

Let me know if that gives you consistent results.
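
For reference, a minimal sketch of that checklist applied to the question's code (generator and model names reused from above; in Keras 2.0.x the keyword is pickle_safe, later renamed use_multiprocessing):

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)                # 1. no shuffling

score = model.evaluate_generator(
    validation_generator,
    nb_validation_samples // batch_size,
    pickle_safe=True,             # 2. process-based workers
    workers=1)                    # 3. single worker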

Hi,

I encountered the same issue recently, and actually the solution is quite simple. You use validation_generator two times in a row, and I imagine your number of samples isn't exactly divisible by your batch size. Hence, your generator has a shift in its indices after you use it in model.evaluate_generator, so when you call it again, it won't yield the samples in the order you expect.

So you should create a second generator to use in model.predict_generator, or only evaluate your model via either evaluate_generator or predict_generator:

validation_generator2 = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator2, nb_validation_samples/batch_size, workers=12)
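
Alternatively (as in the flow_from_dataframe comment above), the same shared generator can be rewound between the two calls; a sketch, assuming a single worker is used so no extra batches are prefetched:

score = model.evaluate_generator(validation_generator,
                                 nb_validation_samples // batch_size)

validation_generator.reset()   # rewind to the first batch before predicting
scores = model.predict_generator(validation_generator,
                                 nb_validation_samples // batch_size)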

You need to add shuffle=False to flow_from_dataframe.

Same issue. Any simple solution from the official authors yet?

Same issue. Totally confused by the answers above. Any simple solution?

I had a similar problem using fit_generator with multiprocessing under Linux: during training the loss was falling rapidly, with implausibly high accuracies. However, these could in no way be reproduced when I tested the model on the same data. Even more strangely, when I turned off multiprocessing, accuracies were suddenly realistic again. It turns out the problem was a combination of OS behavior and my data generator, which was internally doing some shuffling using np.random. Since Linux uses fork(2) to spawn child processes and the initialization of the data generator was happening outside of the multiprocessing part, all workers were using the same seed and were generating identical batches. Note that this wasn't a problem under Windows, since there each child process is spun up independently [1]. The resolution was to seed np.random in __getitem__(self, idx).

Maybe this saves time for some of you.

[1] http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
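
To make that concrete, here is a minimal sketch of the kind of generator being described, assuming a keras.utils.Sequence-style generator (available in newer Keras 2 releases); the class name, data handling, and seeding scheme are illustrative, not taken from the comment:

import os
import numpy as np
from keras.utils import Sequence

class RandomBatchSequence(Sequence):  # illustrative, not the commenter's class
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Re-seed here: forked workers would otherwise inherit identical
        # np.random state and all produce the same "random" batches.
        np.random.seed(os.getpid() + idx)
        indices = np.random.permutation(len(self.x))[:self.batch_size]
        return self.x[indices], self.y[indices]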

I too see a big difference between the reported fit_generator results and the later evaluate_generator results.

I’ve looked into this a bit and found the following results:

  1. When I use evaluate_generator with a generator that does not shuffle the data, I get results that are very different from those reported by fit_generator. But when I use evaluate_generator with a generator that does shuffle the data, I get results that are similar to those reported by fit_generator.

  2. When I use evaluate (without any generators), the output is exactly the same as evaluate_generator without shuffling.

  3. When I use model.predict and compute the measurements manually, I get the same measurements reported by fit_generator (and the same results as evaluate_generator with shuffling).

Can anyone verify that any of the above happens to them as well?

A note: I use a very simple model, just one dense layer; no dropout or batch normalization layers to create any doubts, as mentioned by @jeremydr2.

To follow up, I get the correct result only if I set max_q_size=1.

If I only set workers=1, it does not work either (and gives different results each time).
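
A sketch of that workaround applied to the question's call (parameter names as in Keras 2.0.x, where max_q_size was later renamed max_queue_size):

# Limit the internal queue so extra batches aren't prefetched past the
# requested number of steps.
score = model.evaluate_generator(
    validation_generator,
    nb_validation_samples // batch_size,
    max_q_size=1,
    workers=1)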

Sure, but it doesn’t make sense to allow invalid parameter configurations. And I would consider it also problematic (->a bug) that running evaluate_generator before predict_generator changes the prediction results.