keras: Evaluate_generator produces wrong accuracy scores?

Hello, I am running a slightly modified version of the Keras fine-tuning example which only fine-tunes the top layers (Keras 2.0.3/TensorFlow on Ubuntu with a GPU). It looks like the following:

from keras import applications, optimizers
from keras.layers import Dense, Flatten
from keras.models import Model, Sequential
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 150, 150
train_data_dir = 'data/train_s'
validation_data_dir = 'data/val_s'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 10
batch_size = 16

base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(img_width, img_height, 3))

top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dense(1, activation='sigmoid'))

model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
model.compile(loss='binary_crossentropy', optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    verbose=2, workers=12)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

correct = 0
for i, n in enumerate(validation_generator.filenames):
    if n.startswith("cats") and scores[i][0] <= 0.5:
        correct += 1
    if n.startswith("dogs") and scores[i][0] > 0.5:
        correct += 1

print("Correct:", correct, " Total: ", len(validation_generator.filenames))
print("Loss: ", score[0], "Accuracy: ", score[1])

With this, I get unreliable validation accuracy results. For example, predict_generator predicts 640 out of 800 (80%) classes correctly, whereas evaluate_generator reports an accuracy of 95%. Someone in #3477 suggests removing the rescale=1. / 255 parameter from the validation generator; with that, I get 365/800 = 45% from the manual count and 89% from evaluate_generator.

Is there something wrong with my evaluation, or is this due to a bug? There are many similar issues (e.g. #3849, #6245) where the reported accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced shed some light on this problem? Thanks.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 19
  • Comments: 34 (3 by maintainers)

Most upvoted comments

In validation_generator = test_datagen.flow_from_directory(...), shuffle=True is set by default. I found that predict_generator(validation_generator) then returns predictions in shuffled order, while validation_generator.classes and validation_generator.filenames are not shuffled accordingly, so an accuracy computed from the predict_generator output can be wrong.
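
For illustration, here is a minimal sketch of that fix, reusing the names from the question (test_datagen, validation_data_dir, img_height/img_width, batch_size, model); the accuracy computation at the end is an assumption about how one would check the predictions against .classes:

import numpy as np

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # keep file order fixed so it matches .classes/.filenames

steps = int(np.ceil(validation_generator.n / float(batch_size)))
probs = model.predict_generator(validation_generator, steps)

# with shuffle=False, .classes is in the same order as the predictions
preds = (probs[:, 0] > 0.5).astype(int)
print("Manual accuracy:", np.mean(preds == validation_generator.classes))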

I encountered this problem today, and I found a solution:

  1. You must set shuffle=False in your generator.
  2. You need to reset your generator before calling predict_generator(). For example:
valid_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory="../images/train/",
    x_col="id",
    y_col="label_2",
    subset="validation",
    batch_size=batch_size,
    seed=42,
    shuffle=False,
    class_mode="categorical",
    classes=classes,
    target_size=(input_shape, input_shape))
step_size_valid = int(np.ceil(valid_generator.n / valid_generator.batch_size))
model.evaluate_generator(generator=valid_generator, steps=step_size_valid)
...
valid_generator.reset()
model.predict_generator(valid_generator, step_size_valid)

This issue is reproduced regularly when using fit_generator / evaluate_generator, and it seems pretty critical, since it makes the fit_generator output during training completely useless.

@skoch9 try setting pickle_safe=True. As @joeyearsley mentions, for me it had to do with workers > 1. I am actually running a custom version of Keras where I make the evaluate_generator call inside fit_generator use workers=1. That way I can train with multiple workers but predict/evaluate with a single worker.

@fchollet Please make evaluate_generator and predict_generator always use workers=1, or remove the parameter until it is fixed.

Make sure:

  1. shuffle=False
  2. pickle_safe=True
  3. workers=1

Let me know if that gives you consistent results.
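
For reference, a minimal sketch of that checklist applied to the question's code (generator and model names reused from above; in Keras 2.0.x the keyword is pickle_safe, later renamed use_multiprocessing):

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)                # 1. no shuffling

score = model.evaluate_generator(
    validation_generator,
    nb_validation_samples // batch_size,
    pickle_safe=True,             # 2. process-based workers
    workers=1)                    # 3. single worker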

Hi,

I encountered the same issue recently, and actually the solution is quite simple. You use validation_generator two times in a row, and I imagine your number of samples isn't exactly divisible by your batch size. Hence, your generator has a shift in its indices after you use it in model.evaluate_generator, so when you call it again, it won't yield the samples in the order you expect.

So you should create a second generator to use in model.predict_generator, or only evaluate your model via either evaluate_generator or predict_generator:

validation_generator2 = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator2, nb_validation_samples/batch_size, workers=12)
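
Alternatively (as in the flow_from_dataframe comment above), the same shared generator can be rewound between the two calls; a sketch, assuming a single worker is used so no extra batches are prefetched:

score = model.evaluate_generator(validation_generator,
                                 nb_validation_samples // batch_size)

validation_generator.reset()   # rewind to the first batch before predicting
scores = model.predict_generator(validation_generator,
                                 nb_validation_samples // batch_size)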

You need to add shuffle=False to flow_from_dataframe.

Same issue. Any simple solution from the official authors yet?

Same issue. Totally confused by the answers above. Any simple solution?

I had a similar problem using fit_generator with multiprocessing under Linux: during training the loss was falling rapidly, with implausibly high accuracies. However, these could in no way be reproduced when I tested the model on the same data. Even more strangely, when I turned off multiprocessing, accuracies were suddenly realistic again. It turns out the problem was a combination of OS behavior and my data generator, which was internally doing some shuffling using np.random. Since Linux uses fork(2) to spawn child processes and the initialization of the data generator was happening outside of the multiprocessing part, all workers were using the same seed and were generating identical batches. Note that this wasn't a problem under Windows, since there each child process is spun up independently [1]. The resolution was to seed np.random in __getitem__(self, idx).

Maybe this saves time for some of you.

[1] http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
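
To make that concrete, here is a minimal sketch of the kind of generator being described, assuming a keras.utils.Sequence-style generator (available in newer Keras 2 releases); the class name, data handling, and seeding scheme are illustrative, not taken from the comment:

import os
import numpy as np
from keras.utils import Sequence

class RandomBatchSequence(Sequence):  # illustrative, not the commenter's class
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Re-seed here: forked workers would otherwise inherit identical
        # np.random state and all produce the same "random" batches.
        np.random.seed(os.getpid() + idx)
        indices = np.random.permutation(len(self.x))[:self.batch_size]
        return self.x[indices], self.y[indices]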

I too see a big difference between the reported fit_generator results and the later evaluate_generator results.

I’ve looked into this a bit and found the following results:

  1. When I use evaluate_generator with a generator that does not shuffle the data, I get results that are very different from those reported by fit_generator. But when I use evaluate_generator with a generator that does shuffle the data, I get results that are similar to those reported by fit_generator.

  2. When I use evaluate (without any generators), the output is exactly the same as evaluate_generator without shuffling.

  3. When I use model.predict and compute the measurements manually, I get the same measurements reported by fit_generator (and the same results as evaluate_generator with shuffling).

Can anyone verify that any of the above happens to them as well?

A note: I use a very simple model, just one dense layer; no dropout or batch normalization layers to create any doubts, as mentioned by @jeremydr2.

To follow up, I get the correct result only if I set max_q_size=1.

If I only set workers=1, it does not work either (and gives different results each time).
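
A sketch of that workaround applied to the question's call (parameter names as in Keras 2.0.x, where max_q_size was later renamed max_queue_size):

# Limit the internal queue so extra batches aren't prefetched past the
# requested number of steps.
score = model.evaluate_generator(
    validation_generator,
    nb_validation_samples // batch_size,
    max_q_size=1,
    workers=1)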

Sure, but it doesn’t make sense to allow invalid parameter configurations. And I would consider it also problematic (->a bug) that running evaluate_generator before predict_generator changes the prediction results.