tensorflow: UnboundLocalError: local variable 'logs' referenced before assignment on training with little data

I found an error caused by an attempt to copy training logs from a variable that has not yet been assigned. The error occurred on my machine (Arch Linux, TensorFlow v2.2rc2 compiled from source) and I managed to reproduce it on Colab, on a stock environment. It only happens when the model.fit method is called with very little training / eval data. The logs variable is assigned inside a for loop that never runs when there is not enough data.
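For illustration, the shape of the failure is the classic Python pattern below (a simplified analogue of what fit does, not the actual training.py code):

def fit_like(steps):
    for step in steps:          # with an empty dataset this loop body never runs
        logs = {"loss": 0.0}    # 'logs' is only ever assigned inside the loop
    return logs                 # UnboundLocalError when 'steps' is empty

fit_like([])  # raises UnboundLocalError: local variable 'logs' referenced before assignment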

The code lives here: https://github.com/tensorflow/tensorflow/blob/e6e5d6df2ab26620548f35bf2e652b19f6d06652/tensorflow/python/keras/engine/training.py#L793

The notebook gist link for reproducing the bug: https://colab.research.google.com/gist/naripok/8ce09ec9c3e795b3635a6b1ac11ebd4b/tpu_transformer_model.ipynb

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 18
  • Comments: 38 (3 by maintainers)

Most upvoted comments

@Tuxius and others who are stuck until this bug is fixed: I had the same issue and found that my validation dataset had fewer samples than the batch_size. Because I'm working with TFRecord datasets, which carry no metadata about how many records (i.e. samples) they contain, I now check upfront whether the dataset contains at least batch_size records (samples).

For that I use the following helper functions:

import tensorflow as tf

def doesDataSetContainsEnoughDataForBatch(dataset, batch_size):
    # Note: this consumes up to batch_size samples from the dataset.
    return len(list(dataset.take(batch_size).as_numpy_iterator())) == batch_size

def doesDataSetFileContainsEnoughDataForBatch(sampleFileName="", batch_size=100):
    dataset = tf.data.TFRecordDataset(sampleFileName)
    return doesDataSetContainsEnoughDataForBatch(dataset, batch_size=batch_size)

if __name__ == '__main__':
    dataSetFileName = "./Samples/validation_data_123.tfrecord"
    if not doesDataSetFileContainsEnoughDataForBatch(dataSetFileName, batch_size=100):
        raise Exception(f"Data set file {dataSetFileName} doesn't contain enough data")

    # Now open the data set a second time knowing you have enough data
    # and use it ....
    """
    trainDataset = tf.data.TFRecordDataset(dataSetFileName)

    model.fit ( trainDataset ...
    """

Please keep in mind that by using the first helper function directly, i.e. doesDataSetContainsEnoughDataForBatch, you already read batch_size samples from the dataset, so you should recreate the dataset after the check.

By using the second helper function you just lose some execution time upfront.

If you don't use TFRecord datasets you might have a similar issue; in that case it is also a good idea to check upfront whether you have enough data for at least one batch.
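For in-memory data, a check along these lines should do (a sketch; the arrays, shapes and batch_size are made up for illustration):

import numpy as np

batch_size = 32

def has_enough_samples(x, batch_size):
    # True if the array can fill at least one full batch
    return len(x) >= batch_size

x_train = np.random.rand(100, 10)  # hypothetical training inputs
x_val = np.random.rand(8, 10)      # hypothetical validation inputs

for name, arr in [("train", x_train), ("validation", x_val)]:
    if not has_enough_samples(arr, batch_size):
        raise ValueError(f"{name} set has {len(arr)} samples, need at least {batch_size}")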

This is still reproducible for me on 2.2.

Check the number of your training samples. It might be zero.

Same problem: is there a way to fix this, or are there alternatives?

I have the same issue with TF v2.2.0-rc2

Epoch 19/20
18/18 [==============================] - 2s 125ms/step - loss: 0.0035 - accuracy: 1.0000
Epoch 20/20
18/18 [==============================] - 2s 126ms/step - loss: 0.0373 - accuracy: 0.9844

Traceback (most recent call last):
  File "run.py", line 155, in <module>
    main()
  File "run.py", line 120, in main
    accuracy, num_of_classes = train_Full_visible(unique_name)
  File "run.py", line 78, in train_Full_visible
    acc = neuro.train(picdb, train_ids, test_ids, "Full body visible")
  File "/ssd/200410 3rd Try/neuro.py", line 232, in train
    test_loss, test_acc = self.model.evaluate(test_generator, verbose=0)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/frank/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1028, in evaluate
    logs = tf_utils.to_numpy_or_python_type(logs)
UnboundLocalError: local variable 'logs' referenced before assignment

I solved my particular issue by setting the batch_size very small, in my case batch_size = 1. I can confirm that this happens with small datasets.

Are there any other workarounds known yet? The batch size one did not work out for me.

This is reproducible in TF 2.3. I am using a batch_size of 1 and no validation data. By the way, it works when I run my custom model in eager mode with run_eagerly=True.

Same problem. The root cause is that training cannot perform a single step because the dataset is not large enough to fill one training iteration.

Anyway, there is still a bug here, this is not the expected behavior.

Looks like this happens when the size of the training data is <= the batch size.

Encountered this error, myself, and thought I’d share my debugging in case others would benefit in some way.

Tracing the calls and returns, it seems that inside fit, a local variable logs is first defined inside a for loop iterating over DataHandler.steps(), which in many cases yields a number of iterations that depends on the cardinality of the Dataset. If the Dataset cardinality is zero, then DataHandler.steps() performs 0 iterations, logs is never created, and we get the UnboundLocalError we're all seeing.

The cardinality of my Dataset was being returned as 0 because I had created it from a .take() called on a BatchDataset that had a batch_size > 1 – this was an artifact of a previous iteration of my code. (Worth noting that when .take() was called on the BatchDataset with batch_size of 1, then my Dataset.cardinality() was nonzero.)

After cleaning the code (in my case, calling take on a Dataset that had not been batched), my Dataset.cardinality() returned a nonzero value, and the error was resolved.
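In case it helps others, printing Dataset.cardinality() before calling fit makes this visible. A sketch with a made-up 10-element dataset (drop_remainder is assumed here to reproduce the zero-cardinality case):

import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Batching 10 elements with batch_size=16 and drop_remainder=True yields 0 full batches,
# so a take() on top of it also has cardinality 0 and fit() would perform 0 steps.
print(ds.batch(16, drop_remainder=True).take(4).cardinality().numpy())  # 0

# Taking from the unbatched dataset first, then batching, keeps a nonzero cardinality.
print(ds.take(4).batch(4).cardinality().numpy())  # 1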

I had the same error when using model.fit; it turned out that validation_steps or steps_per_epoch was zero. By making them >= 1, the bug disappeared.
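A defensive way to set those arguments is to clamp them to at least 1 (a sketch; the sample counts and batch_size are made up):

import math

batch_size = 32
num_train_samples = 500
num_val_samples = 20  # fewer samples than one batch

steps_per_epoch = max(1, num_train_samples // batch_size)
validation_steps = max(1, math.ceil(num_val_samples / batch_size))
print(steps_per_epoch, validation_steps)  # 15 1
# model.fit(..., steps_per_epoch=steps_per_epoch, validation_steps=validation_steps)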

I am surprised there is no self-contained code example reproducing this, so here is one:

import numpy as np
from tensorflow import keras
model = keras.models.Sequential([keras.layers.Dense(1, input_shape=(1,))])
model.compile(loss="mse", optimizer="adam")
model.fit(x=np.ones((1, 1)), y=np.ones((1, 1)), validation_split=0.5)
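# Note: with a single sample and validation_split=0.5, the training split ends up empty,
# so fit() performs zero steps and raises the UnboundLocalError described above.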

This error is also raised when len(traindata) or len(validdata) is 0.


In my case, I read my tfrecords incorrectly but didn’t get any warnings or errors. After reading the record, you can try printing its size as a sanity check.
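Something along these lines works as a quick check (a sketch; the filename is hypothetical):

import tensorflow as tf

dataset = tf.data.TFRecordDataset("my_records.tfrecord")  # hypothetical file

# Count the records by iterating once; an empty or misread file shows up as 0 here.
num_records = sum(1 for _ in dataset)
print(f"dataset contains {num_records} records")
if num_records == 0:
    raise ValueError("TFRecord file is empty or was not read correctly")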

Same problem during training BiT, with batch size 128 and steps_per_epoch = 100. Edit: I threw out validation_data and it works, but I want to validate the model.