tensorflow: UnboundLocalError: local variable 'logs' referenced before assignment when training with very little data
I found an error caused by an attempt to copy training logs from a variable that has not yet been assigned.
The error occurred on my machine (Arch Linux, TensorFlow v2.2-rc2 compiled from source), and I managed to reproduce it on Colab in a stock environment.
It only happens when the model.fit method is called with very little training/eval data.
The logs variable is assigned inside a for loop that never runs when there is not enough data.
The code lives here: https://github.com/tensorflow/tensorflow/blob/e6e5d6df2ab26620548f35bf2e652b19f6d06652/tensorflow/python/keras/engine/training.py#L793
The notebook gist link for reproducing the bug: https://colab.research.google.com/gist/naripok/8ce09ec9c3e795b3635a6b1ac11ebd4b/tpu_transformer_model.ipynb
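To see the failure mode in isolation, here is a plain-Python sketch of the pattern (an illustration of the bug, not the actual TensorFlow code):

```python
# If the loop body never runs, the name `logs` is never bound,
# and the copy after the loop fails.
import copy

def fit(steps):
    for step in range(steps):
        logs = {"loss": 0.0}   # only assigned if at least one step runs
    return copy.copy(logs)     # UnboundLocalError when steps == 0

fit(0)  # -> UnboundLocalError: local variable 'logs' referenced before assignment
```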
@Tuxius and others who are stuck until this bug is fixed: I had the same issue and found that my validation data set had fewer samples than the batch_size. Because I'm working with TFRecord data sets, which carry no metadata about how many records (i.e. samples) they contain, I now check upfront whether the data set contains at least batch_size records (samples).
For that I use the following helper functions:
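(The helper code itself was not preserved in this copy of the thread; below is an assumed reconstruction based on the description. doesDataSetContainsEnoughDataForBatch is the name mentioned in the comment, while makeCheckedDataSet and the dataset-factory argument are hypothetical.)

```python
import tensorflow as tf

def doesDataSetContainsEnoughDataForBatch(dataset, batch_size):
    """Return True if `dataset` yields at least `batch_size` samples.
    Note: this reads up to `batch_size` samples from the data set."""
    count = 0
    for _ in dataset.take(batch_size):
        count += 1
    return count >= batch_size

# Hypothetical second helper: takes a factory so the data set can be recreated
# after the check, at the cost of reading `batch_size` samples twice.
def makeCheckedDataSet(make_dataset, batch_size):
    if not doesDataSetContainsEnoughDataForBatch(make_dataset(), batch_size):
        raise ValueError("Data set contains fewer samples than batch_size")
    return make_dataset()

# Example usage (the path and batch size are placeholders):
# ds = makeCheckedDataSet(lambda: tf.data.TFRecordDataset("train.tfrecord"), batch_size=32)
```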
Please keep in mind that by using the first helper function directly, i.e. doesDataSetContainsEnoughDataForBatch, you have already read batch_size samples from the data set, so you should recreate the data set after the check.
By using the second helper function you only lose some execution time upfront.
If you don't use TFRecord data sets you might have a similar issue; in that case it is also a good idea to check upfront whether you have enough data for at least one batch.
This is still reproducible for me on 2.2.
Check the number of samples in your training data. It might be zero.
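For example, one quick check (a sketch; train_ds stands in for your real input pipeline):

```python
import numpy as np
import tensorflow as tf

# `train_ds` stands in for your real training pipeline.
train_ds = tf.data.Dataset.from_tensor_slices(np.zeros((2, 4))).batch(8, drop_remainder=True)

# Number of batches the dataset will yield; 0 (or -2, UNKNOWN_CARDINALITY,
# for many TFRecord pipelines) is a red flag before calling model.fit.
print("batches:", tf.data.experimental.cardinality(train_ds).numpy())  # -> 0
```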
Same problem: is there a way to fix this, or any alternatives?
I have the same issue with TF v2.2.0-rc2
I solved my particular issue by setting the batch_size very small, in my case batch_size = 1. I can confirm that this happens with small datasets.
Are there any other workarounds known yet? The batch size one did not work out for me.
This is reproducible in TF 2.3. I am using a batch_size of 1 and no validation data. By the way, my custom model works when I run it in eager mode with run_eagerly=True.
Same problem. The root cause of this issue is that training cannot perform a single step because your dataset is not large enough to fill one training iteration.
Anyway, there is still a bug here; this is not the expected behavior.
Looks like this happens when the size of the training data is <= the batch size.
Encountered this error myself, and thought I'd share my debugging in case it benefits others.
Tracing the calls and returns, it seems that inside fit, a local variable logs is first defined inside a for loop iterating over DataHandler.steps(), which, in many cases, is a number dependent on the cardinality of the Dataset. If the Dataset cardinality is zero, then DataHandler.steps() performs 0 iterations, and logs will never be created, resulting in the UnboundLocalError we're all getting.
The cardinality of my Dataset was being returned as 0 because I had created it from a .take() called on a BatchDataset that had a batch_size > 1; this was an artifact of a previous iteration of my code. (Worth noting that when .take() was called on the BatchDataset with a batch_size of 1, my Dataset.cardinality() was nonzero.) After cleaning up the code (in my case, calling take on a Dataset that had not been batched), my Dataset.cardinality() returned a nonzero value and the error was resolved.
I had the same error when using model.fit; it turns out that validation_steps or steps_per_epoch was zero. By making them >= 1, the bug disappeared.
I am surprised there is no self-contained code example reproducing this, so here is one:
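(The snippet itself is not preserved in this copy of the thread; the following is an assumed minimal reproduction for TF 2.2/2.3: a dataset that yields zero batches.)

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# Fewer samples than the batch size plus drop_remainder=True means the dataset
# yields zero batches -> zero training steps -> `logs` is never assigned in fit().
x = np.random.rand(2, 4).astype("float32")
y = np.random.rand(2, 1).astype("float32")
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(8, drop_remainder=True)

model.fit(ds, epochs=1)  # raises UnboundLocalError: local variable 'logs' ...
```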
This error is also raised when len(traindata) or len(validdata) is 0.
In my case, I read my tfrecords incorrectly but didn't get any warnings or errors. After reading the records, you can try printing the dataset size as a sanity check.
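For example, a quick way to count records (a sketch; the file path is a placeholder):

```python
import tensorflow as tf

# "train.tfrecord" is a placeholder path; counting forces a full pass over
# the file, so this is only suitable as a one-off sanity check.
num_records = sum(1 for _ in tf.data.TFRecordDataset("train.tfrecord"))
print("records:", num_records)
```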
Same problem while training BiT. Batch size 128, steps_per_epoch = 100. Edit: I threw out validation_data and it works, but I still want to validate the model.