wandb: [CLI]: v0.13.1 Training results in IndexError: pop from empty list when using WandbCallback
Describe the bug
Training starts, but never completes due to the bug. At the end of the training phase I get the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-30-f5faf23168b9> in <module>
1 results = model.fit(z_train, y_train, batch_size=config.batch_size, epochs=config.epochs, callbacks=callbacks,\
----> 2 validation_data=(z_test, y_test))
~/.local/lib/python3.6/site-packages/wandb/integration/keras/keras.py in new_v2(*args, **kwargs)
171 for cbk in cbks:
172 set_wandb_attrs(cbk, val_data)
--> 173 return old_v2(*args, **kwargs)
174
175 training_arrays.orig_fit_loop = old_arrays
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
106 def _method_wrapper(self, *args, **kwargs):
107 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
--> 108 return method(self, *args, **kwargs)
109
110 # Running inside `run_distribute_coordinator` already.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1144 del self._eval_data_handler
1145 callbacks.on_train_end(logs=training_logs)
-> 1146 return self.history
1147
1148 def test_step(self, data):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py in __exit__(self, exception_type, exception_value, traceback)
425 "tf.distribute.set_strategy() out of `with` scope."),
426 e)
--> 427 _pop_per_thread_mode()
428
429
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribution_strategy_context.py in _pop_per_thread_mode()
63
64 def _pop_per_thread_mode():
---> 65 ops.get_default_graph()._distribution_strategy_stack.pop(-1) # pylint: disable=protected-access
66
67
IndexError: pop from empty list
When I disable WandbCallback(), everything works fine. I use 2 GPUs on a single machine via MirroredStrategy in TensorFlow.
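The last frame of the traceback pops from `_distribution_strategy_stack` on whatever `ops.get_default_graph()` returns at the moment the scope exits. As a rough illustration of how that kind of pop can hit an empty list, here is a toy model in plain Python. `Graph`, `StrategyScope`, and the helper functions are hypothetical stand-ins, not TensorFlow or wandb code, and the graph swap inside the scope is an assumption made for illustration, not a confirmed cause of this issue.

```python
# Toy model only: every name here is a hypothetical stand-in,
# not TensorFlow or wandb code.


class Graph:
    """Stand-in for a graph object that owns its own strategy stack."""

    def __init__(self):
        self.strategy_stack = []


# Module-level "default graph", playing the role that
# ops.get_default_graph() plays in the traceback.
_default_graph = Graph()


def get_default_graph():
    return _default_graph


def set_default_graph(graph):
    global _default_graph
    _default_graph = graph


class StrategyScope:
    """Pushes onto the current default graph on enter, pops on exit."""

    def __enter__(self):
        get_default_graph().strategy_stack.append("strategy")
        return self

    def __exit__(self, exc_type, exc_value, tb):
        # The default graph is looked up again here. If it was replaced
        # after __enter__, the stack on the new graph is empty.
        get_default_graph().strategy_stack.pop(-1)
        return False


def run_without_graph_swap():
    """Enter and leave the scope normally; returns None when no error occurs."""
    try:
        with StrategyScope():
            pass
    except IndexError as err:
        return str(err)
    return None


def run_with_graph_swap():
    """Replace the default graph inside the scope (assumed scenario)."""
    try:
        with StrategyScope():
            set_default_graph(Graph())
    except IndexError as err:
        return str(err)
    return None
```

In this sketch, `run_without_graph_swap()` completes cleanly, while `run_with_graph_swap()` returns the message of the `IndexError` raised in `__exit__`. The point is only that a push and a pop that resolve their stack separately can get out of sync; whether anything like this happens in the reported setup is not established here.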
Environment
WandB version: 0.13.1
OS: Ubuntu 18.04.5 LTS
Python version: 3.6.9
Versions of relevant libraries: tensorflow v2.3.1+nv
About this issue
- State: closed
- Created 2 years ago
- Comments: 15
Hi @alikaanguven, @bermeitinger-b, @jnatale11. We have identified the regression that this issue stems from and have begun work on a fix. We will provide an update once it’s released.
Hi @bermeitinger-b, @alikaanguven, thank you both for the feedback. We are still investigating and will work on reproducing the issue soon.
Let me hijack this post. I have the same issue, which flags each logged run as failed even though it has finished successfully.
My model is a very simple convnet (VGG-like). The layout doesn’t matter; it happens for any model. I guess it has something to do with the strategy used for model parallelism during training.
Attached logs: debug.log, debug-internal.log
This error only appears when the model is created inside a distribution strategy scope, on one or more GPUs.
(Obviously, removing the strategy is not a workaround when using more than one GPU.)