wandb: [CLI]: v0.13.1 Training results in IndexError: pop when using WandbCallback

Describe the bug

Training starts, but never completes. At the end of the training phase I get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-30-f5faf23168b9> in <module>
      1 results = model.fit(z_train, y_train, batch_size=config.batch_size, epochs=config.epochs, callbacks=callbacks,\
----> 2                     validation_data=(z_test, y_test))

~/.local/lib/python3.6/site-packages/wandb/integration/keras/keras.py in new_v2(*args, **kwargs)
    171             for cbk in cbks:
    172                 set_wandb_attrs(cbk, val_data)
--> 173         return old_v2(*args, **kwargs)
    174 
    175     training_arrays.orig_fit_loop = old_arrays

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
    106   def _method_wrapper(self, *args, **kwargs):
    107     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
--> 108       return method(self, *args, **kwargs)
    109 
    110     # Running inside `run_distribute_coordinator` already.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1144         del self._eval_data_handler
   1145       callbacks.on_train_end(logs=training_logs)
-> 1146       return self.history
   1147 
   1148   def test_step(self, data):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py in __exit__(self, exception_type, exception_value, traceback)
    425                          "tf.distribute.set_strategy() out of `with` scope."),
    426             e)
--> 427     _pop_per_thread_mode()
    428 
    429 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribution_strategy_context.py in _pop_per_thread_mode()
     63 
     64 def _pop_per_thread_mode():
---> 65   ops.get_default_graph()._distribution_strategy_stack.pop(-1)  # pylint: disable=protected-access
     66 
     67 

IndexError: pop from empty list

When I disable WandbCallback() everything works fine. I use 2 GPUs on a single machine via tf.distribute.MirroredStrategy.
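For context, the failing call in _pop_per_thread_mode is a plain list.pop(-1) on TensorFlow's per-graph strategy stack; once that stack is empty, the pop raises exactly the error above. A standalone snippet (no TensorFlow required) reproducing just the error message:

```python
# Standalone illustration of the exact failure: list.pop(-1) on an
# empty list raises IndexError with the message seen in the traceback.
stack = []
try:
    stack.pop(-1)
except IndexError as e:
    print(e)  # prints: pop from empty list
```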

Additional Files

No response

Environment

WandB version: 0.13.1

OS: Ubuntu 18.04.5 LTS

Python version: 3.6.9

Versions of relevant libraries: tensorflow v2.3.1+nv

Additional Context

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15

Most upvoted comments

Hi @alikaanguven, @bermeitinger-b, @jnatale11. We have identified the regression this issue stems from and have begun work on a fix; we will provide an update once it’s released.

Hi @bermeitinger-b, @alikaanguven, thank you both for the feedback. We are still investigating and will work on reproducing this soon.

Let me hijack this post: I have the same issue, which flags each logged run as failed even though it finished successfully.

  File "/home/user/code/simple_vgg.py", line 158, in <module>
    run()
  File "/home/user/code/simple_vgg.py", line 149, in run
    model.fit(
  File "/home/user/.local/lib/python3.10/site-packages/wandb/integration/keras/keras.py", line 173, in new_v2
    return old_v2(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/user/.local/lib/python3.10/site-packages/tensorflow/python/distribute/distribution_strategy_context.py", line 62, in _pop_per_thread_mode
    ops.get_default_graph()._distribution_strategy_stack.pop(-1)  # pylint: disable=protected-access
IndexError: pop from empty list

My model is a very simple convnet (VGG-like); the layout doesn’t matter, it happens for any model. I guess it has something to do with the distribution strategy (MirroredStrategy does synchronous data-parallel training).

debug.log
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_setup.py:_flush():76] Configure stats pid to 1
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_setup.py:_flush():76] Loading settings from /home/user/.config/wandb/settings
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_setup.py:_flush():76] Loading settings from /home/user/code/wandb/settings
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'mode': 'offline', '_require_service': 'True'}
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'simple_vgg.py', 'program': '/home/user/code/simple_vgg.py'}
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_init.py:_log_setup():461] Logging user logs to /home/user/code/wandb/offline-run-20220822_130825-3bgspxet/logs/debug.log
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_init.py:_log_setup():462] Logging internal logs to /home/user/code/wandb/offline-run-20220822_130825-3bgspxet/logs/debug-internal.log
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_init.py:init():495] calling init triggers
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_init.py:init():498] wandb.init called with sweep_config: {}
config: {'python': '310', 'tensorflow': '29', 'dataset': 'mnist', 'num_gpus': 1}
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_init.py:init():548] starting backend
2022-08-22 13:08:25,593 INFO    MainThread:1 [wandb_init.py:init():552] setting up manager
2022-08-22 13:08:25,599 INFO    MainThread:1 [backend.py:_multiprocessing_setup():97] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2022-08-22 13:08:25,602 INFO    MainThread:1 [wandb_init.py:init():558] backend started and connected
2022-08-22 13:08:25,605 INFO    MainThread:1 [wandb_init.py:init():636] updated telemetry
2022-08-22 13:08:25,608 INFO    MainThread:1 [wandb_init.py:init():700] starting run threads in backend
2022-08-22 13:08:26,142 INFO    MainThread:1 [wandb_run.py:_console_start():1962] atexit reg
2022-08-22 13:08:26,142 INFO    MainThread:1 [wandb_run.py:_redirect():1820] redirect: SettingsConsole.WRAP_RAW
2022-08-22 13:08:26,142 INFO    MainThread:1 [wandb_run.py:_redirect():1885] Wrapping output streams.
2022-08-22 13:08:26,142 INFO    MainThread:1 [wandb_run.py:_redirect():1907] Redirects installed.
2022-08-22 13:08:26,142 INFO    MainThread:1 [wandb_init.py:init():732] run started, returning control to user process
2022-08-22 13:08:49,053 WARNING MsgRouterThr:1 [router.py:message_loop():76] message_loop has been closed
debug-internal.log
2022-08-22 13:08:25,618 INFO    StreamThr :267 [internal.py:wandb_internal():90] W&B internal server running at pid: 267, started at: 2022-08-22 13:08:25.617826
2022-08-22 13:08:25,619 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: status
2022-08-22 13:08:25,621 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: status
2022-08-22 13:08:25,623 INFO    WriterThread:267 [datastore.py:open_for_write():75] open: /home/user/code/wandb/offline-run-20220822_130825-3bgspxet/run-3bgspxet.wandb
2022-08-22 13:08:25,624 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: run_start
2022-08-22 13:08:26,096 DEBUG   HandlerThread:267 [meta.py:__init__():35] meta init
2022-08-22 13:08:26,096 DEBUG   HandlerThread:267 [meta.py:__init__():50] meta init done
2022-08-22 13:08:26,096 DEBUG   HandlerThread:267 [meta.py:probe():216] probe
2022-08-22 13:08:26,108 DEBUG   HandlerThread:267 [meta.py:_setup_git():206] setup git
2022-08-22 13:08:26,113 DEBUG   HandlerThread:267 [meta.py:_setup_git():213] setup git done
2022-08-22 13:08:26,113 DEBUG   HandlerThread:267 [meta.py:_save_pip():54] save pip
2022-08-22 13:08:26,113 DEBUG   HandlerThread:267 [meta.py:_save_pip():68] save pip done
2022-08-22 13:08:26,114 DEBUG   HandlerThread:267 [meta.py:probe():254] probe done
2022-08-22 13:08:26,143 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: get_summary
2022-08-22 13:08:47,840 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: partial_history
2022-08-22 13:08:47,840 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: partial_history
2022-08-22 13:08:47,845 DEBUG   SenderThread:267 [sender.py:send():302] send: exit
2022-08-22 13:08:47,845 INFO    SenderThread:267 [sender.py:send_exit():438] handling exit code: 1
2022-08-22 13:08:47,845 INFO    SenderThread:267 [sender.py:send_exit():440] handling runtime: 22
2022-08-22 13:08:47,845 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-22 13:08:47,846 INFO    SenderThread:267 [sender.py:_save_file():1156] saving file wandb-summary.json with policy end
2022-08-22 13:08:47,846 INFO    SenderThread:267 [sender.py:send_exit():446] send defer
2022-08-22 13:08:47,846 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: poll_exit
2022-08-22 13:08:47,846 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,846 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 0
2022-08-22 13:08:47,846 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,846 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 0
2022-08-22 13:08:47,846 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 1
2022-08-22 13:08:47,847 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,847 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 1
2022-08-22 13:08:47,934 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,934 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 1
2022-08-22 13:08:47,935 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 2
2022-08-22 13:08:47,935 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,935 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 2
2022-08-22 13:08:47,935 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,935 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 2
2022-08-22 13:08:47,935 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 3
2022-08-22 13:08:47,935 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,935 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 3
2022-08-22 13:08:47,935 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,935 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 3
2022-08-22 13:08:47,936 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 4
2022-08-22 13:08:47,936 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,936 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 4
2022-08-22 13:08:47,936 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,936 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 4
2022-08-22 13:08:47,936 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 5
2022-08-22 13:08:47,936 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,936 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 5
2022-08-22 13:08:47,936 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 5
2022-08-22 13:08:47,936 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 6
2022-08-22 13:08:47,936 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,937 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 6
2022-08-22 13:08:47,937 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,937 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 6
2022-08-22 13:08:47,937 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 7
2022-08-22 13:08:47,937 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,937 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 7
2022-08-22 13:08:47,937 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,937 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 7
2022-08-22 13:08:47,937 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 8
2022-08-22 13:08:47,937 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,937 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 8
2022-08-22 13:08:47,937 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,937 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 8
2022-08-22 13:08:47,937 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 9
2022-08-22 13:08:47,937 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,937 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 9
2022-08-22 13:08:47,938 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,938 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 9
2022-08-22 13:08:47,938 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 10
2022-08-22 13:08:47,938 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,938 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 10
2022-08-22 13:08:47,938 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,938 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 10
2022-08-22 13:08:47,938 INFO    SenderThread:267 [sender.py:transition_state():459] send defer: 11
2022-08-22 13:08:47,938 DEBUG   SenderThread:267 [sender.py:send():302] send: final
2022-08-22 13:08:47,938 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: defer
2022-08-22 13:08:47,938 INFO    HandlerThread:267 [handler.py:handle_request_defer():164] handle defer: 11
2022-08-22 13:08:47,939 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: defer
2022-08-22 13:08:47,939 INFO    SenderThread:267 [sender.py:send_request_defer():455] handle sender defer: 11
2022-08-22 13:08:47,947 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-22 13:08:47,947 DEBUG   SenderThread:267 [sender.py:send_request():316] send_request: poll_exit
2022-08-22 13:08:48,048 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: sampled_history
2022-08-22 13:08:48,049 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: get_summary
2022-08-22 13:08:48,050 INFO    MainThread:267 [wandb_run.py:_footer_history_summary_info():3305] rendering history
2022-08-22 13:08:48,050 INFO    MainThread:267 [wandb_run.py:_footer_history_summary_info():3337] rendering summary
2022-08-22 13:08:48,051 DEBUG   HandlerThread:267 [handler.py:handle_request():141] handle_request: shutdown
2022-08-22 13:08:48,051 INFO    HandlerThread:267 [handler.py:finish():810] shutting down handler
2022-08-22 13:08:48,939 INFO    WriterThread:267 [datastore.py:close():279] close: /home/user/code/wandb/offline-run-20220822_130825-3bgspxet/run-3bgspxet.wandb
2022-08-22 13:08:48,947 INFO    SenderThread:267 [sender.py:finish():1312] shutting down sender
2022-08-22 13:08:49,053 INFO    MainThread:267 [internal.py:handle_exit():80] Internal process exited

This error only appears when the model is created inside a distribution strategy scope, even on a single GPU:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = get_model()

model.fit(...)

(Obviously, removing the strategy is not a workaround when training on more than one GPU.)
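One way a shared stack like this can be empty by the time a scope’s __exit__ runs is an extra pop happening somewhere inside the with block (for example, in a wrapper around fit). The toy sketch below is purely illustrative of that mechanism — the StrategyScope class is hypothetical, not wandb’s or TensorFlow’s actual code:

```python
class StrategyScope:
    """Toy stand-in for TF's per-thread distribution-strategy stack.

    Hypothetical class for illustration only: it mimics the push in
    __enter__ and the unconditional pop(-1) in __exit__ seen in
    _pop_per_thread_mode.
    """
    stack = []

    def __enter__(self):
        StrategyScope.stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Assumes the entry pushed in __enter__ is still on the stack.
        StrategyScope.stack.pop(-1)

try:
    with StrategyScope():
        # Simulate some other code popping one entry too many before
        # the scope exits.
        StrategyScope.stack.pop(-1)
except IndexError as e:
    print(e)  # prints: pop from empty list
```

The point is only that the scope’s own exit code becomes the place where the error surfaces, even when the extra pop happened elsewhere — which matches the traceback pointing at distribute_lib.py’s __exit__.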