wandb: [App] wandb sync for offline runs doesn't upload all data until run is finished.
Hi,
I’ve been using wandb sync to upload offline runs lately, with os.environ['WANDB_MODE'] = 'dryrun' set.
Somehow, if the run isn’t completed, all of my arguments (uploaded with wandb.config.update(args)) won’t appear till the end of the run.
A similar behaviour is happening when I upload metrics: they appear in the figures, but the data only appears in the table at the end of the run.
Let me know if you need more information about my setup and/or code.
Thanks!
About this issue
- State: open
- Created 3 years ago
- Reactions: 6
- Comments: 19 (1 by maintainers)
I faced the same issue.
I run my code on a Slurm cluster with time limits. When the time limit is reached, the whole run crashes immediately. After that, I run
wandb sync ./wandb/offline-run-xxx
and the metric figures are shown in the GUI normally; however, the table contains neither the config parameters nor the metrics.
The strange thing is that I also ran
wandb sync --view --verbose wandb/offline-run-xxx.wandb
and found that the config parameters are actually already contained in the xxx.wandb file. But the files/config.yaml file only contains system parameters such as software version information, without the experiment's parameters.
So I think this issue might be fixed by uploading the config parameters saved in the xxx.wandb file to the GUI table? It is so inconvenient to check the xxx.wandb file to obtain the config parameters after an offline run has crashed.
Hi @yihong-chen, the ticket is currently in our to-do queue, but I don’t have a timeline for it yet.
Hi @vanpelt. I have created a minimal example to explain the issue. I used signal.SIGTERM to mimic Slurm's behavior when the time limit is reached. I ran the following code twice, the second time with os.environ["WANDB_MODE"] = "offline" commented out.
After syncing an offline run, the Table page is as follows.
The output of the tree command is as follows.
Note that the arguments seed and epochs appear correctly in both wandb/run-20220328_093725-20tgsy55/run-20tgsy55.wandb and wandb/offline-run-20220328_093814-1q6a1n1p/run-1q6a1n1p.wandb.
Finally, I am using wandb==0.12.9.
Thank you.
I am also a Compute Canada user and run code on a Slurm platform. Slurm terminates a job once we exhaust the time limit. Oftentimes, this happens before the training is complete or wandb.finish() is called. Since the compute node doesn’t have an internet connection, we use offline mode for wandb and use wandb sync to sync an incomplete offline run. In this case, we cannot see the config in the wandb Overview and Table tabs. However, the config is stored in the run-xxxxxxxx.wandb file.
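Since the reports above all involve a job being terminated before wandb.finish() is called, one possible workaround is to trap SIGTERM and finalize the run before the process exits. The following is only a minimal sketch, not something confirmed in this thread: install_sigterm_finalizer is a hypothetical helper name, and it assumes the scheduler delivers SIGTERM with enough of a grace period for the callback to complete before a hard kill. Whether a run finalized this way then syncs its config correctly is untested here.

```python
import signal
import sys
from typing import Callable


def install_sigterm_finalizer(finalize: Callable[[], None]) -> None:
    """Register a SIGTERM handler that runs `finalize` and then exits.

    `finalize` is any zero-argument callable, for example wandb.finish.
    This helper is hypothetical and is not part of the wandb library.
    """

    def _handler(signum, frame):
        try:
            # Give the run a chance to close cleanly.
            finalize()
        finally:
            # Exit with the conventional 128 + signal number status.
            sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, _handler)


# Hypothetical usage in a training script (requires wandb to be installed):
#   import wandb
#   wandb.init(mode="offline", config={"seed": 0, "epochs": 10})
#   install_sigterm_finalizer(wandb.finish)
#   ... training loop ...
```

The handler must be installed from the main thread, since Python only allows signal.signal to be called there. If the job is killed with SIGKILL directly, no handler can run and this approach does not help.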