wandb: [CLI] wandb sync fails to upload reports from crashed scripts (AssertionError)
Describe the bug
If I run a script and then terminate it (e.g. with Ctrl-C), or it crashes for some other reason, I cannot use wandb sync to re-upload the wandb logs.
$ wandb sync -p <MY-PROJECT> --id <RUN-ID> .
Syncing: https://wandb.ai/<MY-ID>/<MY-PROJECT>/runs/<RUN-ID> ...Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/allabana/.virtualenvs/sttpy/lib/python3.8/site-packages/wandb/sync/sync.py", line 122, in run
data = ds.scan_data()
File "/home/allabana/.virtualenvs/sttpy/lib/python3.8/site-packages/wandb/sdk/internal/datastore.py", line 131, in scan_data
record = self.scan_record()
File "/home/allabana/.virtualenvs/sttpy/lib/python3.8/site-packages/wandb/sdk/internal/datastore.py", line 115, in scan_record
assert checksum == checksum_computed
AssertionError
To Reproduce
I am training a model using PyTorch Lightning. This should be reproducible with any PTL example.
- Run a train script, passing the wandb logger to the PTL trainer.
from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger
from pytorch_lightning import Trainer
...
# Configure model and data_loader
...
logger = [TensorBoardLogger(save_dir="my_logs", name='lightning_logs')]
logger.append(WandbLogger(project="my_project", log_model=True))
trainer = Trainer(
    ...
    logger=logger,
)
trainer.fit(model, data_loader)
- Kill the script mid-run
- Attempt to re-upload logs from the wandb directory (mine looks like this)
files
logs
run-<RUN-ID>.wandb
wandb
Expected behavior
Logs should upload.
Desktop (please complete the following information):
Ubuntu 20.04, Python 3.8, wandb 0.10.17
Additional context
Maybe I’m not running the sync command properly? An example in the docs would be really helpful!!
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 27 (7 by maintainers)
I think the majority of people's use cases for wandb sync in this thread are online runs that are still running but are reported as crashed and never automatically resynchronize.
There was a period of time when I was experiencing a large amount of packet loss on my network, which put every run into this state (days' worth of training lost and unable to synchronise).
Perhaps it would be useful to take a poll or gather metrics on how the majority of users run wandb sync? To me, not losing data is the most important thing, so if you only plan on supporting offline sync, I would probably run all my runs offline by default. That reduces the attractiveness of the service to me: I would no longer have live visibility into my training (through wandb at least) to do things like kill runs that aren't working out for one reason or another.
Just wanted to offer some additional thoughts you might want to take into consideration.
I have the same problem. Logs for the run: https://gist.github.com/rubencart/7cdc93b66db56ffd55104391c1ac7ad0.
I am having this same issue trying to upload from a run that has crashed according to wandb, on wandb version 0.10.25. More context on my issue: https://github.com/wandb/client/issues/1526#issuecomment-818763611
I am having this issue, tested on version 0.10.21
I wanted to clarify what I meant by not supporting online runs. We want you to always be able to run wandb sync for online runs after they have completed (in the case that not all data was streamed to the server). The offline support is only for enabling users to run wandb sync on offline runs while they're running. The reason for not supporting wandb sync on a currently running online run is that it introduces multiple threads trying to sync the same data to our backend, which we simply cannot support.
@alek5k we 110% agree that wandb should never lose a user's data, and we're currently working to further battle-test and harden all code paths in both online and offline mode to ensure users don't get into a state where data is lost.
Still cannot sync my run with wandb sync because of this AssertionError, on wandb version 0.10.28. The run is still running, but marked on wandb.ai as crashed.
I am experiencing the same issue. It seems worthwhile to ensure that even when a run crashes, all previously logged data points can still be reported (except the interrupted one).
@piraka9011 Thank you for the report. This is something we have recently become aware of and have scheduled this for a subsequent release (preliminary target is 0.10.19)
The issue is that the log is incomplete due to being interrupted.
There are a few parts to the fix:
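To illustrate why an interrupted write produces this AssertionError, here is a simplified sketch of checksummed record scanning. The record layout below (4-byte CRC32, 4-byte length, payload) is a hypothetical stand-in, not wandb's actual on-disk format, but it shows the same failure mode: a crash mid-write leaves a trailing record whose stored checksum no longer matches the checksum computed over the truncated payload.

```python
import struct
import zlib

def write_record(buf: bytearray, data: bytes) -> None:
    # Hypothetical layout: 4-byte CRC32, 4-byte payload length, then payload.
    buf += struct.pack("<II", zlib.crc32(data), len(data)) + data

def scan_record(buf: bytes, offset: int):
    checksum, length = struct.unpack_from("<II", buf, offset)
    data = buf[offset + 8 : offset + 8 + length]
    checksum_computed = zlib.crc32(data)
    # Mirrors the failing assertion in datastore.py: a payload cut off
    # mid-write no longer matches the checksum stored in its header.
    assert checksum == checksum_computed
    return data, offset + 8 + length

buf = bytearray()
write_record(buf, b"step=1 loss=0.5")
write_record(buf, b"step=2 loss=0.4")

data, offset = scan_record(bytes(buf), 0)  # first record reads back fine
truncated = bytes(buf)[:-3]                # simulate a crash mid-write
try:
    scan_record(truncated, offset)         # trailing record fails the check
except AssertionError:
    print("AssertionError on incomplete trailing record")
```

A more forgiving scanner would treat a bad checksum on the final record as end-of-log and recover everything before it, rather than aborting the whole sync.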
I don’t think support for offline runs makes sense. I would prefer the WandB team invest time in a solution for online runs, since that is technically (one of) the main reason(s) to use WandB (i.e. the GUI and dashboard).
While the original issue for me was from an online run that crashed, I guess it would be good to re-sync runs that have lost connection, but I don’t think that’s the sync CLI command’s responsibility. Probably something to fix in the core lib (retrying connections, which I believe is already being done).
@yhn112 we have plans to fix the case of syncing a run that is in ‘offline’ mode while it is running. This will likely take 1-2 weeks. We do not intend to support running wandb sync on an ‘online’ run. If running sync on offline runs is your use case, we’ll keep the ticket updated here.
Facing the same issue with wandb==0.10.26 and python==3.7 on Ubuntu 18.04.5 LTS. As discussed in https://github.com/wandb/client/issues/1526#issuecomment-731408684 I think my problem is caused by a (sometimes) unstable internet connection. Training continues when wandb has crashed, but when the training run ends or I terminate it early I get the message wandb: \ 0.00MB of 0.00MB uploaded (0.00MB deduped) and it hangs until I close the terminal or press Ctrl+Z. Furthermore, it is not possible for me to use wandb sync afterwards to sync the log files that crashed. Here I get the same error as described by @jcoholich.