wandb: [CLI] wandb sync fails to upload reports from crashed scripts (AssertionError)

Describe the bug

If I run a script and then terminate it (e.g. with Ctrl-C), or it crashes for some other reason, I cannot use wandb sync to re-upload the wandb logs.

$ wandb sync -p <MY-PROJECT> --id <RUN-ID> .

Syncing: https://wandb.ai/<MY-ID>/<MY-PROJECT>/runs/<RUN-ID> ...Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/allabana/.virtualenvs/sttpy/lib/python3.8/site-packages/wandb/sync/sync.py", line 122, in run
    data = ds.scan_data()
  File "/home/allabana/.virtualenvs/sttpy/lib/python3.8/site-packages/wandb/sdk/internal/datastore.py", line 131, in scan_data
    record = self.scan_record()
  File "/home/allabana/.virtualenvs/sttpy/lib/python3.8/site-packages/wandb/sdk/internal/datastore.py", line 115, in scan_record
    assert checksum == checksum_computed
AssertionError
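For context on the AssertionError: the run-<RUN-ID>.wandb file is an append-only log of checksummed records, so a process killed mid-write can leave a partial record whose stored checksum no longer matches the one recomputed from the bytes actually on disk. A minimal sketch of that failure mode, using a hypothetical [CRC32][length][payload] framing (not wandb's actual on-disk format):

```python
import struct
import zlib

def pack_record(data: bytes) -> bytes:
    # Hypothetical framing: 4-byte CRC32, 4-byte length, then the payload.
    return struct.pack("<II", zlib.crc32(data), len(data)) + data

def scan_record(buf: bytes, offset: int = 0):
    # Read one record starting at `offset`; return (payload, next_offset).
    header = buf[offset:offset + 8]
    if len(header) < 8:
        raise EOFError("truncated header")
    checksum, length = struct.unpack("<II", header)
    data = buf[offset + 8:offset + 8 + length]
    if len(data) < length:
        raise EOFError("truncated payload")
    if zlib.crc32(data) != checksum:
        # The class of failure seen in the traceback above:
        # stored and recomputed checksums disagree.
        raise AssertionError("checksum mismatch (corrupt or partial record)")
    return data, offset + 8 + length
```

A file cut off mid-record trips either the truncation or the checksum branch, which is why a reader that bare-asserts on the checksum cannot get past the damaged tail.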

To Reproduce

I am training a model using PyTorch Lightning. This should be reproducible on any example with PTL.

  1. Run a train script, passing the wandb logger to the PTL trainer.
from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger
from pytorch_lightning import Trainer
...
# Configure model and data_loader
...
logger = [TensorBoardLogger(save_dir="my_logs", name='lightning_logs')]
logger.append(WandbLogger(project="my_project", log_model=True))

trainer = Trainer(
    ...,  # other Trainer arguments elided
    logger=logger,
)
trainer.fit(model, data_loader)
  2. Kill the script mid-run
  3. Attempt to re-upload logs from the wandb directory (mine looks like this)
files
logs
run-<RUN-ID>.wandb
wandb

Expected behavior

Logs should upload.

Desktop (please complete the following information):

Ubuntu 20.04, Python 3.8, wandb 0.10.17

Additional context

Maybe I’m not running the sync command properly? An example in the docs would be really helpful!!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 27 (7 by maintainers)

Most upvoted comments

I think the majority of people's use cases for wandb sync in this thread involve online runs that are still running but are reported as crashed and never automatically resynchronize.

There was a period of time when I was experiencing heavy packet loss on my network, which put every run into this state (days' worth of training lost and unable to synchronise).

Perhaps it would be useful to take a poll or gather some metrics on how the majority of users run wandb sync? To me, not losing data is the most important thing, so if you only plan on supporting offline sync, that would probably mean running all my runs offline by default. That reduces the attractiveness of the service to me: I would no longer have live visibility into my training (through wandb, at least) to do things like kill runs that aren't working out for one reason or another.

Just wanted to offer some additional thoughts you might want to take into consideration.

I am having this same issue trying to upload from a run that has crashed according to wandb, on wandb version 0.10.25. More context on my issue: https://github.com/wandb/client/issues/1526#issuecomment-818763611

I am having this issue, tested on version 0.10.21

I wanted to clarify what I meant by not supporting online runs. We want you to always be able to run wandb sync for online runs after they have completed (in the case that not all data was streamed to the server). The offline support is only for enabling users to run wandb sync on the offline runs while they’re running.

The reason for not supporting calling wandb sync on a currently running online run is that it would introduce multiple threads trying to sync the same data to our backend, which we simply cannot support.

@alek5k we 110% agree that wandb should never lose a users data and we’re currently working to further battle test and harden all code paths in both online and offline mode to ensure users don’t get into a state where data is lost.

Still cannot sync my run with wandb sync because of this AssertionError with wandb version 0.10.28. This run is still running, but marked on wandb.ai as crashed.

I am experiencing the same issue. It seems worthwhile to make it so that even when a run crashes, all previously logged data points can still be reported (except the interrupted one).

@piraka9011 Thank you for the report. This is something we have recently become aware of and have scheduled this for a subsequent release (preliminary target is 0.10.19)

The issue is that the log is incomplete due to being interrupted.

There are a few parts to the fix:

  1. minimize the data that might not be flushed to the disk at the time of the control-c (time and data size based flush)
  2. try harder at control-c time to close the file more gracefully
  3. let wandb sync handle errors at end of stream
  4. give an override that allows sync to ignore all errors and make as much progress as possible
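Fix items 3 and 4 above can be sketched as a scanner that treats a bad trailing record as a recoverable condition rather than a fatal assertion, salvaging everything written before the interruption. This is illustrative only: it uses a hypothetical [CRC32][length][payload] framing, not wandb's real datastore API.

```python
import struct
import zlib

def iter_records(buf: bytes, ignore_errors: bool = False):
    # Yield intact records from a log that may have been cut off mid-write.
    offset = 0
    while offset + 8 <= len(buf):
        checksum, length = struct.unpack_from("<II", buf, offset)
        payload = buf[offset + 8:offset + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != checksum:
            if ignore_errors:
                # Fix item 4: an override that salvages everything
                # before the first bad record.
                return
            # Fix item 3: surface a recoverable error the caller can
            # handle at end of stream, instead of a bare assert.
            raise ValueError(f"bad record at byte {offset}")
        yield payload
        offset += 8 + length
```

With ignore_errors=True, a sync-like caller would upload all complete records and simply drop the partial tail, which is the behavior several commenters here are asking for.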

I don’t think support for offline runs makes sense. I would prefer the WandB team invest time in a solution for online runs, since that is technically (one of) the main reason(s) to use WandB (i.e., the GUI and dashboard).

While the original issue for me was from an online run that crashed, I guess it would be good to re-sync runs that have lost connection, but I don’t think that’s the sync CLI command’s responsibility. Probably something to fix in the core lib (retrying connections which I believe is already being done).

@yhn112 we have plans to fix the case of syncing a run that is in ‘offline’ mode while it is running. This will likely take 1-2 weeks. We do not intend to support running wandb sync on an ‘online’ run. If running sync on offline runs is your usecase, we’ll keep the ticket updated here.

Facing the same issue with wandb==0.10.26 and python==3.7 on Ubuntu 18.04.5 LTS.

As discussed in https://github.com/wandb/client/issues/1526#issuecomment-731408684, I think my problem is caused by a (sometimes) unstable internet connection. Training continues when wandb has crashed, but when the training run ends or I terminate it early, I get the message wandb: \ 0.00MB of 0.00MB uploaded (0.00MB deduped) and it hangs until I close the terminal or press Ctrl+Z.

Furthermore, it is not possible for me to use wandb sync afterwards to sync the log files from the crashed runs. There I get the same error as described by @jcoholich.