wandb: [CLI]: Offline run folders appear without required information and can't be synced

Describe the bug

Offline runs don’t always log the information needed for them to be synced, so the information they generate is effectively lost. This happens for around 50% of the offline runs on my cluster, so it’s a large issue. So far, I haven’t found any factor that correlates with a run working or not. When I run the same code but as an online run, everything works as expected.

More precisely, when the bug occurs, the offline-run-... folder created in the wandb directory is missing files (see Additional Files for details). When I try to sync one such folder with wandb sync, I get

(py310IMLESSL) [tme3@narval3 wandb]$ wandb sync offline-run-20230517_061457-bz3nyzr9
Find logs at: /tmp/debug-cli.tme3.log
Skipping directory: /lustre06/project/6061877/tme3/IMLE-SSL/wandb/offline-run-20230517_061457-bz3nyzr9

The relevant logs are apparently empty. I also don’t find mention of the run in debug-cli.tme3.log, while the UIDs of runs that did work correctly do appear there.

I would very much appreciate help solving this! Please let me know if there is other information that would be useful.

Additional Files

To fully explain what I can see of the issue, here are two runs with identical arguments. For each, I've attached the SLURM submission script, the job output file, and the zipped run directory, first for a run that did work and then for one that didn't. Note that the submission scripts and output files are effectively the same, up to compute node, time, and wandb UID details.

Job that did work correctly: good_job_results.txt, good_job_submission_script.txt, offline-run-20230517_070615-a25xgj3m.zip

Job that didn't work correctly: bad_job_submission_script.txt, bad_job_results.txt, offline-run-20230517_061457-bz3nyzr9.zip

The major thing to note about the offline run directory that was created is that it contains only the images I wanted to log and none of the other files. That is, its top level should contain the following five entries (files, logs, run-a25xgj3m.wandb, tmp, wandb) but instead has just one: files.

(Also: due to GitHub's size limits, I had to unzip the archives and delete some of the images so that the zip files would fit.)
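Based on the directory layouts above, here is a minimal sketch (not part of my training code) that scans a wandb directory for offline run folders missing their top-level run-<id>.wandb file; the file-name pattern is inferred from the contents of the run that did work, and WANDB_DIR is a placeholder path:

import glob
import os

WANDB_DIR = "wandb"  # hypothetical path to the wandb directory on the cluster

for run_dir in sorted(glob.glob(os.path.join(WANDB_DIR, "offline-run-*"))):
    # Folder names look like offline-run-<datetime>-<uid>
    run_id = run_dir.rsplit("-", 1)[-1]
    wandb_file = os.path.join(run_dir, f"run-{run_id}.wandb")
    status = "ok" if os.path.exists(wandb_file) else "MISSING run-*.wandb"
    print(f"{run_dir}: {status}")

I'd expect the folders reported as missing the .wandb file to be exactly the ones that wandb sync skips.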

Environment

WandB version: 0.15.1

OS: Linux with a Lustre filesystem (likely a customized distribution, since it's the Narval cluster of Compute Canada)

Python version: 3.10.11

Versions of relevant libraries:

Additional Context

No response

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

So far I have two successes and zero failures with v0.15.4. I’ll follow up if there are subsequent failures as I’ll be using the cluster fairly heavily this week.

Unfortunately it happened again; I’m continuing to investigate with wandb=0.15.4.

Here’s my wandb.init(), with symlink=False as suggested. args is an argparse Namespace:

wandb.init(anonymous="allow", id=args.uid, config=args,
        mode=args.wandb, project="3MRL", entity="apex-lab",
        name=os.path.basename(model_folder(args)),
        resume="allow" if args.continue_run else "never",
        settings=wandb.Settings(code_dir=os.path.dirname(__file__), symlink=False))

With conda-installed wandb=0.15.1, this appears to fix the issue! I will test the newer WandB version soon, though I suspect I will have to install it via conda, as pip packages are in my experience less likely to work on the cluster in question.
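For anyone else hitting this on a similar filesystem, the change boils down to something like the following minimal sketch; the project name is a placeholder, and the comment on symlink=False reflects my understanding of the suggestion above rather than anything authoritative:

# Minimal sketch of the workaround, isolated from the rest of my init call.
# symlink=False asks wandb not to create symlinks in the wandb directory,
# which seems to be what avoids the problem on the Lustre filesystem here.
import wandb

run = wandb.init(
    project="my-project",  # placeholder
    mode="offline",        # log locally; sync later with `wandb sync`
    settings=wandb.Settings(symlink=False),
)
run.log({"example_metric": 0.0})
run.finish()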