wandb: MPI_INIT failed when calling wandb.init()

  • Weights and Biases version: 0.8.34
  • Python version: 3.8.1
  • Operating System: Arch Linux (kernel 5.5.7.arch-1)

Description

When using slurm to init wandb, it seems that it fail to init mpi. Maybe the error is due to wandb try to init mpi again when calling wandb.init().

Here is the related issue https://github.com/open-mpi/ompi/issues/7025#issuecomment-536826728

What I Did

srun: job 52496 queued and waiting for resources
srun: job 52496 has been allocated resources
Python 3.8.1 (default, Jan 22 2020, 06:38:00)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import wandb
>>> wandb.init()
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: W&B is a tool that helps track and visualize machine learning experiments
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'
wandb: No credentials found.  Run "wandb login" to visualize your metrics
wandb: Tracking run with wandb version 0.8.35
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[s15.speech:244894] Local abort before MPI_INIT completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: Job step aborted: Waiting up to 182 seconds for job step to finish.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

Hey @pohanchi @Liangtaiwan we’re spinning up our own Slurm environment to reproduce this. In the mean time, can one or both of you try our new CLI which should work better in these environments? You can find it here: https://github.com/wandb/client-ng

pip install wandb-ng