Every wandb command yields "An error occurred in MPI_Init_thread"

I have a user on a multi-user server whose wandb installation appears to be broken. I had him force a reinstall with:

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade --force-reinstall wandb

after which he has wandb-0.13.5.

But this error persists; even the simplest wandb command yields an MPI_Init_thread error:

$ wandb --version
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[xhostnamex:1359609] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

On the same machine, my wandb installation works fine:

$ wandb --version
wandb, version 0.13.5

Apologies if this is a duplicate, but I could not find this exact problem documented anywhere.
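
Since the same command works for me and fails for him on the same machine, one thing worth comparing is what each account's Python actually resolves. Below is a minimal sketch that reports the interpreter path and where a module would be loaded from, without importing it (so it should not trigger the abort). The function name describe_environment is made up for illustration, and mpi4py is only a guess at a package that might be initializing MPI; nothing in the error output confirms it.

```python
import importlib.util
import sys


def describe_environment(modules=("wandb", "mpi4py")):
    """Report the interpreter and where each module would load from.

    find_spec() locates a top-level module without executing it, so this
    is safe to run even if importing the module aborts the process.
    """
    report = {"executable": sys.executable, "modules": {}}
    for name in modules:
        spec = importlib.util.find_spec(name)
        # None means the module is not visible to this interpreter.
        report["modules"][name] = spec.origin if spec is not None else None
    return report


if __name__ == "__main__":
    info = describe_environment()
    print("python:", info["executable"])
    for name, origin in info["modules"].items():
        print(f"{name}: {origin or 'not found'}")
```

Running it under both accounts and diffing the output would show whether the two users are picking up different interpreters or different site-packages directories.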

Hey @andravin,

Is this the full stack trace for the error? If so, please run wandb.init() in a Python file and share the full error trace from there, along with the debug.log and debug-internal.log files from the wandb folder in the directory where you ran the process.

Thanks,
Ramit

Hi @ramit_goolry, I'm the user who @andravin mentioned.

Here is what I did:

moshin@pc$ python -V
Python 3.8.10
moshin@pc$ ls
testwandb.py
moshin@pc$ cat testwandb.py 
import wandb

wandb.init()
moshin@pc$ python testwandb.py 
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[pc:1623678] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
moshin@pc$ ls
testwandb.py

It seems the process aborts before the wandb folder is created, so there are no debug.log or debug-internal.log files to share.
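
To narrow down where the abort happens, one option is to run each step in a child process and capture its exit code and stderr, so the MPI abort does not take down the script doing the checking. This is only a sketch: run_snippet is a hypothetical helper name, and it only tells apart "import wandb alone fails" from "wandb.init() fails", not why.

```python
import subprocess
import sys


def run_snippet(code, timeout=60):
    """Run a Python snippet in a child process; return (exit code, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    # If the first step already fails, the problem is triggered at import
    # time, before wandb.init() is ever reached.
    steps = [
        ("import only", "import wandb"),
        ("import + init", "import wandb; wandb.init()"),
    ]
    for label, code in steps:
        returncode, stderr = run_snippet(code)
        print(f"{label}: exit code {returncode}")
        if returncode != 0:
            print(stderr.strip())
            break
```

Given that even wandb --version fails here, I would expect the "import only" step to be the one that aborts, but that is a guess until it is actually run.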
