Every wandb command yields "An error occurred in MPI_Init_thread"

I have a user on a multi-user server whose wandb installation appears to be broken. I have forced him to reinstall with

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade --force-reinstall wandb

after which he has wandb-0.13.5.

But this error persists; even the simplest wandb command yields an MPI_Init_thread error:

$ wandb --version
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[xhostnamex:1359609] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

On the same machine, my wandb installation works fine:

$ wandb --version
wandb, version 0.13.5

Apologies if this is a duplicate, but I could not find this exact problem documented anywhere.

Hey @andravin,

Is this the full stack trace for the error? If so, please run wandb.init() in a Python file and share the full error trace from there, along with the debug.log and debug-internal.log files from the wandb folder in the directory where you ran this process.

Thanks,
Ramit

Hi @ramit_goolry, I'm the user who @andravin mentioned.

Here is what I did

moshin@pc$ python -V
Python 3.8.10
moshin@pc$ ls
testwandb.py
moshin@pc$ cat testwandb.py 
import wandb

wandb.init()
moshin@pc$ python testwandb.py 
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[pc:1623678] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
moshin@pc$ ls
testwandb.py

It seems the process aborts before the wandb folder is created, so there are no debug.log or debug-internal.log files to share.

We determined the error occurs on every node in the cluster, but only when the user session is created with SLURM. A typical command-line to create the SLURM interactive session is srun --pty bash.

The error is always reproducible simply by running import wandb:

$ python3 -c 'import wandb'
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-name:1431860] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

If the user connects to the node with ssh, then import wandb works as expected.

Also, a different user with the same exact python packages installed via pip does not have this issue.
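Since the failure depends on how the session is started, one way to narrow it down is to compare the launcher-related environment of an srun session with that of an ssh session. The sketch below is only an illustration: the variable prefixes are an assumption about what SLURM and MPI launchers commonly export, and nothing in this thread confirms that the environment is the cause. The function names snapshot_env and diff_env are made up for this example.

```python
import json
import os

# Assumed prefixes: variables commonly exported by SLURM and MPI/PMI launchers.
PREFIXES = ("SLURM_", "OMPI_", "PMI_", "PMIX_")


def snapshot_env(environ=None, prefixes=PREFIXES):
    """Return the launcher-related environment variables as a plain dict."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith(prefixes)}


def diff_env(a, b):
    """Return keys found only in a, only in b, and keys whose values differ."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed


if __name__ == "__main__":
    # Run once under srun and once under ssh, then compare the two outputs.
    print(json.dumps(snapshot_env(), indent=2, sort_keys=True))
```

The script deliberately does not import wandb, so it still runs in the session where the import aborts. Repeating the comparison for the affected user and for a user who does not see the error would show whether the two srun environments differ.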

Hi @andravin!

Any chance you can try mpirun -np [NUM] instead of srun here? This looks to be an issue with srun on your SLURM cluster specifically, not with wandb. I would check your installation of srun/mpirun to make sure everything is running as expected there.

Do you see the same error when importing any other libraries?
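To check other libraries without losing the session to the MPI abort, each import can be run in its own interpreter. This is a minimal sketch, not a wandb tool: try_import is a hypothetical helper, and which modules to test is left to whoever runs it (for example, the packages that pip lists as wandb's dependencies).

```python
import subprocess
import sys


def try_import(module_name, timeout=60):
    """Import module_name in a fresh interpreter; return (ok, stderr_text)."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # A non-zero exit code covers both Python exceptions and hard aborts.
    return result.returncode == 0, result.stderr


if __name__ == "__main__":
    # Module names are taken from the command line.
    for name in sys.argv[1:]:
        ok, err = try_import(name)
        print(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            print(err.strip())
```

Because each import happens in a child process, an abort inside one module does not stop the loop, and the captured stderr shows whether the failure is the same MPI_Init_thread message.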

Thanks,
Ramit

Hi!

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

@ramit_goolry Thanks for the suggestion, our IT department will look into our SLURM and MPI configuration. It would appear that the issue is specific to the one user who sees the error, because WandB works fine for everyone else. So I would think it has to be a user-id inconsistency or a permissions problem.
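A user-id or permissions mismatch of this kind could be checked by printing the same identity facts in an srun session and in an ssh session and comparing them. The following is a rough sketch under that assumption; identity_report is a hypothetical helper, it uses only the Python standard library on a Unix system, and the fields it collects are a guess at what might be relevant.

```python
import os
import pwd
import tempfile


def identity_report():
    """Collect user identity and basic write-permission facts for this session."""
    uid = os.getuid()
    try:
        entry = pwd.getpwuid(uid)
        user, passwd_home = entry.pw_name, entry.pw_dir
    except KeyError:
        # The uid has no passwd entry on this node, itself a sign of inconsistency.
        user, passwd_home = None, None
    home = os.path.expanduser("~")
    tmp = tempfile.gettempdir()
    return {
        "uid": uid,
        "gid": os.getgid(),
        "groups": sorted(os.getgroups()),
        "user": user,
        "passwd_home": passwd_home,
        "home": home,
        "home_writable": os.access(home, os.W_OK),
        "tmpdir": tmp,
        "tmpdir_writable": os.access(tmp, os.W_OK),
    }


if __name__ == "__main__":
    for key, value in identity_report().items():
        print(f"{key}: {value}")
```

Any field that differs between the two sessions, or between the affected user and a working one, would be a starting point for the IT department.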

Sounds good! Let us know if you guys have any more concerns related to wandb!
