Every wandb command yields "An error occurred in MPI_Init_thread"

andravin · November 15, 2022, 11:40pm

I have a user on a multi-user server whose wandb installation appears to be broken. I had him force a reinstall with

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade --force-reinstall wandb

after which he has wandb-0.13.5.

But this error persists; even the simplest wandb command yields an MPI_Init_thread error:

$ wandb --version
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[xhostnamex:1359609] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

On the same machine, my wandb installation works fine:

$ wandb --version
wandb, version 0.13.5

Apologies if this is a duplicate, but I could not find this exact problem documented anywhere.

ramit_goolry · November 16, 2022, 6:01am

Hey @andravin,

Is this the full stack trace for the error? If so, please run wandb.init() in a Python file and share the full error trace from there, along with the debug.log and debug-internal.log files from the wandb folder in the directory where you ran this process.

Thanks,
Ramit

moshin · November 18, 2022, 1:37am

Hi @ramit_goolry , I’m the user who @andravin mentioned.

Here is what I did:

moshin@pc$ python -V
Python 3.8.10
moshin@pc$ ls
testwandb.py
moshin@pc$ cat testwandb.py 
import wandb

wandb.init()
moshin@pc$ python testwandb.py 
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[pc:1623678] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
moshin@pc$ ls
testwandb.py

It seems the process aborts before the wandb folder is created.

andravin · December 5, 2022, 3:14am

We determined the error occurs on every node in the cluster, but only when the user session is created with SLURM. A typical command-line to create the SLURM interactive session is srun --pty bash.

The error is always reproducible simply by importing wandb:

$ python3 -c 'import wandb'
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node-name:1431860] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

If the user connects to the node with ssh, then import wandb works as expected.
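Since the failure depends on how the session was created, one way to narrow it down is to compare the launcher-related environment variables in the two kinds of session. The following is a minimal sketch, not a confirmed diagnosis: the prefix list is an assumption (Slurm and common MPI launchers set variables with these prefixes, but the list is not exhaustive), and the helper names launcher_env and diff_env are made up for illustration.

```python
import json
import os

# Assumed, non-exhaustive list of prefixes used by Slurm and common MPI launchers.
PREFIXES = ("SLURM_", "OMPI_", "PMI_", "PMIX_", "I_MPI_")


def launcher_env(environ=None):
    """Return the environment variables whose names start with one of PREFIXES."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith(PREFIXES)}


def diff_env(a, b):
    """Map each variable that differs between two snapshots to its (a, b) values."""
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }


if __name__ == "__main__":
    # Print a snapshot of the current session as JSON so it can be saved and compared.
    print(json.dumps(launcher_env(), indent=2, sort_keys=True))
```

Running the script once in an ssh session and once inside srun --pty bash, then passing the two loaded snapshots to diff_env, lists the variables that exist only under SLURM. Running it as both the affected user and a working user would show whether the difference is user-specific.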

andravin · December 5, 2022, 5:31am

Also, a different user with the same exact python packages installed via pip does not have this issue.

ramit_goolry · December 6, 2022, 4:49am

Hi @andravin!

Any chance you can try mpirun -np [NUM] instead of srun here? This looks to be an issue with srun on your Slurm cluster specifically rather than with wandb. I would check your installation of srun/mpirun to make sure everything is running as expected there.

Have you tried importing any other libraries to see whether they produce the same error?
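To see which import is involved, one option is to run the import under Python's verbose flag (-v) in a subprocess and look at the modules that finished loading before the process exited. This is only a sketch: the helper name imported_modules is made up for illustration, and the module that triggers MPI initialization is probably the one being loaded right after the last name printed, not the last name itself.

```python
import re
import subprocess
import sys


def imported_modules(module, python=sys.executable):
    """Run `python -v -c "import <module>"` and return (exit code, modules imported in order)."""
    proc = subprocess.run(
        [python, "-v", "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    # With -v, the interpreter writes a line like "import 'name' # <loader>" to
    # stderr each time a module finishes importing.
    names = re.findall(r"^import '([^']+)'", proc.stderr, flags=re.MULTILINE)
    return proc.returncode, names


if __name__ == "__main__":
    code, names = imported_modules("wandb")
    print("exit code:", code)
    print("last modules that finished importing:", names[-10:])
```

Comparing the tail of this list between the srun session and an ssh session, or between the affected user and a working user, may show whether an MPI-related package is being picked up in only one of them.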

Thanks,
Ramit

ramit_goolry · December 13, 2022, 1:59pm

Hi!

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

artsiom · December 14, 2022, 4:05pm

Hi Andrew,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

andravin · December 16, 2022, 12:02am

@ramit_goolry Thanks for the suggestion, our IT department will look into our SLURM and MPI configuration. It would appear that the issue is specific to the one particular user who sees the error, because WandB works fine for everyone else. So I would think it has to be a user-ID inconsistency or a permissions problem.

artsiom · December 16, 2022, 5:04pm

Sounds good! Let us know if you guys have any more concerns related to wandb!

system · February 14, 2023, 12:03am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
