But this error persists; even the simplest wandb command yields an MPI_Init_thread error:
$ wandb --version
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[xhostnamex:1359609] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
On the same machine, my wandb installation works fine:
$ wandb --version
wandb, version 0.13.5
Apologies if this is a duplicate, but I could not find this exact problem documented anywhere.
Is this the full stack trace for the error? If so, please run wandb.init() in a Python file and share the full error trace from there, along with the debug.log and debug-internal.log files from the wandb folder in the directory where you ran this process.
moshin@pc$ python -V
Python 3.8.10
moshin@pc$ ls
testwandb.py
moshin@pc$ cat testwandb.py
import wandb
wandb.init()
moshin@pc$ python testwandb.py
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[pc:1623678] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
moshin@pc$ ls
testwandb.py
It seems the process aborts before the wandb folder is even created.
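Since the abort happens during import, one way to narrow down which submodule triggers it is Python's `-X importtime` flag, which prints each module as it is imported. This is only a sketch: `importtime.log` is an arbitrary scratch filename I chose, not a file wandb produces, and the last entries written before the MPI abort would point at the likely culprit.

```shell
# Sketch: trace imports up to the abort. The importtime trace goes to stderr,
# so capture it; the last module names logged before the MPI error are the
# likely culprits. "importtime.log" is just a scratch file.
python3 -X importtime -c 'import wandb' 2> importtime.log || true
tail -n 5 importtime.log
```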
We determined that the error occurs on every node in the cluster, but only when the user session is created with SLURM. A typical command line to create a SLURM interactive session is srun --pty bash.
The error is always reproducible simply by import wandb:
$ python3 -c 'import wandb'
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node-name:1431860] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
If the user connects to the node with ssh, then import wandb works as expected.
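The srun-vs-ssh difference suggests (this is a guess, not confirmed) that SLURM's PMI integration sets environment variables under srun that are absent over ssh, and that some MPI-linked library auto-initializes when it sees them. A quick sketch to compare the two session types:

```python
import os

# Print any environment variables that SLURM / Open MPI / PMI typically set.
# Run this once inside an `srun --pty bash` session and once over ssh, then
# diff the output; variables present only under srun are candidates.
prefixes = ("SLURM_", "OMPI_", "PMI_", "PMIX_")
mpi_vars = {k: v for k, v in os.environ.items() if k.startswith(prefixes)}
for name in sorted(mpi_vars):
    print(f"{name}={mpi_vars[name]}")
```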
Any chance you can try mpirun -np [NUM] instead of srun here? This looks to be an issue with srun on your SLURM cluster specifically and not with wandb; I would check your installation of srun/mpirun to make sure everything is running as expected over there.
Have you tried importing any other libraries through which you also see the same error?
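To separate "wandb-specific" from "anything that touches MPI", each candidate library can be imported in a fresh interpreter. This is a sketch using stdlib modules as placeholders; the interesting substitutions would be MPI-adjacent packages such as mpi4py, if installed.

```shell
# Import each module in a fresh interpreter; any module that triggers the
# same MPI_Init_thread abort narrows the problem beyond wandb itself.
for mod in json ssl socket; do
    python3 -c "import $mod" && echo "$mod OK"
done
```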
We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.
@ramit_goolry Thanks for the suggestion; our IT department will look into our SLURM and MPI configuration. The issue appears to be specific to the one user who sees the error, since wandb works fine for everyone else, so I would suspect a user-ID inconsistency or a permissions problem.