PyTorch Lightning Fabric Sweeps

I’m trying to create a sweep using wandb agent, which runs a Python script that uses PyTorch Lightning Fabric and the WandbLogger object from Lightning. The script uses FSDP for distributed compute.

When I run this, I get “wandb.sdk.lib.mailbox.MailboxError: transport failed”.

If I run “python train.py” instead of using the wandb agent, it runs fine. If I drop devices down to 1, it runs fine.

Any help anyone could give would be truly fantastic.

Hello @william_gazeley !

Based on what you are describing, it looks like multiple runs of the sweep are interacting with each other by using the same Python thread and then shutting down. Our wandb agent will run multiple runs at the same time, and depending on how you define your distributed compute, this could lead to that error.

We have some guidance here on how to deal with distributed training and how best to define it within the confines of wandb. Your setup may work well with one script, but it could lead to overlaps between multiple runs, so I recommend following the setup instructions from the linked docs.
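One common pattern for distributed training is to let only the rank-0 process create the wandb run, so the other worker processes never open their own connection. Below is a minimal sketch of that idea, not a confirmed fix for this error. The helper names `global_rank` and `init_wandb_on_rank_zero` are hypothetical, and the sketch assumes a single-node job whose launcher exports `RANK` or `LOCAL_RANK` for each worker process.

```python
import os


def global_rank(env=None):
    """Best-effort rank lookup.

    Assumption: the launcher exports RANK or LOCAL_RANK for each worker.
    If neither is set, the process is treated as rank 0.
    """
    env = os.environ if env is None else env
    for key in ("RANK", "LOCAL_RANK"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return 0


def init_wandb_on_rank_zero(env=None, **init_kwargs):
    """Call wandb.init only on rank 0; return None on every other rank."""
    if global_rank(env) != 0:
        return None
    import wandb  # imported lazily so non-zero ranks never touch wandb

    return wandb.init(**init_kwargs)
```

In a training script this would be called once near the top, for example `run = init_wandb_on_rank_zero(project="my-project")` (the project name is a placeholder), and any logging would be guarded with `if run is not None:`. If you are using Fabric, checking `fabric.global_rank == 0` after launch is an alternative to reading the environment variables directly.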

Hi William-Gazeley, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!