run.finish() hangs indefinitely in a distributed setting

Hi,

wandb version : 0.16.2
OS: Linux-5.4.0-131-generic-x86_64-with-glibc2.17
Python version: 3.8.18

I am training a Hugging Face model in a distributed setting, using 2 GPUs on a single node, and tracking my experiments with wandb. The metrics, images, and tables that I log during training are uploaded to the wandb server and are visible in the UI. At the end of training, I call run.finish() as described in the docs. However, the call never returns: it appears to loop indefinitely and has been in that state for more than 2 hours. To check whether data size was the cause, I am training on a very small subset of the dataset (6 images in total, 3 per batch), but the same issue persists. Please advise on how to solve this, as it is stalling my experiments.

Please note that run.finish() works in single-GPU training. The problem occurs in the DDP setting.

Looking forward to your reply.

Thanks in advance!

hey @malla-bhavana26 - a few questions to help me dig into this:

  • are you able to access the debug logs from this run? they should be located in the wandb folder in the directory where the script was run. inside the wandb folder, each run has its own subfolder named run-DATETIME-ID
  • are you uploading any large artifacts anywhere? if you could share details about this, and whether or not you’re using an external bucket, it would help me dig into this further
  • are you running this in a notebook or a python script?
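
For anyone trying to locate those debug logs, here is a minimal sketch that lists the per-run folders and any log files inside them. The helper name find_run_logs is hypothetical, and the `*.log` pattern is an assumption about how the log files are named; only the wandb/run-DATETIME-ID layout comes from the description above.

```python
from pathlib import Path


def find_run_logs(wandb_dir="wandb"):
    """Map each run-* folder under wandb_dir to the .log files it contains.

    Assumption: log files end in ".log"; adjust the pattern if yours differ.
    """
    root = Path(wandb_dir)
    found = {}
    for run_dir in sorted(root.glob("run-*")):
        if run_dir.is_dir():
            # Store paths relative to the run folder to keep the output short.
            found[run_dir.name] = sorted(
                p.relative_to(run_dir).as_posix() for p in run_dir.rglob("*.log")
            )
    return found


def print_run_logs(wandb_dir="wandb"):
    """Print one line per run folder, followed by its log files."""
    for run_name, logs in find_run_logs(wandb_dir).items():
        print(run_name)
        for log in logs:
            print("   ", log)
```

Calling print_run_logs() from the directory where the training script was launched should show which folder belongs to the stuck run, so its logs can be attached to the thread.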

Hey @uma-wandb,

Thanks for your response. I am running it from a Python script, and the artifacts are not really big. I was calling run.finish() at the end of my training loop, which runs inside the spawned processes, and that could be the reason behind this never-ending loop. I was able to solve it by calling wandb.init() in the main function before spawning the processes, and calling run.finish() in that same main function once all the training is done.
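
In case it helps others, the structure described above can be sketched roughly as follows. This is an illustration under assumptions, not my actual training code: train_worker, log_until_done, and run_training are hypothetical names, the loss values are placeholders, and forwarding metrics to the parent through a queue is just one possible way to keep all logging in the process that owns the run. The `run` argument is expected to be whatever wandb.init() returns; the sketch only uses its log() and finish() methods.

```python
import multiprocessing as mp


def train_worker(rank, world_size, metrics_queue):
    """Hypothetical stand-in for one training process."""
    for step in range(3):
        # Placeholder number, not a real loss.
        loss = 1.0 / (1 + step + rank)
        metrics_queue.put({"rank": rank, "step": step, "loss": loss})
    # A None sentinel tells the parent that this worker is done.
    metrics_queue.put(None)
    # No finish() call here: the worker does not own the run.


def log_until_done(run, metrics_queue, world_size):
    """Forward worker metrics to the run until every worker sent its sentinel."""
    finished = 0
    logged = 0
    while finished < world_size:
        item = metrics_queue.get()
        if item is None:
            finished += 1
        else:
            run.log(item)
            logged += 1
    return logged


def run_training(run, world_size=2):
    """Spawn workers, collect their metrics, then close the run exactly once."""
    ctx = mp.get_context("spawn")
    metrics_queue = ctx.Queue()
    workers = [
        ctx.Process(target=train_worker, args=(rank, world_size, metrics_queue))
        for rank in range(world_size)
    ]
    for worker in workers:
        worker.start()
    try:
        log_until_done(run, metrics_queue, world_size)
        for worker in workers:
            worker.join()
    finally:
        # The run was created in the parent, so it is finished in the parent.
        run.finish()
```

With wandb, this would be called as run_training(wandb.init(project="my-project")) from under an `if __name__ == "__main__":` guard, which the spawn start method requires; the project name is a placeholder. Note that the sketch does not handle a worker crashing before it sends its sentinel, in which case the parent would wait forever.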

Best Regards

@malla-bhavana26 - great to hear you were able to solve this. please feel free to write back in anytime!