Finish() is going into loop in distributed setting

malla-bhavana26 · January 29, 2024, 3:00pm

Hi,

wandb version : 0.16.2
OS: Linux-5.4.0-131-generic-x86_64-with-glibc2.17
Python version: 3.8.18

I am training a Hugging Face model in a distributed setting, using 2 GPUs on a single node, and tracking my experiments with wandb. The metrics, images, and tables I log during training are uploaded to the wandb server and are visible in the UI. At the end of training I call run.finish(), as mentioned in the docs, but the call never returns: it has been stuck in that state for more than 2 hours. To check whether data size was the issue, I am training on a very small subset of the dataset (6 images in total, 3 per batch), and the same issue persists. Please guide me on how to solve this, as it is stalling my experiments.

Please note that run.finish() works in single-GPU training. The problem occurs in the DDP setting.

Looking forward to your reply.

Thanks in advance!

uma-wandb · February 1, 2024, 5:07pm

Hey @malla-bhavana26 - a few questions to help me dig into this:

1. Are you able to access the debug logs from this run? They should be located in the wandb folder in the same directory where the script was run. The wandb folder contains one folder per run, named run-DATETIME-ID.
2. Are you uploading any large artifacts anywhere? If you could provide details on this, and on whether or not you're using an external bucket, it would help me dig into this further.
3. Are you running this in a notebook or a Python script?
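
For anyone looking for the debug logs mentioned above, here is a minimal sketch for locating them. It assumes the run-DATETIME-ID layout described in the question, that the DATETIME part sorts chronologically by name, and that each run folder keeps its logs in a `logs` subfolder with names starting with `debug`. The function name `latest_run_debug_logs` is made up for this example.

```python
from pathlib import Path


def latest_run_debug_logs(wandb_dir="wandb"):
    """Return the debug log files of the most recent run folder, or [] if none.

    Assumes run folders are named run-DATETIME-ID, so sorting by name
    puts the newest run last.
    """
    root = Path(wandb_dir)
    # Only real run folders; other entries in the wandb folder are skipped.
    run_dirs = sorted(p for p in root.glob("run-*") if p.is_dir())
    if not run_dirs:
        return []
    latest = run_dirs[-1]
    # Assumed layout: <run folder>/logs/debug*.log
    return sorted((latest / "logs").glob("debug*.log"))


if __name__ == "__main__":
    for log_file in latest_run_debug_logs():
        print(log_file)
```

Run it from the directory where the training script was started, so that the relative `wandb` path resolves to the right folder.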

malla-bhavana26 · February 1, 2024, 6:43pm

Hey @uma-wandb ,

Thanks for your response. I am running it from a Python script, and the artifacts are not really big. I was calling run.finish() at the end of my training loop, which could be the reason behind this never-ending loop. I was able to solve it by calling wandb.init() before spawning the processes and calling run.finish() in the same main function once all the training is done.
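
The ordering described above can be sketched roughly as follows. This is not the original poster's code, only an illustration of the pattern: one run is created in the main function before any worker starts, and the same function finishes it after the workers return. The names `train_with_single_run`, `init_run`, and `launch_workers` are hypothetical; the two callables are passed in so the sketch does not depend on a particular launcher.

```python
def train_with_single_run(init_run, launch_workers):
    """Create one run, launch the workers, then finish the run in the same place.

    init_run:       callable returning an object with a finish() method
                    (assumed here to wrap wandb.init).
    launch_workers: callable that starts the worker processes and blocks
                    until all of them have exited.
    """
    # 1. Start the run in the parent, before any process is spawned.
    run = init_run()
    try:
        # 2. Spawn the workers and wait for training to complete.
        launch_workers()
    finally:
        # 3. Finish in the same function that created the run, so finish()
        #    is called exactly once, even if training raised an error.
        run.finish()
```

In a real script, `init_run` might be `lambda: wandb.init(project="my-project")` (the project name is a placeholder) and `launch_workers` might wrap `torch.multiprocessing.spawn(train_fn, nprocs=2)`, where `train_fn` is your own per-GPU training function and no longer calls finish() itself.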

Best Regards

uma-wandb · February 1, 2024, 9:46pm

@malla-bhavana26 - great to hear you were able to solve this. please feel free to write back in anytime!
