Will multiple runs in the same folder <del>sync</del> resume properly?

Edit: I don't think I expressed my problem correctly. My concern was this: if there are multiple runs in the same directory and some of them crash, can wandb resume them automatically if I pass resume=True to wandb.init?

The answer is no, apparently. I think either controlled resuming or running from different working directories is mandatory in this case.

Hi, I wonder if wandb can sync properly if I start multiple runs simultaneously in the same project root?

I was using wandb with only 1 GPU and it worked splendidly. Now I want to use the same codebase on a machine with 2 GPUs. I have already started a run with CUDA_VISIBLE_DEVICES=0, and I want to start another run with CUDA_VISIBLE_DEVICES=1 in a new shell session, but in the same directory as the first run. I noticed that the wandb/ directory in the project root seems to track only the latest run (there is a symlink called latest-run). My question is: if I start another run in the same directory while the first one is running, will wandb mess it up? If it does, is cloning the codebase to another path and running it there my best option? Or, if wandb can handle this situation properly, are there any caveats I should be aware of?

Thanks for reading through, any help would be greatly appreciated.

Hi @blurgy,

If you have multiple runs and some of them crashed, wandb cannot automatically resume them from resume=True alone. Resuming a specific run also requires the id parameter, which is the 8-character alphanumeric ID given to every run. It must be specified so that wandb knows which run to resume.

As a result, you will not be able to automatically resume these runs just by setting resume=True.
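For illustration, here is a minimal sketch of one way to do this kind of controlled resuming when several runs share a working directory. It is not an official recipe. The helper names (load_or_create_run_id, init_resumable_run), the run_ids/ folder, and the per-run label such as "gpu0" are all hypothetical, and the ID is generated locally as 8 lowercase alphanumeric characters on the assumption that this matches the usual run ID format. The only wandb call used is wandb.init with its project, id, and resume arguments.

```python
import secrets
import string
from pathlib import Path


def load_or_create_run_id(id_file: Path) -> str:
    """Return the run ID stored in id_file, creating and saving one on first use."""
    if id_file.exists():
        return id_file.read_text().strip()
    # Assumption: an 8-character lowercase alphanumeric string is an acceptable run ID.
    alphabet = string.ascii_lowercase + string.digits
    run_id = "".join(secrets.choice(alphabet) for _ in range(8))
    id_file.parent.mkdir(parents=True, exist_ok=True)
    id_file.write_text(run_id)
    return run_id


def init_resumable_run(name: str, project: str):
    """Start the run labelled `name`, or resume it if it was started before."""
    import wandb  # imported here so the helper above can be used without wandb

    # Hypothetical layout: one small ID file per logical run, e.g. run_ids/gpu0.txt.
    run_id = load_or_create_run_id(Path("run_ids") / f"{name}.txt")
    # resume="allow" resumes the run with this id if it exists, else starts a new one.
    return wandb.init(project=project, id=run_id, resume="allow")
```

Each shell session would then call something like init_resumable_run("gpu0", "my-project") or init_resumable_run("gpu1", "my-project"), where the project name is a placeholder. Because every run keeps its own ID file, restarting a crashed process with the same label picks up the same run, regardless of which run wrote to the wandb/ directory last.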

