My model runs on multiple nodes/GPUs using the “ddp” strategy in PyTorch Lightning. I log my runs to wandb via WandbLogger in Lightning. Is it possible to use wandb sweeps in this setup? From the docs I got the impression that it is not, but I hope I am mistaken.
Right now I do wandb.init() only on the rank zero worker. I guess with sweeps this would not be possible, because all the workers would need to pick up the hyperparameters from wandb.config? Also there is the issue of running the agent from multiple nodes - will they step on each other’s toes?
Hi @evgeny-tanhilevich, thanks for writing in, happy to help. We have relatively few examples of multi-GPU training, because our examples are mostly backed by Colab, which only offers a single GPU. If you’re comfortable submitting metrics only from the rank 0 process, this shouldn’t be any harder on the wandb side than running a sweep with single-node/single-GPU agents: wandb does not need to know about the other processes driving the other GPUs. That approach is Method 1 here. To get this working in the best way possible, you would need to set up some form of synchronized interprocess communication channel, so that rank 0 can share the wandb.config with all your other processes and they can send the data to be logged back to the rank 0 process.
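As a rough illustration of sharing the config from rank 0, here is a minimal sketch. It assumes the processes are started by a launcher that sets the RANK environment variable (torchrun does) and that the torch.distributed process group is already initialized when broadcast_config is called. The helper names get_rank, fetch_sweep_config and broadcast_config are made up for this example and are not part of wandb or Lightning.

```python
import os


def get_rank(env=None) -> int:
    """Read the global rank; assumes the launcher sets RANK (torchrun does)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def fetch_sweep_config(rank: int) -> dict:
    """Only rank 0 talks to wandb; other ranks return an empty placeholder."""
    if rank != 0:
        return {}
    import wandb  # imported lazily so non-zero ranks never touch wandb

    run = wandb.init()  # under `wandb agent`, this run receives the sweep's hyperparameters
    return run.config.as_dict()


def broadcast_config(config: dict, src: int = 0) -> dict:
    """Send rank `src`'s config dict to every process in the default group.

    Falls back to returning `config` unchanged when torch is missing or no
    process group has been initialized (single-process runs).
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return config
    if not (dist.is_available() and dist.is_initialized()):
        return config
    holder = [config]
    # With the NCCL backend, each rank should set its CUDA device before this call.
    dist.broadcast_object_list(holder, src=src)
    return holder[0]
```

Each process would then call something like config = broadcast_config(fetch_sweep_config(get_rank())) before building the model, so every rank sees the same hyperparameters while only rank 0 owns the wandb run. How this fits with the way Lightning's “ddp” strategy spawns its workers is something you would need to check for your setup.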
You can additionally look into parallelizing sweep agents: you can specify which CUDA-enabled GPU each sweep agent runs on and keep everything separate.
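A minimal sketch of that idea, assuming one agent per GPU on a single machine: each agent is started with CUDA_VISIBLE_DEVICES set so that it only sees its own device. The helper names and the sweep path in the usage comment are placeholders for illustration. Note that each trial then trains on one GPU, so this is an alternative to DDP rather than a way of combining the two.

```python
import os
import subprocess


def agent_env(gpu_index: int, base_env=None) -> dict:
    """Copy the environment and restrict CUDA to a single device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def agent_command(sweep_path: str) -> list:
    """Build the `wandb agent` command line for a sweep."""
    return ["wandb", "agent", sweep_path]


def launch_agents(sweep_path: str, gpu_indices) -> list:
    """Start one sweep agent per GPU index and return the process handles."""
    return [
        subprocess.Popen(agent_command(sweep_path), env=agent_env(i))
        for i in gpu_indices
    ]


# Usage (placeholder sweep path; replace with your own entity/project/sweep id):
# procs = launch_agents("my-entity/my-project/abc123", [0, 1])
# for p in procs:
#     p.wait()
```

All agents pull trials from the same sweep, so the sweep controller hands each one a different set of hyperparameters.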
Hi @evgeny-tanhilevich, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!