Using wandb sweep with torch.distributed.launch

raeh · May 23, 2022, 12:30pm

Hello

I am using wandb sweep to perform hyperparameter tuning.

Basically, when I launch a wandb agent with “wandb agent <USERNAME/PROJECTNAME/SWEEPID>”, it automatically runs “/usr/bin/env python train.py --param1=value1 --param2=value2” according to the sweep configuration.

However, my code is based on torch DistributedDataParallel, so it has to be launched with python -m torch.distributed.launch train.py rather than just python train.py.

How can I tackle this problem?

Many thanks in advance!

ramit_goolry · May 24, 2022, 9:56pm

Hi @raeh!

Thanks for writing in. You can change the command that the agent runs by specifying the command structure in your sweep config. Specifically, you can change the interpreter variable to switch to torch.distributed.launch. Here is a link to our docs regarding how this can be done.

Please let me know if I can be of further assistance.

Thanks,
Ramit
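
For context, the default command from the first post corresponds, as far as the W&B sweep documentation describes it, to the command section sketched below. The macro meanings in the comments are paraphrased from those docs rather than taken from this thread, so treat them as a reading aid.

```yaml
# Default command structure for a sweep, written out explicitly.
command:
    - ${env}          # /usr/bin/env on Unix-like systems
    - ${interpreter}  # expands to "python"
    - ${program}      # the script named under "program:" in the sweep config
    - ${args}         # hyperparameters as --param1=value1 --param2=value2
```

Extra entries in this list are passed through as literal command-line tokens, so a launcher module can be placed between the interpreter and the program.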

raeh · May 25, 2022, 2:26am

Thanks Ramit

I have followed what you suggested but I am still unable to run with torch.distributed.launch.

Below is my configuration yaml file.

method: random
program: rae_wandb.py
metric:
    name: total_mean_rank_sum
    goal: minimize
command:
    - ${env}
    - torch.distributed.launch
    - ${program}
    - ${args}
#command:
#- python raw_wandb.py
#- python -m torch.distributed.launch --nproc_per_node=4 rae_wandb.py -m torch.distributed.launch --nproc_per_node=4
parameters:
    lr:
        min: 0.0
        max: 0.01
    coef_lr:
        min: 0.0
        max: 0.01
    sim_header:
        values: ["meanP", "seqLSTM", "seqTransf"]

When I launch an agent, it runs /usr/bin/env torch.distributed.launch rae_wandb.py --coef_lr=0.0068455254534794605 --lr=0.008759887226936639 --sim_header=seqTransf

What I really need is /usr/bin/env python -m torch.distributed.launch rae_wandb.py --coef_lr=0.0068455254534794605 --lr=0.008759887226936639 --sim_header=seqTransf

I would like to find some descriptive examples.

Many thanks!

ramit_goolry · May 25, 2022, 4:13pm

Hey @raeh,

The following should work in this case then:

command:
    - ${env}
    - ${interpreter}
    - "-m"
    - "torch.distributed.launch"
    - ${program}
    - ${args}

Please let me know if this solves the issue for you.

Thanks,
Ramit
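
Putting the pieces together, a complete sweep config for this case might look like the sketch below. It reuses the program, metric, and parameters from the config posted earlier in the thread. The --nproc_per_node=4 entry is an assumption taken from the commented-out line in that config, and should match the number of GPUs on the machine.

```yaml
method: random
program: rae_wandb.py
metric:
    name: total_mean_rank_sum
    goal: minimize
command:
    - ${env}
    - ${interpreter}
    - "-m"
    - "torch.distributed.launch"
    - "--nproc_per_node=4"   # assumption: 4 processes, one per GPU
    - ${program}
    - ${args}
parameters:
    lr:
        min: 0.0
        max: 0.01
    coef_lr:
        min: 0.0
        max: 0.01
    sim_header:
        values: ["meanP", "seqLSTM", "seqTransf"]
```

With this config the agent should run something like /usr/bin/env python -m torch.distributed.launch --nproc_per_node=4 rae_wandb.py --coef_lr=... --lr=... --sim_header=.... Launcher flags go before ${program} because everything after the script name is handed to the script itself. Note that torch.distributed.launch by default also passes a --local_rank argument to the script, so rae_wandb.py would need to accept it.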

ramit_goolry · June 2, 2022, 5:50pm

Hi Ray,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

ramit_goolry · June 8, 2022, 8:04pm

Hi Ray, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

system · July 24, 2022, 4:13pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
