Setting up YAML file for Sweeps

I have the following YAML file:

program: train.py
name: 'sweep 1'
method: bayes 
metric:
  goal: minimize
  name: loss
parameters:
  batch_size:
    values: [128]
  learning_rate: 
    values: [0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095, 0.1] 
  optim: 
    values: ["Adam", "Adamax", "AdamW", "SGD", "RMSprop", "Adagrad"]
  epochs: 
    values: [50]
  loss: 
    values: ['L1', 'MSE', 'BCE', 'CrossEntropy']
  accuracy:
    values:  ['1-L1_loss', '1-MSE_loss', '1-BCE_loss', '1-CrossEntropy_loss']
  activation: 
    values: ['ReLU', 'Sigmoid', 'Tanh', 'LeakyReLU']
early_terminate:
  type: hyperband
  min_iter: 3
command:
- ${env}
- /my python executable path/
- script.py
- ${args}

I followed the documentation in the sweep docs to the best of my ability. I would like to start a discussion to better understand how sweeps uses this and, more importantly, to make sure this works properly for my project.

My first question is whether the first section of my configuration file is correct. Second, is my environment variables section correct?

Thanks in advance :slight_smile:

Hi @kishimita, thanks for writing in! Your sweep config looks correct. Are you having any issues with it? The command looks good as well, see here
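For context on how the agent uses that command section: when it launches a run, it expands the ${env} and ${args} macros, so with the placeholders from your config the launched command would look roughly like this (one --key=value flag per hyperparameter chosen for that run; ${env} resolves to /usr/bin/env on Linux/macOS):

/usr/bin/env /my python executable path/ script.py --batch_size=128 --learning_rate=0.01 --optim=Adam ...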

Hi @kishimita,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Sorry for the late reply, I accidentally deleted the email notification. Could I provide you with my train function for more context?
I first started using wandb.init and used it to log my runs, and the training speed was as expected. Now, for some reason, when I use sweeps it takes a day to complete 50 epochs.

Hi @kishimita! Yes, if you could share a code example I’ll be happy to take a look and test it to see what’s going on here

# Login to wandb
wandb.login()
#create wandb sweep id 
with open("pathto_config", 'r') as stream:
    sweep_config = yaml.safe_load(stream)
sweep_id = wandb.sweep(sweep=sweep_config, entity="kishimita", project="Simple-Unet-Training", prior_runs=["run-1"])

print("Sweep config: ", sweep_config)   
def get_optimizer(optimizer_name, model, learning_rate):
    #optimizer_name = optimizer_name.strip()  # Remove any leading/trailing white spaces
    if optimizer_name == "Adam":
      return torch.optim.Adam(model.parameters(), lr=learning_rate)
    elif optimizer_name == "SGD":
      return torch.optim.SGD(model.parameters(), lr=learning_rate)
    elif optimizer_name == "AdamW":
      return torch.optim.AdamW(model.parameters(), lr=learning_rate)
    elif optimizer_name == "Adamax": 
      return torch.optim.Adamax(model.parameters(), lr=learning_rate)
    elif optimizer_name == "RMSprop":
      return torch.optim.RMSprop(model.parameters(), lr=learning_rate)
    elif optimizer_name == "Adagrad":
      return torch.optim.Adagrad(model.parameters(), lr=learning_rate)
    else:
      raise ValueError(f"Unknown optimizer: {optimizer_name}")

def train():
    global device
    config = sweep_config["parameters"]
    model.to(device)
    count = 0 
    optimizer = get_optimizer(config["optim"]['values'][count], model, config["learning_rate"]['values'][count])
    lr = config["learning_rate"]['values'][count]
    epochs = config["epochs"]['values'][count]
    print(len("------------------------------------------------------------------------------------------------------------"))
    run = wandb.init(project="Simple-Unet-Training",
                    config={
                    "learning_rate": lr,
                    "architecture": "Simple Unet",
                    "dataset": "military planes",
                    "epochs": epochs,
                    "optimizer": optimizer,
                    "loss": "L1",
                    "metric": "L1",
                    "framework": "PyTorch",
                    "device": DEVICE,
                    "torch_seed" : seed
                    },
                    name="genesis-run" + "-" +str(count+1),
                    save_code=False,)

    run.config.update(config)
    print("*~+~*"*22)
    print("\t\t\tThis is the start of training in mins: ", datetime.datetime.now())
    print("*~+~*"*22)
    memory_count = 0
    for epoch in range(epochs):
        epoch_start = datetime.datetime.now()
        print("--------------------------------------------------------------------------------------------------------------")
        print(f"\t\t\t\tThis is epoch : {epoch}'s start time: {epoch_start}")
        print("--------------------------------------------------------------------------------------------------------------\n")
        total_loss = 0
        total_accuracy = 0
        #print(f"Epoch :{epoch}")
        for step, batch in tqdm(enumerate(train_loader), desc= "Step Loop", ncols=100):
            optimizer.zero_grad()
            
            t = torch.randint(0, T, (BATCH_SIZE,), device=device).long()
            # Move the input data to the GPU
            batch_gpu = batch[0].to(device)
            loss = get_loss(model, batch_gpu, t)
            loss.backward()
            optimizer.step()
    
            # Calculate accuracy
            accuracy = accuracy_l1(model, batch_gpu, t)
            total_accuracy += accuracy.item()
            total_loss += loss.item()
        
        
        print(f"This is memory usage after inner loop ends time {memory_count}")
        memory_count += 1
        print_memory_usage()
        # Select the first image from the batch
        input_image = batch_gpu[0]
        output_image = model(input_image.unsqueeze(0), t)[0]


        #log input and output image in the same log 
        wandb.log({"Input Image": wandb.Image(input_image.detach().cpu(), caption="Input Image-" + str(count))
                   ,"Output Image": wandb.Image(output_image.detach().cpu(), caption="Output Image-" + str(count))})
        del batch_gpu
        if epoch % 5 == 0 and step == 0:
            print(f"Epoch {epoch} | step {step:03d} Loss: {loss.item()} ")
            #sample_plot_image()
        wandb.log({"Lr": lr})
        wandb.log({"epoch": epoch})
        wandb.log({"Loss": total_loss/len(train_loader)})
        wandb.log({"Accuracy": total_accuracy/len(train_loader)})
        print(f"Total Epochs : {epochs}")
        print(f"Current Epoch : {epoch}")
        print(f"Optimizer : {optimizer}")
        print(f"Lr : {lr}")
        print(f"Loss : {loss.item()}")
        del loss  
        print(f"Accuracy : {accuracy.item()}")
        del accuracy
        print(f"Total Loss : {total_loss/len(train_loader)}")
        del total_loss
        print(f"Total Accuracy : {total_accuracy/len(train_loader)}")
        del total_accuracy
        epoch_end = datetime.datetime.now()
        print("--------------------------------------------------------------------------------------------------------------")
        print(f"\t\t\tThis is epoch :{epoch}'s end time: {epoch_end}")
        print("--------------------------------------------------------------------------------------------------------------\n")
    
    count += 1
    run.finish()
    print("*~+~*"*12)
    print(f"\t\t\t\tThis is the end of training in mins:  {datetime.datetime.now()}")
    print("*~+~*"*12)
    config.finish()


wandb.agent(sweep_id="shtf1crd", function=train, project="Simple-Unet-Training", entity="kishimita")

Here is the YAML config file:

program: train.py
name: 'sweep 1'
method: bayes 
metric:
  goal: minimize
  name: L1
parameters:
  batch_size:
    values: [128]
  learning_rate: 
    values: [0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095, 0.1] 
  optim: 
    values: ["Adam", "Adamax", "AdamW", "SGD", "RMSprop", "Adagrad"]
  epochs: 
    values: [100, 150, 200, 250, 300, 350]
  loss: 
    values: ['L1', 'MSE', 'BCE', 'CrossEntropy']
  accuracy:
    values:  ['1-L1_loss', '1-MSE_loss', '1-BCE_loss', '1-CrossEntropy_loss']
  activation: 
    values: ['ReLU', 'Sigmoid', 'Tanh', 'LeakyReLU']
early_terminate:
  type: hyperband
  min_iter: 3
command:
- ${env}
- path to python executable
- CUDA_VISIBLE_DEVICES = 1
- train.py
- ${args}

Thank you for your help :slight_smile:

Thanks for sharing this @kishimita! In your train() function, you’re accessing hyperparameters directly from sweep_config["parameters"] and indexing into their 'values' arrays, which isn’t the recommended way. When you start a sweep, wandb creates individual runs, and each run is assigned its own set of hyperparameters based on your sweep configuration. Those hyperparameters are accessible via wandb.config inside the train() function, so you should use that to read the values. The same goes for the init call: the hyperparameters are automatically picked up from the sweep configuration, so you don’t need to pass them in yourself. The train() function should look like this:

def train():
    global device
    # Start a new W&B run
    run = wandb.init()
    config = wandb.config

    # Initialize model and optimizer with current hyperparameters
    model.to(device)
    optimizer = get_optimizer(config.optim, model, config.learning_rate)
    epochs = config.epochs

    # Update run config with fixed parameters
    run.config.update({
        "architecture": "Simple Unet",
        "dataset": "military planes",
        "loss": "L1",
        "metric": "L1",
        "framework": "PyTorch",
        "device": DEVICE,
        "torch_seed": seed
    }, allow_val_change=True)

    print(f"Starting training with config: {config}")

    for epoch in range(epochs):
        # ... training loop ...

    run.finish()

And I would recommend passing the sweep config as a dict inside the same file, as explained here, or using the CLI commands for everything (wandb sweep and wandb agent).
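For reference, a minimal sketch of the dict-in-file approach could look something like this (the entity/project names are taken from your snippet, the parameter lists are shortened, and train is the function above):

import wandb

# Sweep configuration as a Python dict (same content as the YAML file, shortened here)
sweep_configuration = {
    "name": "sweep 1",
    "method": "bayes",
    "metric": {"goal": "minimize", "name": "L1"},
    "parameters": {
        "batch_size": {"values": [128]},
        "learning_rate": {"values": [0.01, 0.015, 0.02]},  # shortened for brevity
        "optim": {"values": ["Adam", "AdamW", "SGD"]},
        "epochs": {"values": [100, 150, 200]},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

# Create the sweep, then run an agent that calls train() once per run
sweep_id = wandb.sweep(sweep=sweep_configuration, entity="kishimita", project="Simple-Unet-Training")
wandb.agent(sweep_id, function=train, count=20)  # count caps how many runs this agent executes

If you prefer the CLI route instead, you would keep the YAML file, run wandb sweep path/to/config.yaml, and then start an agent with the wandb agent entity/project/sweep_id command it prints.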


Luis, thank you for the explanation! I'm going to try this Thursday and give you an update!

Hey @kishimita, just wanted to follow up here to see if the information I provided was helpful?

It was helpful. I wasn't able to work on it today, and I won't get to it this week; it's looking like a next Tuesday thing.

Thanks for the update! Please let me know how it goes

Hey @kishimita, just wanted to check if you had the chance to take a look at the resources I shared?

Hi @kishimita, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Luis, I'm not sure if it's the same issue that's causing my new problem, but it's related. I ran a sweep using your suggestion, but the output looked like this:

How come I can't find a tab or somewhere I can click to see where all the runs went? Maybe my code is messed up somehow.

@luis_bergua ^ Not sure if you would've gotten a notification; if you did, sorry for the extra notification.