Setting up YAML file for Sweeps

I have the following YAML file:

program: train.py
name: 'sweep 1'
method: bayes 
metric:
  goal: minimize
  name: loss
parameters:
  batch_size:
    values: [128]
  learning_rate: 
    values: [0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095, 0.1] 
  optim: 
    values: ["Adam", "Adamax", "AdamW", "SGD", "RMSprop", "Adagrad"]
  epochs: 
    values: [50]
  loss: 
    values: ['L1', 'MSE', 'BCE', 'CrossEntropy']
  accuracy:
    values:  ['1-L1_loss', '1-MSE_loss', '1-BCE_loss', '1-CrossEntropy_loss']
  activation: 
    values: ['ReLU', 'Sigmoid', 'Tanh', 'LeakyReLU']
early_terminate:
  type: hyperband
  min_iter: 3
command:
- ${env}
- /my python executable path/
- script.py
- ${args}

I followed the documentation in the sweep docs to the best of my ability. I would like to start a discussion to better understand how sweeps uses this and, more importantly, to make sure this works properly for my project.

My first question is whether the first section of my configuration file is correct. Second, is my environment variables section correct?

Thanks in advance :slight_smile:

Hi @kishimita, thanks for writing in! Your sweep config looks correct. Are you having any issues with it? The command looks good as well, see here
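For context on how the agent uses that command section: when it launches a run, it expands the ${env} and ${args} macros, so with the placeholders from your config the launched command would look roughly like this (one --key=value flag per hyperparameter chosen for that run; ${env} resolves to /usr/bin/env on Linux/macOS):

/usr/bin/env /my python executable path/ script.py --batch_size=128 --learning_rate=0.01 --optim=Adam ...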

Hi @kishimita,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Sorry for the late reply, I accidentally deleted the email notification. Could I provide you with my train function for more context?
I first started using wandb.init and used it to log my runs, and the training speed was as expected. Now, for some reason, when I use sweeps it takes a day to complete 50 epochs.

Hi @kishimita! Yes, if you could share a code example I’ll be happy to take a look and test it to see what’s going on here

# Login to wandb
wandb.login()
#create wandb sweep id 
with open("pathto_config", 'r') as stream:
    sweep_config = yaml.safe_load(stream)
sweep_id = wandb.sweep(sweep=sweep_config, entity="kishimita", project="Simple-Unet-Training", prior_runs=["run-1"])

print("Sweep config: ", sweep_config)   
def get_optimizer(optimizer_name, model, learning_rate):
    #optimizer_name = optimizer_name.strip()  # Remove any leading/trailing white spaces
    if optimizer_name == "Adam":
      return torch.optim.Adam(model.parameters(), lr=learning_rate)
    elif optimizer_name == "SGD":
      return torch.optim.SGD(model.parameters(), lr=learning_rate)
    elif optimizer_name == "AdamW":
      return torch.optim.AdamW(model.parameters(), lr=learning_rate)
    elif optimizer_name == "Adamax": 
      return torch.optim.Adamax(model.parameters(), lr=learning_rate)
    elif optimizer_name == "RMSprop":
      return torch.optim.RMSprop(model.parameters(), lr=learning_rate)
    elif optimizer_name == "Adagrad":
      return torch.optim.Adagrad(model.parameters(), lr=learning_rate)
    else:
      raise ValueError(f"Unknown optimizer: {optimizer_name}")

def train():
    global device
    config = sweep_config["parameters"]
    model.to(device)
    count = 0 
    optimizer = get_optimizer(config["optim"]['values'][count], model, config["learning_rate"]['values'][count])
    lr = config["learning_rate"]['values'][count]
    epochs = config["epochs"]['values'][count]
    print(len("------------------------------------------------------------------------------------------------------------"))
    run = wandb.init(project="Simple-Unet-Training",
                    config={
                    "learning_rate": lr,
                    "architecture": "Simple Unet",
                    "dataset": "military planes",
                    "epochs": epochs,
                    "optimizer": optimizer,
                    "loss": "L1",
                    "metric": "L1",
                    "framework": "PyTorch",
                    "device": DEVICE,
                    "torch_seed" : seed
                    },
                    name="genesis-run" + "-" +str(count+1),
                    save_code=False,)

    run.config.update(config)
    print("*~+~*"*22)
    print("\t\t\tThis is the start of training in mins: ", datetime.datetime.now())
    print("*~+~*"*22)
    memory_count = 0
    for epoch in range(epochs):
        epoch_start = datetime.datetime.now()
        print("--------------------------------------------------------------------------------------------------------------")
        print(f"\t\t\t\tThis is epoch : {epoch}'s start time: {epoch_start}")
        print("--------------------------------------------------------------------------------------------------------------\n")
        total_loss = 0
        total_accuracy = 0
        #print(f"Epoch :{epoch}")
        for step, batch in tqdm(enumerate(train_loader), desc= "Step Loop", ncols=100):
            optimizer.zero_grad()
            
            t = torch.randint(0, T, (BATCH_SIZE,), device=device).long()
            # Move the input data to the GPU
            batch_gpu = batch[0].to(device)
            loss = get_loss(model, batch_gpu, t)
            loss.backward()
            optimizer.step()
    
            # Calculate accuracy
            accuracy = accuracy_l1(model, batch_gpu, t)
            total_accuracy += accuracy.item()
            total_loss += loss.item()
        
        
        print(f"This is memory usage after inner loop ends time {memory_count}")
        memory_count += 1
        print_memory_usage()
        # Select the first image from the batch
        input_image = batch_gpu[0]
        output_image = model(input_image.unsqueeze(0), t)[0]


        #log input and output image in the same log 
        wandb.log({"Input Image": wandb.Image(input_image.detach().cpu(), caption="Input Image-" + str(count))
                   ,"Output Image": wandb.Image(output_image.detach().cpu(), caption="Output Image-" + str(count))})
        del batch_gpu
        if epoch % 5 == 0 and step == 0:
            print(f"Epoch {epoch} | step {step:03d} Loss: {loss.item()} ")
            #sample_plot_image()
        wandb.log({"Lr": lr})
        wandb.log({"epoch": epoch})
        wandb.log({"Loss": total_loss/len(train_loader)})
        wandb.log({"Accuracy": total_accuracy/len(train_loader)})
        print(f"Total Epochs : {epochs}")
        print(f"Current Epoch : {epoch}")
        print(f"Optimizer : {optimizer}")
        print(f"Lr : {lr}")
        print(f"Loss : {loss.item()}")
        del loss  
        print(f"Accuracy : {accuracy.item()}")
        del accuracy
        print(f"Total Loss : {total_loss/len(train_loader)}")
        del total_loss
        print(f"Total Accuracy : {total_accuracy/len(train_loader)}")
        del total_accuracy
        epoch_end = datetime.datetime.now()
        print("--------------------------------------------------------------------------------------------------------------")
        print(f"\t\t\tThis is epoch :{epoch}'s end time: {epoch_end}")
        print("--------------------------------------------------------------------------------------------------------------\n")
    
    count += 1
    run.finish()
    print("*~+~*"*12)
    print(f"\t\t\t\tThis is the end of training in mins:  {datetime.datetime.now()}")
    print("*~+~*"*12)
    config.finish()


wandb.agent(sweep_id="shtf1crd", function=train, project="Simple-Unet-Training", entity="kishimita")

Here is the YAML config file:

program: train.py
name: 'sweep 1'
method: bayes 
metric:
  goal: minimize
  name: L1
parameters:
  batch_size:
    values: [128]
  learning_rate: 
    values: [0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.055, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095, 0.1] 
  optim: 
    values: ["Adam", "Adamax", "AdamW", "SGD", "RMSprop", "Adagrad"]
  epochs: 
    values: [100, 150, 200, 250, 300, 350]
  loss: 
    values: ['L1', 'MSE', 'BCE', 'CrossEntropy']
  accuracy:
    values:  ['1-L1_loss', '1-MSE_loss', '1-BCE_loss', '1-CrossEntropy_loss']
  activation: 
    values: ['ReLU', 'Sigmoid', 'Tanh', 'LeakyReLU']
early_terminate:
  type: hyperband
  min_iter: 3
command:
- ${env}
- path to python executable
- CUDA_VISIBLE_DEVICES = 1
- train.py
- ${args}

Thank you for your help :slight_smile:

Thanks for sharing this @kishimita! In your train() function, you’re accessing hyperparameters directly from sweep_config["parameters"] and indexing into their 'values' arrays, which isn’t the recommended way. When you start a sweep, wandb creates individual runs, and each run is assigned its own set of hyperparameters based on your sweep configuration. Those hyperparameters are accessible via wandb.config inside the train() function, so you should use that to read the values. The same goes for the init call: the hyperparameters are automatically picked up from the sweep configuration, so you don’t need to pass them in yourself. The train() function should look like this:

def train():
    global device
    # Start a new W&B run
    run = wandb.init()
    config = wandb.config

    # Initialize model and optimizer with current hyperparameters
    model.to(device)
    optimizer = get_optimizer(config.optim, model, config.learning_rate)
    epochs = config.epochs

    # Update run config with fixed parameters
    run.config.update({
        "architecture": "Simple Unet",
        "dataset": "military planes",
        "loss": "L1",
        "metric": "L1",
        "framework": "PyTorch",
        "device": DEVICE,
        "torch_seed": seed
    }, allow_val_change=True)

    print(f"Starting training with config: {config}")

    for epoch in range(epochs):
        # ... training loop ...

    run.finish()

And I would recommend passing the sweep config as a dict inside the same file, as explained here, or using the CLI commands for everything (wandb sweep and wandb agent).
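For reference, a minimal sketch of the dict-in-file approach could look something like this (the entity/project names are taken from your snippet, the parameter lists are shortened, and train is the function above):

import wandb

# Sweep configuration as a Python dict (same content as the YAML file, shortened here)
sweep_configuration = {
    "name": "sweep 1",
    "method": "bayes",
    "metric": {"goal": "minimize", "name": "L1"},
    "parameters": {
        "batch_size": {"values": [128]},
        "learning_rate": {"values": [0.01, 0.015, 0.02]},  # shortened for brevity
        "optim": {"values": ["Adam", "AdamW", "SGD"]},
        "epochs": {"values": [100, 150, 200]},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

# Create the sweep, then run an agent that calls train() once per run
sweep_id = wandb.sweep(sweep=sweep_configuration, entity="kishimita", project="Simple-Unet-Training")
wandb.agent(sweep_id, function=train, count=20)  # count caps how many runs this agent executes

If you prefer the CLI route instead, you would keep the YAML file, run wandb sweep path/to/config.yaml, and then start an agent with the wandb agent entity/project/sweep_id command it prints.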


Luis, thank you for the explanation! I'm going to try this Thursday and give you an update!

Hey @kishimita, just wanted to follow up here to see if the information I provided was helpful?

It was helpful. I wasn't able to work on it today, and I won't get to it this week; it's looking like a next Tuesday thing.

Thanks for the update! Please let me know how it goes

Hey @kishimita, just wanted to check if you had the chance to take a look at the resources I shared?

Hi @kishimita, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Luis, I'm not sure if it's the same issue that's causing my new problem, but it's related. I ran a sweep using your suggestion, but the output looked like this:

How come I can't find a tab or somewhere I can click to see where all the runs went? Maybe my code is messed up somehow.

@luis_bergua ^ Not sure if you would've gotten a notification; if you did, sorry for the extra notification.