How to continue a specific run after stopping?

frem · March 14, 2022, 3:59pm

Hello, I am new to wandb and can't seem to wrap my head around how to continue a run after I stop it. I tried wandb.restore and loading the weights from “wandb\run-20220313_020710-18ws9vua\files”, but I get the following error: “wandb.errors.CommError: Could not find run”.
It seems that the run isn’t unique? Do I need to set up something in the init? Or am I just going about it all wrong?
Thank you very much for your time!

ramit_goolry · March 14, 2022, 9:57pm

Hi @frem,

I’m sorry you are facing this issue. Could you have a look at our Resuming Guide to see if this resolves your issue?

If not, I would really appreciate it if you could send over the whole stack trace, and I’ll help you debug this further.

Thanks,
Ramit

frem · March 17, 2022, 2:27pm

Hello, it seems resume is the way to go. I tried to implement it, but it doesn’t seem to be working right.

This is my code, with everything related to training and pre-processing removed. When I first start training I have resume = False; after I interrupt the run I change it to resume = True and run it again, but a new run starts. Also, the checkpoints folder I created is empty, so no files have been saved.

Thank you very much for your time again!

def main():
    # Weights and Biases
    torch.manual_seed(0)  # to fix the split result

    CHECKPOINT_PATH = r'C:\Users\wandb\checkpoints'

    run = wandb.init(project="my-test-project", entity="frem", save_code=True, resume=True)
    if wandb.run.resumed:
        checkpoint = torch.load(wandb.restore(CHECKPOINT_PATH))
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        epoch = checkpoint['epoch']
        train_batch_loss = checkpoint['train_batch_loss']
        train_batch_acc = checkpoint['train_batch_acc']
        epoch_train_loss = checkpoint['epoch_train_loss']
        epoch_train_acc = checkpoint['epoch_train_acc']
        epoch_val_loss = checkpoint['epoch_val_loss']
        epoch_val_acc = checkpoint['epoch_val_acc']
        best_loss = checkpoint['best_loss']
        counter = checkpoint['counter']
        early_stop = checkpoint['early_stop']

    CONFIG = dict(
        model_conf="resnet50",
        lr_conf=0.001,
        max_epochs=100,
        batch_size=32,
        optimizer="Adam",
        loss_conf="CrossEntropyLoss"
    )
    wandb.config = CONFIG
    print("\tWANDB SET UP DONE")

    # TRANSFORMS

    # DATA SET UP

    # MODEL SET UP

    # Training and Validation

    # after training 1 epoch I log these
    wandb.log({
        "train_batch_loss": train_batch_loss,
        "train_batch_acc": train_batch_acc
    })

    wandb.log({
        "epoch_train_loss": torch.tensor(train_losses).mean(),
        "epoch_train_acc": torch.tensor(train_accuracies).mean()
    })

    # after validation I log
    wandb.log({
        "epoch_val_loss": torch.tensor(val_losses).mean(),
        "epoch_val_acc": torch.tensor(val_accuracies).mean()
    })

    # and at the end of each epoch
    torch.save({  # Save our checkpoint loc
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'train_batch_loss': train_batch_loss,
        'train_batch_acc': train_batch_acc,
        'epoch_train_loss': torch.tensor(train_losses).mean(),
        'epoch_train_acc': torch.tensor(train_accuracies).mean(),
        'epoch_val_loss': torch.tensor(val_losses).mean(),
        'epoch_val_acc': torch.tensor(val_accuracies).mean(),
        'best_loss': best_loss,
        'counter': counter,
        'early_stop': early_stop,
    }, CHECKPOINT_PATH)
    wandb.save(CHECKPOINT_PATH)  # saves checkpoint to wandb

    torch.save(model.state_dict(), os.path.join(r'D:\Art DataBase\models', f"{CONFIG['model_conf']}_{epoch}.pth"))
    wandb.save(os.path.join(r'D:\Art DataBase\models', f"{CONFIG['model_conf']}_{epoch}.pth"))

ramit_goolry · March 30, 2022, 4:50pm

Hi @frem,

Are you passing in the ID of the previous run you are trying to resume? The SDK needs to know the ID in order to pick up the run where it left off. Odds are that wandb.init() is creating a new run ID and looking for that ID in your previous runs, which would explain the “Could not find run” error message you are receiving.
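
As a rough sketch of what passing the ID could look like: the local run folder mentioned earlier in this thread (run-20220313_020710-18ws9vua) appears to end with the run ID, so one option is to read the ID from there and hand it to wandb.init through its id and resume arguments. The helper names run_id_from_dir and resume_run are hypothetical, and the folder-name pattern is an assumption based on the default naming.

```python
import re


def run_id_from_dir(run_dir: str) -> str:
    """Extract the run ID from a path containing 'run-<date>_<time>-<id>'.

    Hypothetical helper, not part of wandb; assumes the default folder naming.
    """
    # Check every path component, accepting both '\\' and '/' separators.
    for part in re.split(r"[\\/]", run_dir):
        match = re.fullmatch(r"run-\d{8}_\d{6}-(.+)", part)
        if match is not None:
            return match.group(1)
    raise ValueError(f"no wandb run folder found in {run_dir!r}")


def resume_run(project: str, entity: str, run_id: str):
    """Resume an existing run; resume='must' errors out if the run is missing."""
    import wandb  # imported lazily so run_id_from_dir works without wandb

    return wandb.init(project=project, entity=entity, id=run_id, resume="must")


run_id = run_id_from_dir(r"wandb\run-20220313_020710-18ws9vua\files")
# resume_run("my-test-project", "frem", run_id) would then reattach to that run.
```

Using resume="allow" instead of "must" would start a new run under that ID when no matching run exists, which may be more convenient when the same script handles both the first launch and later restarts.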

Thanks,
Ramit

ramit_goolry · April 4, 2022, 4:07pm

Hi @frem,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

ramit_goolry · April 7, 2022, 5:39pm

Hi @frem, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

frem · April 13, 2022, 3:30am

Hi, I am a bit confused about how IDs work. Do I need to use WANDB_RESUME and WANDB_RUN_ID to set the ID for the run, or would changing the way I call wandb.init to provide it with an existing ID solve the problem of it generating a new ID each time?
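
For reference, the environment-variable route could be sketched as follows. WANDB_RUN_ID and WANDB_RESUME are the variables named above; the helper configure_resume_env is hypothetical, and the ID is the one from the run folder mentioned earlier in this thread. Passing id and resume directly to wandb.init should have the same effect, so only one of the two routes is needed.

```python
import os


def configure_resume_env(run_id: str, mode: str = "allow") -> None:
    """Set the environment variables that wandb reads when a run starts.

    Hypothetical helper; it has to run before wandb.init() is called.
    """
    os.environ["WANDB_RUN_ID"] = run_id
    os.environ["WANDB_RESUME"] = mode


# ID taken from the folder name run-20220313_020710-18ws9vua.
configure_resume_env("18ws9vua")
# A later wandb.init(project="my-test-project", entity="frem") call would then
# pick up the run ID and resume mode from the environment.
```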

Another problem is that no files seem to be saved in the path specified for the checkpoints. I get a “PermissionError: [Errno 13] Permission denied:” when trying to save the checkpoint in a folder, even though the same script saves all other information with no issue. I also tried running the script as admin, and it gave me a “wandb: Network error (ReadTimeout), entering retry loop.” error and then the PermissionError again.
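
One possible cause, offered as a guess rather than a diagnosis: CHECKPOINT_PATH in the code above names the checkpoints folder itself, and on Windows trying to open a folder for writing typically raises PermissionError: [Errno 13]. A minimal sketch under that assumption, which builds a file path inside the folder instead; the helper checkpoint_file and the file name checkpoint.tar are hypothetical.

```python
import os
import tempfile


def checkpoint_file(checkpoint_dir: str, filename: str = "checkpoint.tar") -> str:
    """Create the checkpoint folder if needed and return a file path inside it.

    Hypothetical helper; the default file name is an assumption.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    return os.path.join(checkpoint_dir, filename)


# A temporary folder stands in for the real checkpoints folder here.
demo_dir = os.path.join(tempfile.mkdtemp(), "checkpoints")
path = checkpoint_file(demo_dir)
# torch.save(state, path) would now write a file rather than target the folder.
```

If this turns out to be the cause, the same file path would also be the one to pass to wandb.save.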

Thanks again for your time!

system · June 12, 2022, 3:31am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
