How to continue a specific run after stopping?

Hello, I am new to using wandb and I can’t seem to wrap my head around how to continue a run after I stop it. I tried wandb.restore and loading the weights from “wandb\run-20220313_020710-18ws9vua\files”, but I get the following error: “wandb.errors.CommError: Could not find run”.
It seems that the run isn’t unique? Do I need to set something up in the init? Or am I just going about it all wrong?
Thank you very much for your time!

Hi @frem,

I’m sorry you are facing this issue. Could you have a look at our Resuming Guide to see if it resolves things?

If not, I would really appreciate it if you could send the full stack trace over, and I’ll help you debug this further.
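
In the meantime, here is roughly what resuming looks like in code (a minimal sketch with a placeholder id; the guide covers the different resume modes in detail):

import wandb

# "allow" resumes the run if the id exists and starts a new one otherwise;
# "must" raises an error instead of silently creating a new run.
run = wandb.init(
    project="my-test-project",
    id="<your-run-id>",  # the id of the run you want to continue
    resume="allow",
)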

Thanks,
Ramit

Hello, it seems resume is the way to go. I tried to implement it, but it doesn’t seem to be working correctly.

This is my code; I have removed everything that has to do with the training and pre-processing. When I first start training I have resume=False, and after I interrupt the run I change it to resume=True and run it again, but a new run starts. Also, the checkpoints folder I created is empty, so no files have been saved.

Thank you very much for your time again!

import os

import torch
import wandb


def main():
    # Weights & Biases
    torch.manual_seed(0)  # to fix the split result

    CHECKPOINT_PATH = r'C:\Users\wandb\checkpoints'

    run = wandb.init(project="my-test-project", entity="frem", save_code=True, resume=True)
    if wandb.run.resumed:
        checkpoint = torch.load(wandb.restore(CHECKPOINT_PATH))
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        epoch = checkpoint['epoch']
        train_batch_loss = checkpoint['train_batch_loss']
        train_batch_acc = checkpoint['train_batch_acc']
        epoch_train_loss = checkpoint['epoch_train_loss']
        epoch_train_acc = checkpoint['epoch_train_acc']
        epoch_val_loss = checkpoint['epoch_val_loss']
        epoch_val_acc = checkpoint['epoch_val_acc']
        best_loss = checkpoint['best_loss']
        counter = checkpoint['counter']
        early_stop = checkpoint['early_stop']

    CONFIG = dict(
        model_conf="resnet50",
        lr_conf=0.001,
        max_epochs=100,
        batch_size=32,
        optimizer="Adam",
        loss_conf="CrossEntropyLoss",
    )
    wandb.config = CONFIG
    print("\tWANDB SET UP DONE")

    # TRANSFORMS

    # DATA SET UP

    # MODEL SET UP (model and optimizer are created here; removed for brevity)

    # Training and Validation

    # After training one epoch I log these
    wandb.log({
        "train_batch_loss": train_batch_loss,
        "train_batch_acc": train_batch_acc,
    })

    wandb.log({
        "epoch_train_loss": torch.tensor(train_losses).mean(),
        "epoch_train_acc": torch.tensor(train_accuracies).mean(),
    })

    # After validation I log
    wandb.log({
        "epoch_val_loss": torch.tensor(val_losses).mean(),
        "epoch_val_acc": torch.tensor(val_accuracies).mean(),
    })

    # And at the end of each epoch
    torch.save({  # save our checkpoint
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'train_batch_loss': train_batch_loss,
        'train_batch_acc': train_batch_acc,
        'epoch_train_loss': torch.tensor(train_losses).mean(),
        'epoch_train_acc': torch.tensor(train_accuracies).mean(),
        'epoch_val_loss': torch.tensor(val_losses).mean(),
        'epoch_val_acc': torch.tensor(val_accuracies).mean(),
        'best_loss': best_loss,
        'counter': counter,
        'early_stop': early_stop,
    }, CHECKPOINT_PATH)
    wandb.save(CHECKPOINT_PATH)  # saves checkpoint to wandb

    torch.save(model.state_dict(),
               os.path.join(r'D:\Art DataBase\models', f"{CONFIG['model_conf']}_{epoch}.pth"))
    wandb.save(os.path.join(r'D:\Art DataBase\models', f"{CONFIG['model_conf']}_{epoch}.pth"))

Hi @frem,

Are you passing in the id of the previous run you are trying to resume? The SDK needs to know the id in order to pick up the run where it left off. Odds are that wandb.init() is creating a new run id and then looking for that id among your previous runs, which would explain the “Could not find run” error message you are receiving.
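
Judging by the run folder you mentioned (run-20220313_020710-18ws9vua), the id would be the last segment, so resuming should look roughly like this (a sketch; double-check the id against your own run folder):

import wandb

# Point wandb.init at the existing run instead of letting it generate a new id.
# resume="must" fails loudly if that id cannot be found, which makes debugging easier.
run = wandb.init(
    project="my-test-project",
    entity="frem",
    id="18ws9vua",
    resume="must",
)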

Thanks,
Ramit

Hi @frem,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Hi @frem , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!

Hi, I am a bit confused about how the ids work. Do I need to use WANDB_RESUME and WANDB_RUN_ID to set the id for the run, or would altering the way I call wandb.init to provide it with an existing id solve the fact that it generates a new id each time?
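
In other words, would either of the following do the trick (using the id from my run-20220313_020710-18ws9vua folder as an example)?

import os
import wandb

# Option 1: environment variables, set before wandb.init() is called
os.environ["WANDB_RESUME"] = "must"
os.environ["WANDB_RUN_ID"] = "18ws9vua"
run = wandb.init(project="my-test-project", entity="frem")

# Option 2: pass the same information directly to wandb.init
run = wandb.init(
    project="my-test-project",
    entity="frem",
    id="18ws9vua",
    resume="must",
)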

Another problem is that no files seem to be saved to the path I specified for the checkpoints. I get a “PermissionError: [Errno 13] Permission denied:” when trying to save the checkpoint to a folder, even though the same script saves all other information with no issue. I also tried running the script as admin, and it gave me a “wandb: Network error (ReadTimeout), entering retry loop.” message and then the PermissionError again.
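
Could the problem be that my CHECKPOINT_PATH points at the folder itself? On Windows, opening a directory for writing raises exactly this PermissionError, so maybe torch.save needs a full file name instead, something like this (checkpoint.pt is just a name I made up):

import os
import torch

CHECKPOINT_DIR = r'C:\Users\wandb\checkpoints'
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
# torch.save writes a single file, so the target must be a file path, not a directory.
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, 'checkpoint.pt')
torch.save({'epoch': 0}, CHECKPOINT_PATH)  # stand-in for the real checkpoint dict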

Thanks again for your time!

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.