Wandb Resume Logging

Hello,

I was trying to resume my run after a crash, but got confused about a few points.
My questions are mainly about the resume and id arguments of wandb.init().

I have read the Resume Runs docs and followed the steps described there.
Specifically, I initialized my run as follows.

my_project_name = "tmp"
my_id = "1r0f3yu4"
wandb.init(project=my_project_name, id=my_id, resume="must") 

where I found my_id in wandb/run-20221214_011018-1r0f3yu4, a directory that was automatically generated by the crashed run. I have also double-checked that my_project_name is the same as the one used by the crashed run.
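For readers who want to pull the run ID out of such a directory name programmatically, here is a minimal sketch. It assumes the local directory name follows the run-<date>_<time>-<id> layout seen in the path above; parse_run_dir and RUN_DIR_PATTERN are hypothetical helper names, not part of the wandb API.

```python
import re
from pathlib import Path

# Assumed layout, taken from the directory name quoted above:
#   run-<YYYYMMDD>_<HHMMSS>-<run id>
RUN_DIR_PATTERN = re.compile(r"^run-(\d{8}_\d{6})-(.+)$")


def parse_run_dir(path):
    """Return (timestamp, run_id) for a local run directory.

    Returns None if the directory name does not match the assumed layout.
    """
    match = RUN_DIR_PATTERN.match(Path(path).name)
    if match is None:
        return None
    return match.group(1), match.group(2)


timestamp, run_id = parse_run_dir("wandb/run-20221214_011018-1r0f3yu4")
print(run_id)  # 1r0f3yu4
```

The extracted run_id is the value that would be passed as id= to wandb.init().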

However,
Problem 1) I can see that the State in my Weights & Biases workspace has changed to “running” again, but no plots or logging information are updated in the dashboard (this worked fine for the crashed run).

Problem 2) Instead of re-using the previous directory wandb/run-20221214_011018-1r0f3yu4, it generates a new directory wandb/run-anotherYYYYMMDD_anotherHHMMSS-1r0f3yu4. Is this the expected behavior, or am I doing something wrong?
(Is it because of https://github.com/wandb/wandb/blob/main/wandb/sdk/wandb_init.py/ line299?)

Finally, my questions are:
Question 1) How should I resume my run? I want to continue logging my training stats on the same dashboard. (I am already saving and loading my training checkpoints with torch.save/torch.load, so I just want to know how to resume my “logging” in my Weights & Biases workspace online.)

Question 2) Is the behavior in Problem 2 expected, or am I doing something wrong?

I’m not a very good English speaker, please let me know if anything sounds unclear.

Thank you.

Hi MinKyu,

Thanks for writing in! For your questions:

  • How should I resume my run? Below is a code snippet in which I create a run, finish it, then resume it and log data again. The new data appears in the UI properly. Could you try following the same flow in your process? If it still does not work, I can have a look at your code and see what is happening.

    import wandb

    run = wandb.init(project='resume_runs')
    id = run.id
    for i in range(5):
        run.log({'metric': i})
    run.finish()

    run_1 = wandb.init(project='resume_runs', id=id, resume="must")
    for i in range(5):
        run_1.log({'metric': 5 + i})
    run_1.finish()

  • Is Problem 2 the proper way it should work? Yes, this is expected. The folder name contains the date and time at which the run was created, so a new folder is created whenever the run is resumed.
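To illustrate the second point, the sketch below lists every local folder that belongs to one run ID, oldest first. It assumes the run-<date>_<time>-<id> naming seen in this thread; local_dirs_for_run is a hypothetical helper, not a wandb function.

```python
from pathlib import Path


def local_dirs_for_run(wandb_root, run_id):
    """List every local run directory for run_id, oldest first."""
    root = Path(wandb_root)
    if not root.is_dir():
        return []
    # Names start with a fixed-width timestamp, so sorting the paths
    # by name also sorts them by creation time.
    return sorted(p for p in root.glob(f"run-*-{run_id}") if p.is_dir())


# After a crash and one resume, this would be expected to print two
# folders that share the same run ID but carry different timestamps.
for directory in local_dirs_for_run("wandb", "1r0f3yu4"):
    print(directory.name)
```

All of these folders sync to the same run online, since the run is identified by its ID rather than by the local folder.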

Best,
Luis

Hi MinKyu,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Luis
