Hello,
I was trying to resume my run after a crash, but got confused about some points.
My questions are mainly about the resume and id arguments of wandb.init().
I have read the Resume Runs docs and followed the steps described there.
Specifically, I initialized my run as follows:
my_project_name = "tmp"
my_id = "1r0f3yu4"
wandb.init(project=my_project_name, id=my_id, resume="must")
Here, I took my_id from wandb/run-20221214_011018-1r0f3yu4, the directory that was automatically generated by the crashed run. I have also double-checked that my_project_name is the same as in the crashed run.
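To make it clear how I read the ID off the directory name, here is a minimal sketch. It assumes the local run directory is always named run-<YYYYMMDD>_<HHMMSS>-<id>, which is only what I observed in my own wandb folder; run_id_from_dir is a hypothetical helper of mine, not part of the wandb API.

```python
import re

# Assumed naming pattern, based only on the directory seen above:
# run-<YYYYMMDD>_<HHMMSS>-<run id>
RUN_DIR_PATTERN = re.compile(r"^run-\d{8}_\d{6}-(?P<run_id>[^/\\]+)$")


def run_id_from_dir(path: str) -> str:
    """Return the run ID embedded in a local run directory name.

    Hypothetical helper for illustration; not a wandb function.
    """
    # Keep only the last path component, ignoring a trailing slash.
    name = path.replace("\\", "/").rstrip("/").split("/")[-1]
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised run directory name: {name!r}")
    return match.group("run_id")


my_id = run_id_from_dir("wandb/run-20221214_011018-1r0f3yu4")
print(my_id)
```

The value obtained this way is what I pass as id to wandb.init() above.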
However,
Problem 1) I can see that the State in my Weights & Biases workspace has changed to “running” again, but no plots or logged values are being updated in the dashboard (this worked fine for the crashed run).
Problem 2) Instead of reusing the previous directory wandb/run-20221214_011018-1r0f3yu4, it generates a new directory wandb/run-anotherYYYYMMDD_anotherHHMMSS-1r0f3yu4. Is this the expected behavior, or am I doing something wrong?
(Is it because of https://github.com/wandb/wandb/blob/main/wandb/sdk/wandb_init.py/ line 299?)
Finally, my questions are:
Question 1) How should I resume my run? I want to continue logging my training stats to the same dashboard. (I already save and restore my training checkpoints with torch.save/torch.load, so I just want to know how to resume the “logging” in my Weights & Biases workspace online.)
Question 2) Is the behavior in Problem 2 expected, or am I doing something wrong?
I’m not a very good English speaker, please let me know if anything sounds unclear.
Thank you.