Wandb Resume Logging

2minkyulee · December 14, 2022, 1:34am

Hello,

I was trying to resume my run after a crash, but got confused about a few points.
My questions are mainly about the resume and id arguments of wandb.init().

I have read the Resume Runs docs and followed the steps described there.
Specifically, I initialized my run as follows.

my_project_name = "tmp"
my_id = "1r0f3yu4"
wandb.init(project=my_project_name, id=my_id, resume="must")

where I found my_id in wandb/run-20221214_011018-1r0f3yu4, a directory that was automatically generated by the crashed run. I have also double-checked that my_project_name is the same as for the crashed run.

However,
Problem 1) I can see that the State in my Weights & Biases workspace has changed to “running” again, but no plots or logging information are updated in the dashboard (this worked fine for the crashed run).

Problem 2) Instead of re-using the previous directory wandb/run-20221214_011018-1r0f3yu4, it generates a new directory wandb/run-anotherYYYYMMDD_anotherHHMMSS-1r0f3yu4. Is this how it is supposed to work, or am I doing something wrong?
(Is it because of https://github.com/wandb/wandb/blob/main/wandb/sdk/wandb_init.py/ line 299?)

Finally, my questions are:
Question 1) How should I resume my run? I want to continue logging my training stats to the same dashboard. (I am already saving and loading my training checkpoint with torch.save/torch.load, so I just want to know how to resume my “logging” in my Weights & Biases workspace online.)

Question 2) Is the behavior in Problem 2 expected, or am I doing something wrong?

I’m not a very good English speaker, please let me know if anything sounds unclear.

Thank you.

luis_bergua · December 16, 2022, 2:59pm

Hi MinKyu,

Thanks for writing in! For your questions:

How should I resume my run? Below is a code snippet in which I create a run, finish it, and then resume it and log data again. The new data appears in the UI properly. Could you try following the same flow in your process? If it still does not work, I can have a look at your code and see what is happening.

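import wandb

# Create a run, remember its ID, log a few steps, then finish it
run = wandb.init(project='resume_runs')
run_id = run.id
for i in range(5):
    run.log({'metric': i})
run.finish()

# Resume the same run by ID and continue logging
run_1 = wandb.init(project='resume_runs', id=run_id, resume="must")
for i in range(5):
    run_1.log({'metric': 5 + i})
run_1.finish()
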
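In a training script that crashes, the run ID is not available in memory afterwards, so one option is to store it next to the checkpoint and read it back on restart. The following is only a minimal sketch of that idea: the helper names `load_or_create_run_id` and `init_kwargs` and the file name `wandb_run_id.txt` are hypothetical, and the ID is generated with `uuid` rather than by wandb. It uses resume="allow", which starts a new run if the ID is unknown and resumes it otherwise; resume="must" could be substituted if a missing run should raise an error.

```python
import uuid
from pathlib import Path


def load_or_create_run_id(path):
    """Return (run_id, is_resume). Hypothetical helper, not part of wandb."""
    id_file = Path(path)
    if id_file.exists():
        # A previous launch already stored an ID: reuse it
        return id_file.read_text().strip(), True
    # First launch: create a short unique ID and persist it
    run_id = uuid.uuid4().hex[:8]
    id_file.parent.mkdir(parents=True, exist_ok=True)
    id_file.write_text(run_id)
    return run_id, False


def init_kwargs(project, path):
    """Build keyword arguments intended for wandb.init (assumed usage)."""
    run_id, _ = load_or_create_run_id(path)
    return {"project": project, "id": run_id, "resume": "allow"}


# Assumed usage in a training script (requires wandb and a login):
#   import wandb
#   run = wandb.init(**init_kwargs("tmp", "checkpoints/wandb_run_id.txt"))
```

Keeping the ID file in the same directory as the torch checkpoint means that restoring the checkpoint and resuming the logging are driven by the same state on disk.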
Is Problem 2 the proper way it should work? Yes, this is expected: the folder name contains the date and time at which the run was started, so a new folder is created each time the run is resumed.
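To see which local folders belong to the same run, the folder names can be grouped by their trailing run ID. This is a rough sketch that assumes the naming pattern seen in this thread (run-YYYYMMDD_HHMMSS-ID); the function name `group_by_run_id` and the second folder name in the example are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the folder name in this thread: run-<date>_<time>-<id>
RUN_DIR = re.compile(r"^run-(\d{8}_\d{6})-(\w+)$")


def group_by_run_id(dir_names):
    """Map each run ID to the sorted timestamps of its local folders."""
    groups = defaultdict(list)
    for name in dir_names:
        match = RUN_DIR.match(name)
        if match:  # skip anything that does not follow the pattern
            timestamp, run_id = match.groups()
            groups[run_id].append(timestamp)
    return {run_id: sorted(stamps) for run_id, stamps in groups.items()}


folders = [
    "run-20221214_011018-1r0f3yu4",  # original (crashed) run
    "run-20221214_093000-1r0f3yu4",  # hypothetical folder from a resume
]
print(group_by_run_id(folders))
```

Both folders map to the same run ID, which matches the behavior described above: one run online, one local folder per start.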

Best,
Luis

luis_bergua · December 20, 2022, 2:23pm

Hi MinKyu,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Luis

system · February 12, 2023, 1:34am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
