Hi. When a run crashes at iteration T+X but the last available checkpoint is at iteration T, it is currently not possible to resume the run at T and continue training while overwriting the previously logged metrics between T and T+X. I tried the following:
resume="must": This seemingly “works” but displays a warning that the previously logged values cannot be overwritten and that the new ones will be ignored. The corresponding plot then contains the old metrics up to T+X, followed by the new metrics. This may not be a bug, but it is not the commonly desired behavior (the old metrics from the crashed run between T and T+X are no longer relevant).
resume=None: I’m pretty sure this used to work in the past but repeated tests show that it doesn’t work now. The behavior is strange: W&B does not report any warnings, the run on the web shows certain signs that it has been resumed (for example, the console output is logged and the run duration is updated) but no new metrics (beyond T+X) appear in the plots, see example below. I am pretty sure this is some kind of bug because this hardly looks like the desired behavior. I seem to recall that I have been using this option in the past but it isn’t working any more.
Example:
Green run: crashed at iter=14, resumed using resume=must at iter=10 and the values should start from 1 again. It can be seen that up to 14 the old values are kept and only for iter=15 onwards the new values are logged.
Orange run: crashed at iter=14, resumed using resume=None at iter=10 and no values beyond 14 appear although they have been logged.
Blue dashed: Desired behavior (drawn by hand).
The toy script:
import random, time
import wandb

is_new = True # run first with True to start new experiment and then with False to resume the last experiment
if(is_new):
    run = wandb.init(project="test")
    iter = 0
    # save run_id for the future
    with open("_run_id.txt", "w") as file:
        file.write(run.id)
else:
    # read run_id of the previous run
    with open("_run_id.txt") as file:
        run_id = file.readline()
    run = wandb.init(id=run_id, resume=None, project="test") # try different options for 'resume'
    iter = 10 # last valid iter from the crashed run is 10 (ie simulate last available checkpoint at this iter)

# experiment
i = 0
while(True):
    iter += 1
    i += 1
    run.log({"val": i + random.uniform(-1, 1)}, step=iter)
    time.sleep(.5)
    if(is_new and iter >= 14):
        raise SystemExit(1) # simulates crash at iter == 14
    if(iter >= 20):
        break # normal exit of finished experiment
print("finished", "new" if is_new else "resume")
Hi @hookxs, thank you for writing in and sending the information.
Regarding your requirement:
When a run crashes at iteration T+X but the last available checkpoint is at iteration T, it is currently not possible to resume the run at T and continue training while overwriting the previously logged metrics between T and T+X
We are introducing a new feature, fork from a run, which addresses exactly your ask. You will be able to start a new Run forked from an existing one at a specified logged step. More information is available under Fork from a Run.
As this is currently in beta, it is not available by default, but if you are happy to try it we can enable it from our side.
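For readers who want to see roughly what forking looks like in code, here is a minimal sketch. It assumes the `fork_from` argument of `wandb.init` takes a string of the form `<run_id>?_step=<step>`, as the beta documentation describes, and that the feature has been enabled for your account. The helper names `fork_spec` and `fork_from_checkpoint` and the run ID `abc123` are hypothetical, made up for illustration.

```python
def fork_spec(run_id, step):
    """Build the "<run_id>?_step=<step>" string that `fork_from` is assumed to take."""
    return f"{run_id}?_step={step}"


def fork_from_checkpoint(run_id, step, project="test"):
    """Start a NEW run forked from `run_id` at `step` (needs the beta feature enabled)."""
    import wandb  # imported lazily so fork_spec can be used without wandb installed

    return wandb.init(project=project, fork_from=fork_spec(run_id, step))


# Hypothetical run ID; in the toy script it would be read from _run_id.txt.
print(fork_spec("abc123", 10))
```

Note that this creates a separate run with its own ID rather than continuing the original one, so the crashed run stays in the project alongside the fork.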
The behaviour you are seeing with the resume argument is the expected behaviour: with must, the Run will be resumed; however, no metrics will be logged for step values that were previously logged. With resume=None, the Run is not resumed and a new run is logged. Looking at the toy code you sent, the last logged Step value should be 14:
if(is_new and iter >= 14):
    raise SystemExit(1) # simulates crash at iter == 14
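To make the described "must" behaviour concrete, the following is a toy model in plain Python. It is not W&B's implementation; it only assumes the rule stated above, namely that a value is dropped when its step is not beyond the highest step already logged. The class name `StepHistory` is made up for this sketch.

```python
class StepHistory:
    """Toy stand-in for a run's metric history (not W&B's implementation)."""

    def __init__(self):
        self.rows = {}       # step -> logged value
        self.last_step = -1  # highest step accepted so far

    def log(self, value, step):
        """Accept a value only if `step` is beyond everything logged so far."""
        if step <= self.last_step:
            return False  # dropped, like the values the warning says are ignored
        self.rows[step] = value
        self.last_step = step
        return True


history = StepHistory()

# First attempt: logs steps 1..14, then "crashes".
for step in range(1, 15):
    history.log(("old", step), step)

# Resumed from the checkpoint at step 10: tries to log steps 11..20.
accepted = [step for step in range(11, 21) if history.log(("new", step), step)]

print(accepted)          # steps whose new values were kept
print(history.rows[12])  # still the value from the crashed attempt
```

Under this rule only steps 15 to 20 of the resumed attempt are kept, which matches the green run in the example: old values up to 14, new values from 15 onwards.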
Hi, thanks for taking the time to respond. The new “Fork from a run” feature sounds interesting and I can well imagine that it can be useful for example for the following scenario: I have a successfully finished run but I decide to slightly change some parameters (for example LR schedule) for the last few iterations to see how it affects the results. So I start a new run using “fork from a run” at a checkpoint shortly before the end, adjust the LR schedule and when it finishes, I have both runs available for comparison.
I am not sure, though, that this functionality is the best candidate for resuming. In my case (and I am sure for other people as well), the most common reason for a crashed run is that somebody (either the scheduling service or a colleague who needs a GPU) simply kills it. What I need then is to simply finish the same run - i.e. continue from the last available checkpoint. I don’t want to start a new run, I want this one finished. I would expect there to be an option for the resume arg that allows that. Personally, I would suggest something like resume="overwrite" that behaves like "must" but allows starting from an earlier step (nothing fancy, it just allows overwriting existing values).
As for resume=None - ok, maybe this is the expected behavior, but I personally find it strange, because what it essentially does (as my toy example shows) is that the logging simply doesn’t work: without issuing any warning or error, calls to wandb.log are silently ignored, and logging is not resumed even when step exceeds the previously highest logged value. It may be the expected behavior but I hardly find it desirable. You say that in that case a new run is logged but, as my example shows, it is not (at least not when a run with the same ID already exists).
Regarding ‘Fork from a run’ - that is exactly the use case for this: being able to compare training pipelines which diverged from a specific step.
Regarding the ability to overwrite previously logged steps, this was prevented by design to ensure that the data integrity of the experiments was preserved. However, since we have heard various feedback about being able to overwrite previously logged data, we are currently working on a new feature, Rewind, which will allow logging metrics against previously logged steps. This should be available in the coming months.
Regarding the behaviour with resume=None, I see that it is not behaving as expected and is not providing any warning. With resume=None the data should be overwritten: the data should be logged as a new Run containing only the val metric with values from steps 10 to 19, while the data up to step 10 is deleted, so this wouldn’t fit your use case unfortunately.
In the meantime, would you like us to enable the fork from a run feature as a workaround (having new forked runs whenever you have to resume from a previously logged step) or would you prefer to wait for the Rewind feature to be available?