Overwrite previously logged metrics when resuming a run

Hi. When a run crashes at iteration T+X but the last available checkpoint is at iteration T, it is currently not possible to resume the run at T and continue training while overwriting the previously logged metrics between T and T+X. I tried the following:

resume = "must": Seemingly “works” but displays warning that the previously logged values cannot be overwritten and the new ones are going to be ignored, the corresponding plot then contains old metrics up to T+X and then shows the new metrics. This may not be a bug but it is not the commonly desired behavior (since the old metrics from the crashed run between T and T+X are no longer relevant).

resume=None: I am fairly sure this used to work in the past, but repeated tests show that it no longer does. The behavior is strange: W&B reports no warnings, and the run on the web shows some signs of having been resumed (for example, the console output is logged and the run duration is updated), yet no new metrics beyond T+X appear in the plots; see the example below. This looks like a bug, because it can hardly be the intended behavior.

Example:

[Screenshot of the W&B plot comparing the three runs described below]

Green run: crashed at iter=14, resumed with resume="must" at iter=10; the values should start from 1 again. Up to iter=14 the old values are kept, and only from iter=15 onwards are the new values logged.

Orange run: crashed at iter=14, resumed with resume=None at iter=10; no values beyond iter=14 appear, even though they were logged.

Blue dashed: Desired behavior (drawn by hand).

The toy script:

import random
import time

import wandb

# Run first with is_new = True to start a new experiment, then with False to resume it.
is_new = True

if is_new:
    run = wandb.init(project="test")
    iter = 0

    # save the run id so the resumed process can find it later
    with open("_run_id.txt", "w") as file:
        file.write(run.id)
else:
    # read the run id of the previous (crashed) run
    with open("_run_id.txt") as file:
        run_id = file.readline().strip()

    run = wandb.init(id=run_id, resume=None, project="test")  # try different options for 'resume'
    iter = 10  # last valid iter from the crashed run, i.e. simulate the last available checkpoint

# experiment
i = 0
while True:
    iter += 1
    i += 1
    run.log({"val": i + random.uniform(-1, 1)}, step=iter)

    time.sleep(0.5)

    if is_new and iter >= 14:
        raise SystemExit(1)  # simulates a crash at iter == 14

    if iter >= 20:
        break  # normal exit of a finished experiment

print("finished", "new" if is_new else "resume")

Hi @hookxs, thank you for writing in and sending the information.

Regarding your requirement:

When a run crashes at iteration T+X but the last available checkpoint is at iteration T, it is currently not possible to resume the run at T and continue training while overwriting the previously logged metrics between T and T+X

We are introducing a new feature, Fork from a Run, which addresses exactly this ask. You will be able to start a new run that forks from an existing one at a specified logged step. See the Fork from a Run documentation for more information.

As this is currently in beta, it is not available by default, but if you are happy to try it we can enable it from our side.
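
To give a rough idea of what this would look like once enabled, here is a minimal sketch based on the Fork from a Run documentation (the fork_from argument and its "{run_id}?_step={step}" format are taken from the beta docs and may still change):

import wandb

# read the id of the crashed run (the same file the toy script writes)
with open("_run_id.txt") as file:
    run_id = file.readline().strip()

# start a NEW run whose history is forked from the crashed run at step 10,
# i.e. the last step covered by the available checkpoint
forked_run = wandb.init(
    project="test",
    fork_from=f"{run_id}?_step=10",
)

# continue training and log steps 11, 12, ... into forked_run

Note that forking creates a separate run: the crashed run keeps its metrics between T and T+X, while the forked run contains the history up to the fork point plus whatever you log afterwards, which gives you the "clean" plot you drew by hand.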

The behaviour you are seeing with the resume argument is the expected behaviour: with "must", the run is resumed, but no metrics are logged for step values that were already logged previously. With resume=None, the run is not resumed and a new run is logged instead. Looking at the toy code you sent, the last logged step value should be 14:

if is_new and iter >= 14:
    raise SystemExit(1)  # simulates a crash at iter == 14
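
As a side note, if strict overwriting is not required and you mainly want the resumed run to keep logging new data past the crash point, you could continue from the run's reported step after resuming instead of from the checkpoint iteration. A small sketch (assuming run.step reflects the last logged step once the run is resumed; worth double-checking on your client version):

import wandb

with open("_run_id.txt") as file:
    run_id = file.readline().strip()

run = wandb.init(id=run_id, resume="must", project="test")

# start above whatever was already logged so nothing is silently dropped
iter = max(10, run.step)
while iter < 20:
    iter += 1
    run.log({"val": iter}, step=iter)  # dummy value, just to illustrate the step handling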

Let me know if you have any further questions!