I am looking to resume some runs that crashed due to a timeout on a slurm cluster. I am doing this by supplying wandb.init
with id=run_id
and resume="must"
, however it is unable to resume as it seemingly is not finding the run or something like that.
My resume code is as follows:
init_args = {}
if args.run_id is not None:
init_args['id'] = args.run_id
init_args['resume'] = 'must'
init_args['project'] = wandb_project
with wandb.init(**init_args):
# training code here (including loading the checkpoint in the case of resume)
When I run this it logs the following:
wandb: Sweep Agent: Waiting for job.
wandb: Sweep Agent: Exiting.
and then sets the online Sweep State to finished. I am doing a grid search, and every parameter combination has a run associated with it, however some crashed due to the timeout. I am passing the IDs for these runs to my program (i.e. args.run_id contains the run ID of a crashed run), yet this happens.
Is there anything I am missing?