Resume run not working for sweep run

I am trying to resume some runs that crashed due to a timeout on a Slurm cluster. I do this by passing id=run_id and resume="must" to wandb.init, but the run is not resumed; it seems the run cannot be found.

My resume code is as follows:

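    import wandb

    # args and wandb_project are defined elsewhere in my program
    init_args = {}
    if args.run_id is not None:
        # resume the crashed run instead of starting a new one
        init_args['id'] = args.run_id
        init_args['resume'] = 'must'
        init_args['project'] = wandb_project
    with wandb.init(**init_args):
        # training code here (including loading the checkpoint in the case of resume)
        ...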

When I run this it logs the following:

wandb: Sweep Agent: Waiting for job.
wandb: Sweep Agent: Exiting.

The agent then exits and the sweep state shown online is set to finished. I am doing a grid search, and every parameter combination already has a run associated with it, but some of those runs crashed due to the timeout. I am passing the IDs of these runs to my program (i.e. args.run_id contains the run ID of a crashed run), yet the agent still exits.

Is there anything I am missing?

To provide some more context: I am starting a wandb agent with the correct sweep_id and passing it the function that contains the code from my post above.

I just read this on the Resume W&B Runs page:

Note that resuming a run which was executed as part of a Sweep is not supported.

I suppose my question is answered now. However, I would definitely like to see this supported, since the lack of it makes it difficult to run sweeps on Slurm clusters, where processes may time out.


Hi @timvandamcs

Thank you for writing in with your question. The statement above is correct: you cannot resume runs that were part of a sweep. To re-run a specific configuration of a sweep, wait until the sweep runs to completion, then delete the failed runs from the runs table. Resume the sweep and start a sweep agent with the existing sweep ID. The failed runs will then re-execute.
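For anyone who would rather script the "delete the failed runs" step than click through the runs table, here is a rough sketch using the public API (wandb.Api, api.sweep, run.delete). The helper names select_failed_runs and delete_failed_sweep_runs are made up for illustration, and the state names "crashed" and "failed" are an assumption about how timed-out runs get reported, so please check what state your runs actually show before deleting anything.

```python
from types import SimpleNamespace

# Assumption: runs killed by a timeout are reported as "crashed" or "failed".
FAILED_STATES = ("crashed", "failed")


def select_failed_runs(runs, failed_states=FAILED_STATES):
    """Return the runs whose .state is one of failed_states."""
    return [run for run in runs if run.state in failed_states]


def delete_failed_sweep_runs(sweep_path):
    """Delete the failed runs of a sweep and return their IDs.

    sweep_path is "entity/project/sweep_id". This talks to the W&B
    servers and permanently deletes runs, so use it with care.
    """
    import wandb  # imported here so the filter above works without wandb

    api = wandb.Api()
    sweep = api.sweep(sweep_path)
    failed = select_failed_runs(sweep.runs)
    for run in failed:
        run.delete()
    return [run.id for run in failed]


# Offline demo of the filter with stand-in objects (not real W&B runs).
demo_runs = [
    SimpleNamespace(id="a", state="finished"),
    SimpleNamespace(id="b", state="crashed"),
    SimpleNamespace(id="c", state="failed"),
]
demo_failed_ids = [run.id for run in select_failed_runs(demo_runs)]
```

After the failed runs are deleted, the remaining steps are the ones described above: resume the sweep, then start an agent against the existing sweep ID, for example with wandb.agent(sweep_id, function=train), where train stands for your own training function.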

I will add your notes on resuming runs of a sweep as a feature request with our product team.
