Resume run not working for sweep run

timvandamcs · January 12, 2023, 5:47pm

I am trying to resume some runs that crashed due to a timeout on a Slurm cluster. I am doing this by passing id=run_id and resume="must" to wandb.init, but the run does not resume; it seems wandb is not finding the run.

My resume code is as follows:

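            import wandb

            # args (parsed CLI arguments) and wandb_project are defined elsewhere in the script
            init_args = {}
            if args.run_id is not None:
                # Resume the existing run; 'must' raises an error if no run with this ID exists
                init_args['id'] = args.run_id
                init_args['resume'] = 'must'
                init_args['project'] = wandb_project
            with wandb.init(**init_args):
                # training code here (including loading the checkpoint in the case of resume)
                ...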

When I run this it logs the following:

wandb: Sweep Agent: Waiting for job.
wandb: Sweep Agent: Exiting.

and then sets the online Sweep State to finished. I am doing a grid search, and every parameter combination has a run associated with it, but some of those runs crashed due to the timeout. I am passing the IDs of these runs to my program (i.e. args.run_id contains the run ID of a crashed run), yet the agent still exits as shown above.

Is there anything I am missing?

timvandamcs · January 13, 2023, 11:49am

To provide some more context, I am starting a wandb agent with the correct sweep_id and passing it the function that contains the code from my post above.

timvandamcs · January 15, 2023, 4:37pm

I just read this:

Note that resuming a run which was executed as part of a Sweep is not supported.

on this page: Resume W&B Runs.

I suppose my issue is solved now. However, I would definitely like to see this supported, as the current limitation makes it difficult to run sweeps on Slurm clusters where processes may time out.

mohammadbakir · January 17, 2023, 9:14pm

Hi @timvandamcs

Thank you for writing in with your question. The statement above is correct: you cannot resume runs that were part of a sweep. To re-run a specific configuration of a sweep, wait until the sweep runs to completion, then delete the failed runs from the runs table. Resume the sweep and start a sweep agent with the existing sweep ID. The failed runs will re-execute.

I will add your notes on resuming runs of a sweep as a feature request with our product team.
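
For anyone who would rather script the "delete the failed runs" step than click through the runs table, here is a minimal sketch using the wandb public API. It assumes the sweep has already run to completion, as described above. The helper names (select_failed_runs, delete_failed_sweep_runs), the set of states treated as failed, and the "entity/project/sweep_id" path are illustrative assumptions, not something taken from this thread. Deleting runs is destructive, so check the selection before calling delete.

```python
from types import SimpleNamespace

# States treated as "failed" in this sketch. This set is an assumption;
# adjust it to the states your own sweep's runs actually show.
FAILED_STATES = frozenset({"crashed", "failed", "killed"})


def select_failed_runs(runs, failed_states=FAILED_STATES):
    """Return the runs whose .state is one of failed_states (hypothetical helper)."""
    return [run for run in runs if run.state in failed_states]


def delete_failed_sweep_runs(sweep_path):
    """Delete the failed runs of a sweep and return their IDs.

    sweep_path is "entity/project/sweep_id" (placeholder values, fill in your own).
    """
    import wandb  # imported lazily so select_failed_runs works without wandb installed

    api = wandb.Api()
    sweep = api.sweep(sweep_path)
    failed = select_failed_runs(sweep.runs)
    for run in failed:
        print(f"deleting run {run.id} (state={run.state})")
        run.delete()
    return [run.id for run in failed]


if __name__ == "__main__":
    # Offline demonstration with stand-in run objects; no network access needed.
    demo_runs = [
        SimpleNamespace(id="a1", state="finished"),
        SimpleNamespace(id="b2", state="crashed"),
    ]
    print([run.id for run in select_failed_runs(demo_runs)])
```

After the failed runs are deleted, resume the sweep and start an agent with the existing sweep ID, as in the steps above. Whether the agent then re-executes the deleted configurations is taken from the reply above, not something this sketch verifies.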

system · March 18, 2023, 9:15pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
