Resuming sweep runs on a cluster with job time limits

Many users (including myself) on our compute cluster use wandb Sweeps, but a current pain point is that our cluster admins limit each job to 6 hours. For some applications this is not enough time to train Sweep trial configs to convergence, so they get cut off. I see the docs state that “resuming a run which was executed as part of a Sweep is not supported.” :frowning:

However, the pre-emptible sweeps section also mentions that it is possible to mark runs as pre-empted and “resume logging at the step where it was interrupted”. Sounds great, but there are a couple of concerns:

  1. How can I ensure this resuming will actually resume properly, i.e. pick up model weights where they left off and so on? If I save model weights, optimizer state, etc. with wandb.save(), will they be automatically pulled in when doing wandb.init(resume=True), or do I need to explicitly use wandb.restore()?
  2. More importantly, I’m not sure how to actually implement this on our system. The current workflow is to generate the sweep config and sweep ID, then submit a batch of jobs (one job per trial) to the cluster with the appropriate sweep ID. Each of these calls wandb.agent followed by wandb.init to get a sweep trial config and run it. Presumably I’d then also have to launch additional jobs with wandb.init(resume=True) to pick up the runs that don’t finish in time (the number of which I won’t know a priori), and these would clog up the queue. I guess this would all have to be manual: I’d go and find all the runs which didn’t finish and launch a corresponding number of jobs to complete them?
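For context, a minimal sketch of the one-job-per-trial workflow described above (the sweep path, hyperparameter names, and training loop are all placeholders, not from a real setup):

```python
def train():
    import wandb
    # Inside an agent, wandb.init() picks up the sampled trial config
    # for this run and exposes it as run.config.
    run = wandb.init()
    lr = run.config.get("lr", 1e-3)  # hypothetical hyperparameter
    for step in range(100):
        run.log({"loss": lr / (step + 1)})  # placeholder training loop
    run.finish()

def run_one_trial(sweep_path):
    import wandb
    # count=1: one trial per cluster job, matching the setup described above.
    # sweep_path looks like "entity/project/sweep_id".
    wandb.agent(sweep_path, function=train, count=1)
```

Each Slurm job would call `run_one_trial(...)` with the sweep ID generated up front.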

Hi @pzharrington,

For 1., you would need to call restore() if you called save(). In general, you would want to read and initialize your model from a checkpoint you have saved: the resume flag only restores the W&B run to where it was saved; we do not interface with your model. Each run has a .resumed property holding a boolean, so you can write a simple if-else around it.
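A sketch of that if-else pattern, assuming a simple JSON checkpoint as a stand-in for whatever your framework saves (the filename, checkpoint format, and training loop are illustrative; only wandb.init/restore/save and run.resumed come from the W&B API):

```python
import json

CKPT = "checkpoint.json"  # hypothetical checkpoint filename

def save_checkpoint(path, step, weights):
    # Stand-in for e.g. torch.save({...}, path); any serialization works.
    with open(path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)

def start_step_from(state):
    # Resume one step past the last completed step in the checkpoint.
    return state["step"] + 1

def main():
    import wandb
    run = wandb.init(resume=True)
    if run.resumed:
        # W&B restores the run history only; restoring model state is on you.
        # wandb.restore() downloads the file previously uploaded with wandb.save().
        ckpt_file = wandb.restore(CKPT)
        with open(ckpt_file.name) as f:
            state = json.load(f)
        step = start_step_from(state)
    else:
        step = 0
    while step < 1000:
        # ... one training step ...
        save_checkpoint(CKPT, step, weights=[0.0])  # placeholder weights
        wandb.save(CKPT)  # re-upload so the latest checkpoint lives on W&B
        step += 1
```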

For 2., could you share where you will be training your models? I’ll have to look into this further and get back to you.

Thanks,
Ramit

Hi @ramit_goolry, thanks for the response. For (1) it sounds like save() and restore() would work well. Ideally our users wouldn’t have to keep track of where checkpoints live on the local filesystem, etc., if the resuming of a run could be driven by some flag and pull the checkpoint from wandb – I can describe more below.

For (2), our system is Perlmutter, a supercomputer where job submission/scheduling is handled by Slurm. We cannot change the job time limits for a number of reasons, so to get long-running jobs users have to jump through extra hoops (e.g. scheduling a reservation on the system), which is not ideal. Thus, for sweeps where each trial may take longer than 6 hours (our job time limit), there is a need for checkpoint/restart of individual sweep runs. I’d assume there might be something you could do with mark_preempting() and the sweep controller/backend, where a run could be marked as pre-empted and re-queued with some flag like “needs resuming”. Then whenever the next agent/job launches, it gets assigned that run and, by checking the same “needs resuming” flag, can pull the checkpoint from wandb and resume training at the right point.
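On the job side, the pre-emption half of this could plausibly be wired up today using Slurm’s pre-termination signal: submitting with something like `#SBATCH --signal=B:USR1@300` makes Slurm deliver SIGUSR1 a few minutes before the time limit. A sketch, where only mark_preempting() is assumed from the W&B API and everything else is illustrative:

```python
import signal

def install_preemption_handler(run, on_preempt):
    """On SIGUSR1: flag the run as about to be pre-empted, then invoke a
    user callback (e.g. save a final checkpoint and write a 'needs
    resuming' marker file)."""
    def handler(signum, frame):
        run.mark_preempting()  # tell the W&B backend the run will be cut off
        on_preempt()
    signal.signal(signal.SIGUSR1, handler)
```

The `run` argument is duck-typed, so this works with a real W&B run object or anything else exposing mark_preempting().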

This would also be great for fault tolerance and general pre-emptible instances (e.g. on a cloud provider).

Hey @pzharrington,

Apologies for the delay here - you are right, mark_preempting immediately communicates to the backend that the run is about to be pre-empted. In terms of resuming, is there persistent storage on Perlmutter (or maybe network-attached storage) where you could record the active sweep ID? Storing the current run ID outside the process should let you resume it later.

Yes, we do have persistent storage where run IDs could be stored. Based on our conversation so far I’m confident I could manually set something up myself to handle the checkpoint/restart case. Something like: at the start of each sweep job, check the local filesystem for pre-empted sweep runs and resume one if it exists; otherwise proceed with a new trial.
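That bookkeeping could look roughly like the following sketch, where each pre-empted run leaves a marker file (named after its run ID) in a shared directory, and each new job tries to claim one before starting a fresh trial. The directory layout, marker format, and paths are all made up for illustration:

```python
import os

def claim_preempted_run(marker_dir):
    """Return a pre-empted run ID to resume, or None if there are none.

    Claiming a marker means deleting its file, so no two jobs pick up
    the same run; os.remove is atomic on POSIX, so only one job wins.
    """
    try:
        markers = sorted(os.listdir(marker_dir))
    except FileNotFoundError:
        return None  # no marker directory yet: nothing to resume
    for name in markers:
        try:
            os.remove(os.path.join(marker_dir, name))
            return name
        except FileNotFoundError:
            continue  # another job claimed this one first
    return None

def start_job(marker_dir):
    import wandb
    run_id = claim_preempted_run(marker_dir)
    if run_id is not None:
        return wandb.init(id=run_id, resume="must")  # resume the pre-empted run
    return wandb.init()  # otherwise let the agent assign a fresh trial
```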

I guess what I’m seeking is a more general and user-friendly setup, incorporated into the wandb backend for sweep agents, that would lower the implementation burden for our users. We have many early-career researchers who are newcomers to DL, and implementing something like this would probably be a barrier for many of them. This is getting more into feature-request territory now, and I do see a request on GitHub for essentially this exact functionality here. In that issue there is talk of a new “rewind” feature which seems to have been delayed multiple times now. Any idea whether it is still under development?

Understood - in all honesty, the rewind feature was deprioritized in favour of other features being released at the moment. I can raise the priority on this feature and try to push for its development, but I’m not sure I can give you a timeline at the moment.

I’ll have to have a chat with some folks internally and get back to you regarding the state of the rewind feature here.