Understanding local cache, online sync and `save/restore`

Hi all,

I have the following problem for which I’m trying to find a suitable approach:

  • I’m running wandb sweeps
  • I’m running wandb agent on a SLURM cluster on which my jobs are preempted every 4h.
  • Hence, I need to save and restore my model, optimizer and replay buffer regularly.
  • I was planning on using wandb.save and wandb.restore to do so. That seems to work fine for model & optimizer, but the replay buffer is LARGE, hence I don’t want to actually upload it to wandb.

I think my ideal solution would be:

  • Sync logs online
  • Save the replay buffer only locally (I assume that’s what the ~/.cache directory is for) and locally restore from there when a new job starts. All the jobs share the same drive.
  • I don’t care too much whether model and optimizer are saved online, as long as they can be restored.

Is it possible to selectively sync only some of this online?
Do you have any other ideas how I could solve this?

I could of course manually save the replay buffer somewhere, but for that I need to manually differentiate between different runs. The docs say that run.name is not unique. Would run.sweep_id + run.name be guaranteed to be unique?

Hi @migl, thank you for reaching out with your question.

Regarding saving the buffer locally in a unique per-run location: each run has a randomly generated ID, available as run.id, which is unique within its project. Combining project + run.id keeps the location unique across different projects as well.
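As a rough illustration of the project + run.id idea, here is a minimal sketch that builds one directory per run on the shared drive. The function name `replay_buffer_dir` and the `root` argument are hypothetical. The only things taken from wandb are the `project` and `id` attributes of the run object, so a simple stand-in object works for trying it out without wandb installed.

```python
from pathlib import Path
from types import SimpleNamespace


def replay_buffer_dir(run, root):
    """Return (and create) a per-run directory under `root`.

    `run` is any object exposing `.project` and `.id`, for example the
    object returned by wandb.init(). `root` is a hypothetical location
    on the drive shared by all jobs.
    """
    path = Path(root) / run.project / run.id
    path.mkdir(parents=True, exist_ok=True)
    return path


# Stand-in for a real wandb run; with wandb you would pass the run itself:
#     buffer_dir = replay_buffer_dir(wandb.run, "/path/to/shared/drive")
example_run = SimpleNamespace(project="my-project", id="a1b2c3d4")
```

Since the directory name depends only on the project and the run ID, a job that picks up the same run after preemption would land in the same directory.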

I will also investigate if there are alternative options for your use case and I will get back to you with further info.

Finally, as I believe you are part of an Enterprise team for W&B, I wanted to let you know that you could also have access to the shared Slack channel between W&B and your company for support. You may need to check internally on how to request access, but it would be a more direct support channel for you.

Thank you, I didn’t realize that channel existed :).

No problem. Let me know if project+run.id works for you for the replay buffer while I investigate other possible workflows for this.

Hi @migl, I wanted to follow up here.

Your solution of keeping the replay buffer offline seems to be the best one at this time. I wanted to check whether you had a chance to test using project+run.id as the unique identifier for each run, and whether there is anything else I can help you with on this.
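For the offline part, here is a minimal sketch of saving and restoring the buffer on the shared drive without going through wandb at all. It assumes the buffer can be pickled, and the helper names `save_buffer_atomic` and `load_buffer` are made up for this example. Because jobs can be preempted in the middle of a write, the sketch writes to a temporary file first and then renames it, so a partially written file should never replace the last good checkpoint.

```python
import os
import pickle
import tempfile


def save_buffer_atomic(buffer, path):
    """Pickle `buffer` to `path` via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the same directory so the rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(buffer, f)
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_buffer(path, default=None):
    """Return the saved buffer, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```

At the start of each job you would call `load_buffer` with the per-run path and fall back to an empty buffer when it returns the default. If the job is killed before the rename, a stray `.tmp` file may be left behind, so it may be worth clearing those out at startup.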