Understanding local cache, online sync and `save/restore`

migl · April 25, 2024, 10:31am

Hi all,

I have the following problem for which I’m trying to find a suitable approach:

I’m running wandb sweeps
I’m running wandb agent on a SLURM cluster on which my jobs are preempted every 4h.
Hence, I need to save and restore my model, optimizer and replay buffer regularly.
I was planning on using wandb.save and wandb.restore to do so. That seems to work fine for model & optimizer, but the replay buffer is LARGE, hence I don’t want to actually upload it to wandb.

I think my ideal solution would be:

Sync logs online
Save the replay buffer only locally (I assume that’s what the ~/.cache directory is for) and locally restore from there when a new job starts. All the jobs share the same drive.
I don’t care too much whether model and optimizer are saved online, as long as they can be restored.

Is it possible to sync only selected files online?
Do you have any other ideas how I could solve this?

I could of course manually save the replay buffer somewhere, but for that I need to manually differentiate between different runs. The docs say that run.name is not unique. Would run.sweep_id + run.name be guaranteed to be unique?

fmamberti-wandb · April 25, 2024, 4:15pm

Hi @migl, thank you for reaching out with your question.

To save the buffer locally in a unique per-run location, you can use run.id: each run gets a randomly generated ID that is unique within its project. Combining the project name with run.id gives an identifier that is also unique across projects.
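As a rough illustration of the project + run.id idea, here is a minimal sketch that builds a per-run file path on the shared drive. The helper name `replay_buffer_path`, the root directory, and the file name `replay_buffer.pkl` are hypothetical choices for this example, not part of the wandb API. It also assumes that a restarted job resumes the same run, so that run.id stays the same after preemption.

```python
from pathlib import Path


def replay_buffer_path(root, project, run_id):
    """Return <root>/<project>/<run_id>/replay_buffer.pkl, creating the directory.

    `root` is any directory on the shared drive (hypothetical layout).
    """
    run_dir = Path(root) / project / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir / "replay_buffer.pkl"


# Possible usage inside a training script, after wandb.init():
#   run = wandb.init()
#   path = replay_buffer_path("/shared/buffers", run.project, run.id)
```

Because the path is outside the run's files directory and is never passed to wandb.save, nothing here uploads the buffer.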

I will also investigate if there are alternative options for your use case and I will get back to you with further info.

Finally, as I believe you are part of an Enterprise Team for W&B, I wanted to let you know that you could also have access to the shared Slack channel between W&B and your company for support. You may need to check internally on how to request access, but it would be a more direct support channel for you.

migl · April 26, 2024, 4:24am

Thank you, I didn’t realize that channel existed :).

fmamberti-wandb · April 26, 2024, 12:21pm

No problem - let me know if the project+run.id works for you for the replay buffer while I investigate other possible workflows for this.

fmamberti-wandb · May 3, 2024, 5:40pm

Hi @migl, I wanted to follow up here.

Your solution of keeping the replay buffer offline seems to be the best option at this time. I wanted to check whether you had the chance to test using project+run.id as a unique identifier for each run, and whether there is anything else I can help you with on this.
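For keeping the buffer purely local, one possible sketch of save and restore is below. The function names `save_buffer` and `load_buffer` are hypothetical, and pickle is only an assumption about how the buffer is serialized. The write goes to a temporary file first and is then renamed, so a job preempted in the middle of a save should not leave a truncated file behind.

```python
import os
import pickle
from pathlib import Path


def save_buffer(buffer, path):
    """Write the buffer to a temp file, then atomically move it into place."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(buffer, f)
    # os.replace is atomic when tmp and path are on the same filesystem
    os.replace(tmp, path)


def load_buffer(path, default=None):
    """Return the saved buffer, or `default` if no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```

A new job would call `load_buffer` at startup and fall back to an empty buffer when nothing has been saved yet.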

fmamberti-wandb · May 8, 2024, 10:07am

Hi @migl, I wanted to follow up on this request. Please let us know if we can be of further assistance.

fmamberti-wandb · May 10, 2024, 12:38pm

Hi @migl, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!
