Let’s say I ran a model for 500 epochs and want to run an additional 100 epochs. In that case, I can specify the run ID to resume.
But what should I do if I later find that the additional 100-epoch run was configured incorrectly? Or what if I have to stop the resumed run partway through for some reason (like a computer shutdown)? Can I delete the resumed part while keeping the original 500 epochs? I can’t find any documentation explaining whether this is possible and how to do it.
Unfortunately, once a run has been resumed and its data logged to wandb, you cannot revert it to the earlier epochs. What you can do instead is execute the run in offline mode and sync the results to wandb only if the run finishes without problems. That way the run is uploaded only once you know it completed. Another option is to use Model Registry to checkpoint your model and use the checkpointed model to create new runs.
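The offline-then-sync approach could be scripted roughly as follows. This is a minimal sketch, not an official recipe: it assumes the training script (the hypothetical `train.py`) calls `wandb.init()` itself, that wandb writes offline runs to `wandb/offline-run-*` directories under the working directory, and that the `wandb` CLI is on your PATH. The helper names `build_offline_env` and `train_then_sync` are made up for illustration. It does not cover attaching the synced data to an existing run ID.

```python
import glob
import os
import subprocess


def build_offline_env(base_env=None):
    """Return a copy of the environment with wandb switched to offline mode."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_MODE"] = "offline"  # wandb logs to local disk only
    return env


def train_then_sync(train_cmd, wandb_dir="wandb", run=subprocess.run):
    """Run training offline; upload the new offline runs only on success.

    `run` is injectable so the logic can be exercised without wandb installed.
    Returns the list of offline run directories that were synced.
    """
    pattern = os.path.join(wandb_dir, "offline-run-*")
    before = set(glob.glob(pattern))

    result = run(train_cmd, env=build_offline_env())
    if result.returncode != 0:
        # Training failed or was interrupted: nothing is uploaded, and the
        # offline data stays on disk for inspection or deletion.
        return []

    new_dirs = sorted(set(glob.glob(pattern)) - before)
    for run_dir in new_dirs:
        # Assumes `wandb sync <dir>` uploads one offline run directory.
        run(["wandb", "sync", run_dir], check=True)
    return new_dirs


if __name__ == "__main__":
    # Hypothetical training script; replace with your own command.
    synced = train_then_sync(["python", "train.py"])
    print("Synced runs:", synced)
```

If the training command exits with a non-zero status, nothing reaches the server, so a misconfigured or interrupted attempt can simply be discarded locally.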
Hi Minkoo, since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!