I then load the last saved model (I am using diffusers from Hugging Face):
if wandb.run.resumed:
    print("Resuming run...")
    artifact_name = args.model_resume_name
    artifact = wandb.use_artifact(artifact_name)
    # Download the model file(s) and return the path to the downloaded artifact
    artifact_dir = artifact.download()
    pipeline = AudioDiffusionPipeline.from_pretrained(artifact_dir)
    mel = pipeline.mel
    model = pipeline.unet
How do I continue training from the last epoch I left off at? Is 3) above even necessary? Does the resume load the optimizer settings and the learning rate at that specific epoch?
I responded to your issue via email a short while ago, but will respond here as well for visibility.
It looks like you're storing your epochs as artifact aliases, and in order to properly resume training from a given epoch, you need to read that alias back explicitly under your if wandb.run.resumed line. One way to do that is to add a line underneath it that extracts the correct epoch from the artifact:
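As a minimal sketch of that idea (the helper name and the "epoch-N" alias format are assumptions here; adapt them to however you attached aliases when logging the artifact):

```python
# Hypothetical helper: recover the last saved epoch number from the
# artifact's aliases, assuming an alias like "epoch-12" was attached
# when the artifact was logged. In wandb, artifact.aliases returns a
# list of alias strings, e.g. ["latest", "epoch-12"].
def epoch_from_aliases(aliases):
    for alias in aliases:
        if alias.startswith("epoch-"):
            return int(alias.split("-", 1)[1])
    raise ValueError("no epoch-N alias found on this artifact")

# In your resume branch you would pass artifact.aliases instead:
start_epoch = epoch_from_aliases(["latest", "epoch-12"])
```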
You are definitely correct that this belongs in the training loop and that you need to change that particular line of code. Since you save the most recent epoch number when you save the artifact, you should reference start_epoch + 1 so training resumes at the following epoch.
Also, since you are calling scheduler.step() and optimizer.step(), be sure to save their states as well (as an artifact or by any other means of your choosing) so that you resume with the correct optimizer state and learning rate at that epoch.
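One common way to do this with PyTorch is to checkpoint the state_dicts alongside the model (a sketch; the layer, optimizer, scheduler, and file name are illustrative stand-ins for yours):

```python
import torch

# Stand-in model/optimizer/scheduler for illustration.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
epoch = 12  # the epoch you just finished

# Save everything needed to resume exactly where you left off.
torch.save(
    {
        "epoch": epoch,
        "optimizer": optimizer.state_dict(),
        "lr_scheduler": lr_scheduler.state_dict(),
    },
    "training_state.pt",
)

# On resume, restore the states before continuing the loop.
state = torch.load("training_state.pt")
optimizer.load_state_dict(state["optimizer"])
lr_scheduler.load_state_dict(state["lr_scheduler"])
start_epoch = state["epoch"]
```

You could log training_state.pt as part of the same wandb artifact as the model so both are versioned together.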