Resuming training

markstent · May 8, 2023, 3:54pm

Hey everyone, I'm new to WandB and would love some advice.

This is my current setup:

Run the model the first time and save it every epoch (based on a variable) using the following:

            # log wandb artifact
            model_artifact = wandb.Artifact(
                f'{args.project_name}',
                type='model',
                description='sonic-diffusion-model-256'
                )

            model_artifact.add_dir(args.output_dir)
            wandb.log_artifact(
                model_artifact,
                aliases=[f'step_{global_step}', f'epoch_{epoch}']
                )

I have resume set to 'True' in the configs.
I then load the last saved model (I am using diffusion from Hugging Face):

if wandb.run.resumed:
    print("Resuming run...")
    artifact_name = args.model_resume_name
    artifact = wandb.use_artifact(artifact_name)

    # Download the model file(s) and return the path to the downloaded artifact
    artifact_dir = artifact.download()

    pipeline = AudioDiffusionPipeline.from_pretrained(artifact_dir)

    mel = pipeline.mel
    model = pipeline.unet

How do I continue training from the last epoch I left off at? Is step 3) above (loading the saved model) even necessary? Does resume also load the optimizer settings and the learning rate at a specific epoch?

The docs are not very clear.

I hope I am articulating myself properly.

Mark

uma-wandb · May 10, 2023, 9:42pm

Hi Mark,

I responded to your issue via email a short while ago, but will respond here as well for visibility.

It looks like you’re storing your epochs as aliases, and in order to properly resume training from a given epoch, you need to access that explicitly under your if wandb.run.resumed line. One way you could go about doing that is by including the following line underneath to properly access the correct epoch:

start_epoch = int(next(alias for alias in artifact.aliases if alias.startswith('epoch')).split('_')[1])

which would give you the correct epoch to start at going forward.
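A slightly more defensive version of that lookup can be sketched as a small helper. This is only a sketch: the name `last_epoch_from_aliases` is hypothetical, and it assumes the artifact's aliases arrive as a list of strings that may also contain unrelated entries (such as the `step_...` aliases logged above).

```python
import re


def last_epoch_from_aliases(aliases):
    """Return the largest N among aliases shaped like 'epoch_N', or None.

    Hypothetical helper: it ignores every alias that does not match
    the 'epoch_<digits>' pattern, e.g. 'step_1200'.
    """
    epochs = [
        int(match.group(1))
        for alias in aliases
        if (match := re.fullmatch(r"epoch_(\d+)", alias))
    ]
    return max(epochs) if epochs else None
```

Inside the `if wandb.run.resumed:` block it could be called as `start_epoch = last_epoch_from_aliases(artifact.aliases)`, with a `None` result meaning that no epoch alias was found on the artifact.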

Let me know if you need anything else!

Uma

markstent · May 11, 2023, 7:16am

Where would I use start_epoch once I have it? Where in the training loop would I apply it to make it work?

markstent · May 11, 2023, 7:20am

I assume it would be in the training loop, changing it to:

for epoch in range(start_epoch, args.num_epochs):

uma-wandb · May 11, 2023, 6:56pm

Hi Mark,

You are definitely correct to assume it would be in the training loop and that you need to change that particular line of code. Since you save the most recent epoch number when you save the artifact, you should be referencing start_epoch+1 to get the following epoch.
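The off-by-one can be made explicit with a tiny helper. This is a sketch under two assumptions: epochs are zero-based, and the artifact tagged `epoch_N` is logged after epoch N has finished. The function name `epochs_to_run` is made up for illustration.

```python
def epochs_to_run(last_saved_epoch, num_epochs):
    """Epoch indices still to train.

    last_saved_epoch is the epoch number read from the artifact alias,
    or None for a fresh run with nothing saved yet.
    """
    start = 0 if last_saved_epoch is None else last_saved_epoch + 1
    return range(start, num_epochs)
```

The training loop would then read `for epoch in epochs_to_run(start_epoch, args.num_epochs):`, which yields an empty range if the last saved epoch was already the final one.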

Also, since you are calling scheduler.step() and optimizer.step(), be sure to save the scheduler and optimizer state as well (either as an artifact or anything else of your choosing) to ensure you’re using the correct values when resuming from a specific epoch.
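One possible way to do that, assuming a PyTorch optimizer and learning-rate scheduler, is to write their `state_dict()` output to a file inside the same directory that is already being added to the artifact. The helper names and the `training_state.pt` file name below are hypothetical, not part of W&B or the diffusion library.

```python
import os

import torch

STATE_FILE = "training_state.pt"  # hypothetical file name


def save_training_state(output_dir, optimizer, scheduler, epoch, global_step):
    """Write optimizer/scheduler state next to the model files in output_dir."""
    path = os.path.join(output_dir, STATE_FILE)
    torch.save(
        {
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "epoch": epoch,
            "global_step": global_step,
        },
        path,
    )
    return path


def load_training_state(artifact_dir, optimizer, scheduler):
    """Restore optimizer/scheduler in place; return (epoch, global_step)."""
    path = os.path.join(artifact_dir, STATE_FILE)
    state = torch.load(path, map_location="cpu")
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"], state["global_step"]
```

In the setup from the first post, `save_training_state(args.output_dir, ...)` would run before `model_artifact.add_dir(args.output_dir)` so the file is uploaded with the model, and `load_training_state(artifact_dir, ...)` would run after `artifact.download()` once the optimizer and scheduler have been constructed. The stored `epoch` could also serve as a cross-check against the value parsed from the alias.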

Best,

Uma

system · July 10, 2023, 7:21am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
