Resuming run/training

Hi (please note that the code snippets are shown in italics),
I created a new run using code below:

id = wandb.util.generate_id()
run = wandb.init(project='checkpoint', name='new_load', id=id, config=configs)

and the results (let's say for 10 epochs) were stored in my account as expected. I also saved the last model in the run using wandb.save('last_model.h5'). Now I want to continue training last_model from epoch 10 for 10 more epochs, up to epoch 20. So I first restore the model using the code below:

restored_model = wandb.restore('last_model.h5', run_path="…/checkpoint/id")

Then I load the weights from restored_model into a freshly built model:

model = build_model()
model.load_weights(restored_model.name)

and then I compiled the model. However, when I execute model.fit(), nothing happens: the code runs without any error, but there is no training and no epochs are logged, just like executing an empty cell.

num_epoch = config.epochs - wandb.run.step
model.fit(x_train, y_train,
          batch_size=config.batch_size,
          verbose=1,
          epochs=num_epoch,
          initial_epoch=wandb.run.step,
          validation_data=(x_valid, y_valid),
          shuffle=False,
          callbacks=[WandbCallback(training_data=(x_train, y_train),
                                   validation_data=(x_valid, y_valid))])

I really appreciate any help, as I badly need to get resuming working.

By the way, I have been wondering why, in the example below from the resume documentation, you call model.compile() even after loading the entire model. You don't need to compile the model when you load it in full, since load_model restores the compile state. I believe the example is not correct and should be edited:
import keras
import numpy as np
import wandb
from wandb.keras import WandbCallback

wandb.init(project="preemptible", resume=True)

if wandb.run.resumed:
    # restore the best model
    model = keras.models.load_model(wandb.restore("model-best.h5").name)
else:
    a = keras.layers.Input(shape=(32,))
    b = keras.layers.Dense(10)(a)
    model = keras.models.Model(inputs=a, outputs=b)

model.compile("adam", loss="mse")
model.fit(np.random.rand(100, 32), np.random.rand(100, 10),
          # set the resumed epoch
          initial_epoch=wandb.run.step, epochs=300,
          # save the best model if it improved each epoch
          callbacks=[WandbCallback(save_model=True, monitor="loss")])

Hi @sajmahmo,

I’m sorry this is happening to you - this is very odd behavior. We’ll have to run some tests to get a better sense of the situation. To start, could you restore the model and call model.evaluate() on a hold-out set to make sure the weights are restored correctly? Additionally, it would help to check the value of num_epoch that is being passed to model.fit(); the values of config.epochs and wandb.run.step might not be aligned.
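One detail worth keeping in mind while checking num_epoch: Keras's fit() treats epochs as the final epoch index when initial_epoch is set, not as a count of additional epochs, so the number of epochs actually run works out to max(0, epochs - initial_epoch). A quick sketch of that arithmetic (the helper name is mine, just for illustration):

```python
def effective_epochs(epochs, initial_epoch=0):
    """Epochs Keras fit() will actually run: `epochs` is the final
    epoch index when initial_epoch is set, not a count of new epochs."""
    return max(0, epochs - initial_epoch)

# If the value passed as epochs holds only the *additional* epochs
# (say 10) while initial_epoch is the resumed step (say 10),
# fit() runs zero epochs:
print(effective_epochs(epochs=10, initial_epoch=10))   # 0
# Passing the *total* target as epochs trains the remaining 10:
print(effective_epochs(epochs=20, initial_epoch=10))   # 10
```

So if num_epoch ends up less than or equal to initial_epoch, fit() will return immediately with no training, which would look exactly like an empty cell.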

I’ll try to reproduce this issue on my end as well, but it certainly would help to know the results on your end where this issue is known to appear.

Thanks,
Ramit

Hi,

I tried what you said, and it seems the weights are correctly restored; the metric is quite high, so it can't be the weights. num_epoch is also fine, because I tested it before. For example, if I have already trained the model for 100 epochs and want to continue for another 100, I set config.epochs=200; wandb.run.step is equal to 100, so when I pass num_epoch to model.fit(), it will train the model for 100 more epochs, up to epoch 200 (considering that initial_epoch=wandb.run.step).

I think it must be the resume argument in wandb.init(). It is somewhat confusing whether it should be True or "must"; I tried both, and both resulted in the same issue. I believe resume simply does not play well with model.fit(), because it works fine for everything else. For instance, when I wanted to correct the config info of a run, I used resume=True or "must", and the run overview was updated correctly.
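For reference, here is my current understanding of the possible resume values, written down as a small lookup (paraphrased from my reading of the docs, so it may be off):

```python
# My (possibly imperfect) reading of wandb.init's resume argument.
# The descriptions are paraphrases, not authoritative documentation.
RESUME_MODES = {
    True: "try to resume the previous run; otherwise start a new one",
    "allow": "resume the run with the given id if it exists, else create it",
    "must": "fail unless a run with the given id already exists",
    "never": "fail if a run with the given id already exists",
}

for mode, meaning in RESUME_MODES.items():
    print(f"resume={mode!r}: {meaning}")
```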

Thanks for your support,
Sajjad

Dear Ramit,

It has been a while since your response to my question. I wonder whether you have been working on my problem. Please let me know what is going on, or whether it has been closed from your point of view.

Regards,
Sajjad

Hi Sajjad,

Apologies for the delay here. I did look into your problem and tried to reproduce it on my end, but I was not able to reproduce the bug you are seeing: for me, the restored model trains and learns correctly. Would it be possible for you to share a Google Colab notebook reproducing your issue?

Thanks,
Ramit

Hi Ramit,

I can’t upload an .ipynb file here. How can I share the file with you?

Regards,
Sajjad

Hey @sajmahmo,

You can email us at support@wandb.com with the file!

Thanks,
Ramit

Hi Sajjad,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

Dear Ramit,

I sent the email containing the Colab notebook with the subject line "Resuming run/training".

Regards,
Sajjad