Resuming run/training

Hi (please note that the code snippets are shown in italics),
I created a new run using code below:

id = wandb.util.generate_id()
run = wandb.init(project='checkpoint', name='new_load', id=id, config=configs)

and the results (let's say for 10 epochs) were stored in my account as expected. I also saved the last model in the run using wandb.save('last_model.h5'). Now I want to continue training last_model from epoch 10 for 10 more epochs, up to epoch 20. So I first restore the model using the code below:

restored_model = wandb.restore('last_model.h5', run_path="…/checkpoint/id")

Then I load the weights from restored_model into a freshly built model:

model = build_model()
model.load_weights(restored_model.name)

and then I compiled the model. However, when I execute model.fit(), nothing happens: the code runs without any error, but there is no training and no epochs are logged, just like executing an empty cell.

num_epoch = config.epochs - wandb.run.step
model.fit(x_train, y_train,
          batch_size=config.batch_size,
          verbose=1,
          epochs=num_epoch,
          initial_epoch=wandb.run.step,
          validation_data=(x_valid, y_valid),
          shuffle=False,
          callbacks=[WandbCallback(training_data=(x_train, y_train),
                                   validation_data=(x_valid, y_valid))])

I really appreciate any help, as I badly need to get resuming working.

By the way, I have been wondering why, in the example below from the resume documentation, you call model.compile() even after loading the entire model. You don't need to compile the model when you load it in full, since load_model restores the compile state. I believe the example is not correct and should be edited:
import keras
import numpy as np
import wandb
from wandb.keras import WandbCallback

wandb.init(project="preemptible", resume=True)

if wandb.run.resumed:
    # restore the best model
    model = keras.models.load_model(wandb.restore("model-best.h5").name)
else:
    a = keras.layers.Input(shape=(32,))
    b = keras.layers.Dense(10)(a)
    model = keras.models.Model(inputs=a, outputs=b)

model.compile("adam", loss="mse")
model.fit(np.random.rand(100, 32), np.random.rand(100, 10),
          # set the resumed epoch
          initial_epoch=wandb.run.step, epochs=300,
          # save the best model if it improved each epoch
          callbacks=[WandbCallback(save_model=True, monitor="loss")])

Hi @sajmahmo,

I’m sorry this is happening to you - this is very odd behavior. We’ll have to run some tests to get a better sense of the situation. To start, could you restore the model and call model.evaluate() on a hold-out set to make sure the weights are restored correctly? Additionally, it would help to check the value of num_epoch that is being passed to model.fit(); the values of config.epochs and wandb.run.step might not be aligned.
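One detail worth keeping in mind while checking num_epoch: Keras's fit() treats epochs as the final epoch index when initial_epoch is set, not as a count of additional epochs, so the number of epochs actually run works out to max(0, epochs - initial_epoch). A quick sketch of that arithmetic (the helper name is mine, just for illustration):

```python
def effective_epochs(epochs, initial_epoch=0):
    """Epochs Keras fit() will actually run: `epochs` is the final
    epoch index when initial_epoch is set, not a count of new epochs."""
    return max(0, epochs - initial_epoch)

# If the value passed as epochs holds only the *additional* epochs
# (say 10) while initial_epoch is the resumed step (say 10),
# fit() runs zero epochs:
print(effective_epochs(epochs=10, initial_epoch=10))   # 0
# Passing the *total* target as epochs trains the remaining 10:
print(effective_epochs(epochs=20, initial_epoch=10))   # 10
```

So if num_epoch ends up less than or equal to initial_epoch, fit() will return immediately with no training, which would look exactly like an empty cell.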

I’ll try to reproduce this issue on my end as well, but it certainly would help to know the results on your end where this issue is known to appear.

Thanks,
Ramit

Hi,

I tried what you said, and it seems the weights are correctly restored; the metric is quite high, so it can't be the weights. num_epoch is also fine, because I tested it before. For example, if I have already trained the model for 100 epochs and want to continue for another 100, I set config.epochs=200; wandb.run.step is equal to 100, so when I pass num_epoch to model.fit(), it will train the model for 100 more epochs, up to epoch 200 (considering that initial_epoch=wandb.run.step).

I think it must be the resume argument in wandb.init(). It is somewhat confusing whether it should be True or "must"; I tried both, and both resulted in the same issue. I believe resume simply does not play well with model.fit(), because it works fine for everything else. For instance, when I wanted to correct the config info of a run, I used resume=True or "must", and the run overview was updated correctly.
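For reference, here is my current understanding of the possible resume values, written down as a small lookup (paraphrased from my reading of the docs, so it may be off):

```python
# My (possibly imperfect) reading of wandb.init's resume argument.
# The descriptions are paraphrases, not authoritative documentation.
RESUME_MODES = {
    True: "try to resume the previous run; otherwise start a new one",
    "allow": "resume the run with the given id if it exists, else create it",
    "must": "fail unless a run with the given id already exists",
    "never": "fail if a run with the given id already exists",
}

for mode, meaning in RESUME_MODES.items():
    print(f"resume={mode!r}: {meaning}")
```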

Thanks for your support,
Sajjad

Dear Ramit,

It has been a while since your response to my question. I wonder whether you have been working on my problem. Please let me know what is going on, or whether it has been closed from your point of view.

Regards,
Sajjad

Hi Sajjad,

Apologies for the delay here. I did look into your problem and tried to reproduce it on my end, but I was not able to reproduce the bug you are seeing: for me, the restored model trains and learns correctly. Would it be possible for you to share a Google Colab notebook reproducing your issue?

Thanks,
Ramit

Hi Ramit,

I can’t upload an .ipynb file here. How can I share the file with you?

Regards,
Sajjad

Hey @sajmahmo,

You can email us at support@wandb.com with the file!

Thanks,
Ramit

Hi Sajjad,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Weights & Biases

Dear Ramit,

I sent the email containing the Colab notebook with the subject line "Resuming run/training".

Regards,
Sajjad