Is it possible to continue training with additional epochs? Also, where can I find the logs locally?

Dear wandb team,

I have recently started using pytorch-lightning together with wandb, so I am using WandbLogger. (Due to link limits, I didn’t include the URL for the documentation.)

Suppose I have trained a model for, say, 10 epochs. The project is wandb_toy, and the run name/ID is toy. After training, two folders are automatically created under ./, i.e., ./wandb and ./wandb_toy. I know that the checkpoints are saved in wandb_toy/toy/checkpoints, and a new run folder appears in ./wandb, say ./wandb/run-20230101_102000.

Now, if I want to continue training from epoch 10 to epoch 20, I know I can load the model state by passing ckpt_path to trainer.fit(), and update the config with allow_val_change=True. However, an additional run folder is still created, e.g., ./wandb/run-20230202_104000.
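For reference, here is a minimal sketch of the resume setup I described above. The project/run names and the checkpoint path are just the placeholders from my example, and MyLightningModule stands in for my actual model:

```python
def resume_training(model, ckpt_path: str, max_epochs: int = 20):
    """Sketch: resume a pytorch-lightning run logged with WandbLogger.

    `model` is any LightningModule; `ckpt_path` is a placeholder for the
    checkpoint file saved under wandb_toy/toy/checkpoints in my example.
    Assumes pytorch-lightning and wandb are installed and you are logged in.
    """
    import pytorch_lightning as pl
    from pytorch_lightning.loggers import WandbLogger

    logger = WandbLogger(project="wandb_toy", name="toy")
    # allow_val_change lets us overwrite a config key that was already
    # logged during the first training run (e.g., max_epochs 10 -> 20).
    logger.experiment.config.update(
        {"max_epochs": max_epochs}, allow_val_change=True
    )

    trainer = pl.Trainer(max_epochs=max_epochs, logger=logger)
    # ckpt_path restores model weights, optimizer state, and the epoch
    # counter, so training continues from epoch 10 rather than epoch 0.
    trainer.fit(model, ckpt_path=ckpt_path)
```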


My questions are:

  • Is there a way to keep saving or updating things in the original run folder ./wandb/run-20230101_102000?
  • Also, is there any way to name the run folder, for example, change ./wandb/run-20230101_102000 to ./wandb/my_toy_run? I have already tried keyword arguments like dir and save_dir, but they don’t seem to do this.
  • If I delete the project on wandb.ai (without deleting the folders in ./wandb) and then continue training from epoch 10 to epoch 20, it will only log epochs 10~20. Is there any way to still recover the previous logs from epochs 0~10? I have looked into Save & Restore Files and Resume Runs, but unfortunately I couldn’t figure it out.
  • If I want to see each epoch’s logs (e.g., accuracy and loss) locally, which file should I look at?

Thanks for reading, and I apologize if anything is unclear.

Best wishes,
Yian

Hi @ntuyianchen, thanks for writing in and for sharing this detailed explanation of your use-case! Answering your questions:

  • This is not possible at the moment: every time you resume a run, a new run-date_time folder is created even though the run itself is the same, and this folder name cannot be changed other than manually. This behavior makes it easy to track the different processes separately.
  • You cannot resume a run at a previous step; it will only resume from the last step. This is intended to avoid overwriting previously logged data.
  • Logged metrics are not saved locally (other than the last epoch in files/wandb-metadata.json), but they are accessible through our public API via run.history(). Here you can have a look at the documentation on how to do this.
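As a rough sketch of that public-API lookup (the entity/project/run-ID path and the metric key names below are placeholders, not values from your project):

```python
def fetch_history(run_path: str):
    """Sketch: pull logged metrics for a run via the wandb public API.

    `run_path` has the form "<entity>/<project>/<run_id>", e.g.
    "my-team/wandb_toy/abc123" (hypothetical). Requires `wandb login`
    to have been run beforehand.
    """
    import wandb

    api = wandb.Api()
    run = api.run(run_path)
    # history() returns the logged metrics as a pandas DataFrame by
    # default; `keys` restricts it to the columns you care about
    # (these names depend on what your LightningModule logged).
    return run.history(keys=["epoch", "train_loss", "val_acc"])
```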

Please let me know if these answers are helpful. Also, if you would like any of these features to be available, feel free to tell me which ones and I will create a feature request!

Hi @luis_bergua1, thanks for the concrete reply! It’s really helpful, and now I understand why and how wandb is designed this way. run.history() is exactly what I was looking for, thank you very much!

Thanks to the wandb team for making this awesome package and its documentation. I have one thing that’s unrelated to our discussion: I found that the text formatting seems wrong here (the tables for Arguments and Returns don’t break lines correctly in Safari on my Mac, which makes them a little hard to read). This isn’t an urgent issue, and I think it can be fixed some day in the future :slight_smile: .

Cheers!

Hi @ntuyianchen, great to see that run.history() works for you! Thank you very much for the kind feedback, we really appreciate it! Regarding the table in the docs, I’m not fully understanding you: would you like a different alignment for the text, or to have it justified?

Hi @luis_bergua1, sorry for my ambiguous feedback. I was expecting the arguments to be rendered something like

samples (int, optional): The number …
pandas (bool, optional): Return …
keys (list, optional): Only return …

since without line breaks it’s quite hard to spot each argument at first sight.

Thanks a lot for clarifying @ntuyianchen! I see the issue now, I’ll report it!