Is it possible to continue training with additional epochs? Also, where can I find the logs locally?

Dear wandb team,

I have recently started using pytorch-lightning together with wandb, so I am using WandbLogger. (Due to link limits, I didn’t include the URL for the documentation.)

Suppose I have trained a model for, say, 10 epochs. The project is wandb_toy, and the run name/ID is toy. After training, two folders are automatically created under ./, i.e., ./wandb and ./wandb_toy. I know that the checkpoints are saved in wandb_toy/toy/checkpoints, and a new run folder appears in ./wandb, say ./wandb/run-20230101_102000.

Now, if I want to continue training from epoch 10 to epoch 20, I know I can load the model state by passing ckpt_path to trainer.fit(), and update the config with allow_val_change=True. However, an additional run folder is still created, e.g., ./wandb/run-20230202_104000.
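For reference, here is a minimal sketch of the resume setup I described above. The project/run names and the checkpoint path are just the placeholders from my example, and MyLightningModule stands in for my actual model:

```python
def resume_training(model, ckpt_path: str, max_epochs: int = 20):
    """Sketch: resume a pytorch-lightning run logged with WandbLogger.

    `model` is any LightningModule; `ckpt_path` is a placeholder for the
    checkpoint file saved under wandb_toy/toy/checkpoints in my example.
    Assumes pytorch-lightning and wandb are installed and you are logged in.
    """
    import pytorch_lightning as pl
    from pytorch_lightning.loggers import WandbLogger

    logger = WandbLogger(project="wandb_toy", name="toy")
    # allow_val_change lets us overwrite a config key that was already
    # logged during the first training run (e.g., max_epochs 10 -> 20).
    logger.experiment.config.update(
        {"max_epochs": max_epochs}, allow_val_change=True
    )

    trainer = pl.Trainer(max_epochs=max_epochs, logger=logger)
    # ckpt_path restores model weights, optimizer state, and the epoch
    # counter, so training continues from epoch 10 rather than epoch 0.
    trainer.fit(model, ckpt_path=ckpt_path)
```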


My questions are:

  • Is there a way to keep saving or updating things in the original run folder ./wandb/run-20230101_102000?
  • Also, is there any way to name the run folder, for example, change ./wandb/run-20230101_102000 to ./wandb/my_toy_run? I have already tried keyword arguments like dir and save_dir, but they don’t seem to do this.
  • If I delete the project on wandb.ai (without deleting the folders in ./wandb) and then continue training from epoch 10 to epoch 20, it will only log epochs 10~20. Is there any way to still recover the previous logs from epochs 0~10? I have looked into Save & Restore Files and Resume Runs, but unfortunately I couldn’t figure it out.
  • If I want to see each epoch’s logs (e.g., accuracy and loss) locally, which file should I look at?

Thanks for reading, and I apologize if anything is unclear.

Best wishes,
Yian

Hi @ntuyianchen, thanks for writing in and for sharing this detailed explanation of your use-case! Answering your questions:

  • This is not possible at the moment: every time you resume a run, a new run-date_time folder is created even though the run itself is the same, and this folder name cannot be changed other than manually. This behavior makes it easy to track the different processes separately.
  • You cannot resume a run at a previous step; it will only resume from the last step. This is intended to avoid overwriting previously logged data.
  • Logged metrics are not saved locally (other than the last epoch in files/wandb-metadata.json), but they are accessible through our public API via run.history(). Here you can have a look at the documentation on how to do this.
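As a rough sketch of that public-API lookup (the entity/project/run-ID path and the metric key names below are placeholders, not values from your project):

```python
def fetch_history(run_path: str):
    """Sketch: pull logged metrics for a run via the wandb public API.

    `run_path` has the form "<entity>/<project>/<run_id>", e.g.
    "my-team/wandb_toy/abc123" (hypothetical). Requires `wandb login`
    to have been run beforehand.
    """
    import wandb

    api = wandb.Api()
    run = api.run(run_path)
    # history() returns the logged metrics as a pandas DataFrame by
    # default; `keys` restricts it to the columns you care about
    # (these names depend on what your LightningModule logged).
    return run.history(keys=["epoch", "train_loss", "val_acc"])
```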

Please let me know if these answers are helpful. Also, if you would like any of these features to be available, feel free to tell me which ones and I will create a feature request!

Hi @luis_bergua1, thanks for the concrete reply! It’s really helpful, and now I understand why and how wandb is designed this way. run.history() is exactly what I was looking for, thank you very much!

Thanks to the wandb team for making this awesome package and its documentation. I have one thing that’s unrelated to our discussion: I found that the text formatting seems wrong here (the tables for Arguments and Returns don’t break lines correctly in Safari on my Mac, which makes them a little hard to read). This isn’t an urgent issue, and I think it can be fixed some day in the future :slight_smile: .

Cheers!

Hi @ntuyianchen, great to see that run.history() works for you! Thank you very much for the kind feedback, we really appreciate it! Regarding the table in the docs, I’m not fully understanding you: would you like a different alignment for the text, or to have it justified?

Hi @luis_bergua1, sorry for my ambiguous feedback. I was expecting the arguments to be rendered something like

samples (int, optional): The number …
pandas (bool, optional): Return …
keys (list, optional): Only return …

since without line breaks it’s quite hard to spot each argument at first sight.

Thanks a lot for clarifying @ntuyianchen! I see the issue now, I’ll report it!