How to overwrite existing statistics in a plot when resuming from a checkpoint?

Hello all! I'm trying to set up proper training checkpointing and resuming for my code. So far I've gotten things to work, but there's one thing I'm still trying to figure out: how to get the logs in wandb overwritten/replaced after I load a checkpoint.

For instance, right now if I save a checkpoint at 5000 timesteps, let training run for a few thousand more timesteps, cancel it, and then load and resume training from that 5000-step checkpoint, the training plot looks like this:

[Screenshot: W&B training plot where the resumed run's curve continues from the cancelled run's last step instead of restarting at 5k]

This is because the built-in wandb Step value didn't reset back to 5k when I restarted training from the 5k checkpoint; it just kept going. What I would instead like is for the Step value to be synced with the point where I saved the checkpoint, so that when I resume, the existing plot is overwritten rather than continued. Is it possible to do this? Thanks in advance!
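For context, here is a simplified sketch of the kind of checkpointing I mean (the `save_checkpoint`/`load_checkpoint` helpers and checkpoint keys are illustrative, not my exact code):

```python
import torch
import torch.nn as nn

def save_checkpoint(model, optimizer, global_step, path="checkpoint.pt"):
    # Store the training step alongside the weights so a resumed run
    # knows where logging left off.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "global_step": global_step,
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["global_step"]

model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters())

save_checkpoint(model, optimizer, global_step=5000)

# After cancelling and restarting, training resumes from step 5000 ...
global_step = load_checkpoint(model, optimizer)

# ... but the wandb run already contains points past 5000, so logging
# with step=global_step replays steps the run has already recorded:
# wandb.log({"losses/alpha_loss": 0.1}, step=global_step)
```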

Hi @chulabhaya, thanks for your question! If I'm understanding you properly, you're resuming your run with wandb.init(id='run_id', resume='allow') and then trying to overwrite losses/alpha_loss with wandb.log({'losses/alpha_loss': value}, step=x), where x is an already-logged step? If so, this isn't currently possible: when you resume a run, you can only log from the latest step (or a higher one). I can create a feature request for this if you'd like; just explain your use case to me so I can share the full context with our Product Team. Thanks!
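For reference, a minimal sketch of the pattern I'm describing (the project name, run ID, and values are placeholders):

```python
import wandb

# Reattach to an existing run; resume='allow' resumes it if the ID exists.
run = wandb.init(project="my-project", id="run_id", resume="allow")

# Steps must only increase: logging at a step below run.step is
# warned about and dropped, so earlier points can't be overwritten.
wandb.log({"losses/alpha_loss": 0.25}, step=5000)  # dropped if run.step > 5000

# Logging from the latest step (or higher) works as expected.
wandb.log({"losses/alpha_loss": 0.20}, step=run.step + 1)
```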

Hi Chulabhaya,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

Best,
Luis
