Hello all! I’m trying to set up proper training checkpointing and resuming in my code. So far I’ve gotten things working, but there’s one thing I’m still trying to figure out: how to get the logs in wandb overwritten/replaced after I load a checkpoint.
For instance, right now if I save a checkpoint at 5000 timesteps, let training run for a few thousand more steps, cancel it, and then resume training from that 5000-step checkpoint, the training plot looks like this:
This happens because wandb’s built-in Step value didn’t reset back to 5k when I restarted training from the 5k checkpoint; it just kept incrementing. What I’d like instead is for the Step value to be synced with the checkpoint, so that when I resume, the existing plot from 5k onward is overwritten rather than continued. Is it possible to do this? Thanks in advance!
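For reference, here’s a minimal sketch of the pattern I’m working with (project name, checkpoint format, and metric names are just placeholders). The idea is to store my own step counter in the checkpoint and plot metrics against it via `wandb.define_metric`, instead of relying on wandb’s built-in Step, which only ever moves forward:

```python
import json
import os
import tempfile

try:
    import wandb  # optional here; only needed when actually logging
except ImportError:
    wandb = None

def save_checkpoint(path, model_state, global_step):
    # Persist the step alongside the weights so a resumed run
    # can pick up logging from the same x-axis position.
    with open(path, "w") as f:
        json.dump({"model_state": model_state, "global_step": global_step}, f)

def load_checkpoint(path):
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["model_state"], ckpt["global_step"]

run = None
if wandb is not None:
    run = wandb.init(project="demo", mode="offline")
    # Use a custom step metric as the x-axis for all train/* metrics,
    # rather than wandb's monotonically increasing built-in Step.
    wandb.define_metric("train/step")
    wandb.define_metric("train/*", step_metric="train/step")

path = os.path.join(tempfile.gettempdir(), "ckpt.json")
save_checkpoint(path, {"w": 0.1}, global_step=5000)

# Later, after cancelling training: resume from the checkpoint
# and continue logging from the restored counter.
_, global_step = load_checkpoint(path)
if run is not None:
    wandb.log({"train/loss": 0.42, "train/step": global_step})
    run.finish()
```

This gets the x-axis to restart at 5k, but (as far as I can tell) points already logged past 5k under the old counter values aren’t replaced, which is the part I’m stuck on.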