Hi all! Before my question, let me describe my setup. I run a training job offline, and then use a checkpoint from that training to resume training later, also offline. I use the same run ID for both the initial training and the resumed trainings, so wandb generates log folders whose names share the run ID but have different timestamps.
When I run wandb sync --sync-all, it appears to sync all of the directories. However, only some of the plots get updated with the new data from the resumed runs, while others don’t. Is there any reason why this might happen?
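For reference, my setup looks roughly like this (the project name, run ID, metric, and the exact resume flag are placeholders, not my actual training code):

```python
import wandb

# Rough sketch of the setup described above; project name, run ID, and the
# dummy metric are placeholders, not the actual training code.
run = wandb.init(
    project="my-project",
    id="abc123xyz",      # the same fixed ID is reused by the resumed run
    resume="allow",      # assumption: some resume flag is passed when resuming
    mode="offline",      # the training nodes have no network access
)

start_step = 0           # on a resumed run this would come from the checkpoint
for step in range(start_step, start_step + 100):
    run.log({"loss": 1.0 / (step + 1)}, step=step)

run.finish()

# Later, from a machine with network access:
#   wandb sync --sync-all
```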
Try running wandb sync ~/my_run_path/wandb/run-timestamp-runid for that particular run and see whether that syncs the run to wandb correctly.
Send a link to the project that has runs not updating.
Send the debug logs for a run that isn’t updating.
The debug.log and debug-internal.log files for the run can be found in the wandb folder located in the same directory where you ran your script. The wandb folder contains subfolders named run-DATETIME-ID, each representing a single run. Could you retrieve those log files from the folder corresponding to the specific run that is having problems?
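If it helps, a small snippet like this should list those files for every run folder (this assumes a recent wandb version, where the logs live in a logs/ subfolder of each run directory):

```python
from pathlib import Path

# Collect the per-run debug logs under the local wandb directory.
# Recent wandb versions keep them in <run folder>/logs/; adjust the glob if
# your version stores them directly in the run folder instead.
wandb_dir = Path("wandb")
for run_dir in sorted(wandb_dir.glob("*run-*")):   # matches run-* and offline-run-*
    for log_file in sorted(run_dir.glob("logs/debug*.log")):
        print(log_file)
```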
So I’ve been able to replicate the issue again. Regarding the steps you mentioned:
So I tried syncing the individual run and it didn’t work.
Here is a run from a project that isn’t updating: Weights & Biases
The weird thing about this, though, is that this run was able to update several times successfully, but at some point it just stopped updating and I don’t know why. You can see this in the Steps Per Second plot, where there are spikes at around 200k and 420k, which is when the run resumed from a checkpoint.
Here’s a set of debug logs. The first pair of debug logs is from the offline resume at 420k, which DID update the plots correctly (these are the logs with _correct). The second pair is from when I resumed training at the 700k mark; with these offline folders the results don’t sync.
Looking into this, could I ask what packages you are using for your training?
It does look like the run has been synced to wandb, so could you try wandb sync --include-offline --include-synced and see if wandb can re-upload it? Also, updating your wandb to 0.14.0 may help. There was a similar bug in the past, but it was fixed, so I am curious to see what may be causing a single run to not be synced. Lastly, I looked into your debug logs and nothing stands out, but I will keep looking.
Hi @raphael-sanandres! Could you clarify what you mean by what packages I’m using for training? I’m not using any packages, just standard PyTorch.
I tried the sync command you provided, but sadly that didn’t work either. This is becoming a rather frustrating issue because it means I can’t reliably split long training jobs on clusters, since I don’t know whether wandb will actually be able to log the data. I appreciate you still looking into it, but in the meantime I might also start looking for a wandb alternative that handles offline jobs correctly.
I have been asking around internally about this issue, and we want to take a closer look at your local machine’s wandb setup to see if there is an issue with other parts of it. In the same directory where you ran your Python training script, there should be a wandb folder. This should be the same folder you navigated to in order to grab the debug bundle.
Hello @raphael-sanandres
Recently, I encountered a similar problem. By setting resume=True and mode="offline" in wandb.init(), I obtained multiple offline folders named with the same run ID but different timestamps, such as wandb/offline-run-20230426_140332-1pe01d2q and wandb/offline-run-20230426_140604-1pe01d2q. When I attempted to upload them separately using the wandb sync command, the training progress from the second uploaded folder did not show up in the charts or on the system page as expected. It is worth noting that: 1) on the overview page of the run, the start time, config, and summary were updated correctly; only the information on the chart page did not update. 2) The two offline folders start recording the training process from different steps because they load different checkpoints. When I did not load the checkpoint in the second experiment, the problem did not occur, but that is clearly not what I want.
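A rough sketch of what I’m doing (project name, metric, and step counts are simplified; the real code loads a model checkpoint between the two phases):

```python
import wandb

RUN_ID = "1pe01d2q"  # the same fixed ID is used for both phases

def train_phase(start_step: int, num_steps: int) -> None:
    # Both phases run offline under the same run ID; in the real code the
    # second phase loads a model checkpoint and continues from its step.
    run = wandb.init(project="my-project", id=RUN_ID,
                     resume=True, mode="offline")
    for step in range(start_step, start_step + num_steps):
        run.log({"loss": 1.0 / (step + 1)}, step=step)
    run.finish()

# First job: steps 0..999 -> wandb/offline-run-<timestamp1>-1pe01d2q
train_phase(start_step=0, num_steps=1000)

# Second job: steps 1000..1999 -> wandb/offline-run-<timestamp2>-1pe01d2q
train_phase(start_step=1000, num_steps=1000)

# Syncing each folder separately:
#   wandb sync wandb/offline-run-<timestamp1>-1pe01d2q
#   wandb sync wandb/offline-run-<timestamp2>-1pe01d2q
# After the second sync, the overview, config, and summary update, but the
# new points never appear on the chart page.
```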
Just this morning, I noticed the exact same thing Kevin just commented about, and have posted a detailed working example on a related GitHub issue. See [Feature] Resume offline runs · Issue #2423 · wandb/wandb · GitHub. I’m really hoping that a fix can be provided soon, since both offline runs and resuming from checkpoints are strict requirements in my use case, and they were certainly working at some point.
Thanks for everyone’s feedback! Since there are many of you reporting this, I will work on reproducing it and writing an internal report. Thank you for also posting a reproducible example!
Hi again; I just wanted to mention here as well that I got in contact with the support team by email and received a solution to the issue I described in my latest reply. See the GitHub issue mentioned above for the resolution that has been working for me. Thank you!