I am training a model with PyTorch Lightning on a cluster that enforces a limited job run time. I therefore train for a couple of epochs, save all results, and then start a new job that resumes training from the last checkpoint.
Previously I was running on a cluster where I could sync the Wandb log files directly, so when resuming training I also resumed logging like this:
import os
from lightning.pytorch.loggers import WandbLogger

# recover the run id from the run-<id>.wandb file of the latest run
latest_run_id = [x for x in os.listdir(f"{savedir_logging}/wandb/latest-run") if x.endswith(".wandb")][0].replace(".wandb", "").split("-")[-1]
wandb_logger = WandbLogger(project="project_name",
                           id=latest_run_id,
                           resume='must')
This worked exactly as intended: at the end of training I had a single Wandb run containing the training steps from all the jobs.
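For context, this is roughly how the logger is attached to the Trainer in each job; the model, datamodule and checkpoint path below are placeholders for my actual setup:

from lightning.pytorch import Trainer

# attach the resumed logger to a fresh Trainer in every job
trainer = Trainer(logger=wandb_logger, max_epochs=5)
# MyLitModule, my_datamodule and the checkpoint path are placeholders
trainer.fit(
    MyLitModule(),
    datamodule=my_datamodule,
    ckpt_path=f"{savedir_logging}/checkpoints/last.ckpt",
)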
Recently I moved to a different cluster where I need to use Wandb in offline mode. The other restrictions still apply.
Here I am unfortunately running into issues with the logging setup.
Using a setup analogous to the one above, I now have:
latest_run_id = [x for x in os.listdir(f"{savedir_logging}/wandb/latest-run") if x.endswith(".wandb")][0].replace(".wandb", "").split("-")[-1]
wandb_logger = WandbLogger(project="Macrophage_Screen_Classifiers_raven",
                           id=latest_run_id,
                           save_dir=savedir_logging,
                           resume='must',
                           mode='offline')
I get the following warning:
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 4ugfzt9r.
wandb: Tracking run with wandb version 0.19.2
wandb: W&B syncing is set to `offline` in this directory.
This results in several directories in my wandb folder that all contain the same run id in their name but have different timestamps.
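The run directories end up looking roughly like this (the timestamps here are just illustrative):

offline-run-20250110_083015-4ugfzt9r
offline-run-20250110_141203-4ugfzt9r
latest-run -> offline-run-20250110_141203-4ugfzt9r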
How do I get all of this into a single Wandb run that I can view and whose logged data I can access via the API?
How do I get rid of the warning message and correctly resume the Wandb run?
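For reference, this is roughly how I intend to read the logged metrics back via the API after syncing; the entity name below is a placeholder:

import wandb

api = wandb.Api()
# "my_entity" is a placeholder for my actual W&B entity
run = api.run(f"my_entity/Macrophage_Screen_Classifiers_raven/{latest_run_id}")
history = run.history()  # logged metrics as a pandas DataFrame
print(history.head())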