Hi everybody,
I recently started using Weights & Biases and am quite pleased with it. However, there is one problem I cannot figure out, and I am hoping somebody has an idea.
I use PyTorch Lightning and work in a cloud setting. I set up my logging like this:
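# ....
# Assuming the pytorch_lightning package; newer installs may use lightning.pytorch instead.
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(
    project="some_name",
    log_model="all",  # log checkpoints during training
    config={
        "all my": "config values",
        # ...
    },
)
trainer = pl.Trainer(
    accelerator=device,
    devices=1,
    max_epochs=max_epochs,
    logger=wandb_logger,
    check_val_every_n_epoch=50,  # validate only every 50 epochs
)
trainer.fit(model=my_model, datamodule=my_datamodule)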
In my training code, I log images like this:
def log_image(self, image: Image, title: str) -> None:
    logger = self.logger
    if isinstance(logger, WandbLogger):
        logger.log_image(
            key=title,
            images=[image],
            caption=[f"Step: {self.global_step}"],
            step=self.global_step,
        )
My problem is this: during my run on the cluster, the run sometimes gets synced only partially. This might be because of restricted internet access on the cluster, but I am not sure. Most of the metrics are uploaded, and some of the images. Often, however, images are only “half transmitted”: in my Weights & Biases dashboard, the name of the image is shown, but the image itself keeps loading and never finishes.
Thus, I cannot access the images. So, after the run finished, I tried running wandb sync 'wandb/my-run' on a node where I am sure to have normal internet access. As a result, all metrics and images are fully uploaded and available. However, this also overwrites my config with an empty one, so I lose my entire config and the run is somewhat useless.
Do you have any idea how to help me here?
I think one solution would be to sync only the media without overwriting the config, but I don’t know whether that is possible. Or I could write ‘files/config.yaml’ again at the end of the run, because that file somehow does not exist for some runs, but I am not sure how to do this safely without interfering with the run.
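To make the "write the config again" idea concrete, here is a rough sketch of one way it might work. It is an assumption on my side, not a confirmed fix: the config is saved to a local JSON file at training time and copied back onto the finished run through the W&B public API (run.config followed by run.update()). The helper names, the backup file name, and the "entity/project/run_id" path are hypothetical, and the sketch assumes the config is JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_config_backup(config: Dict[str, Any], path: Path) -> None:
    """Write the config to a local JSON file so a copy survives a bad sync."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2, sort_keys=True))


def restore_config(run: Any, path: Path) -> Dict[str, Any]:
    """Copy the backed-up config onto an existing run object and push it.

    `run` is expected to look like a run from the W&B public API:
    a dict-like `config` attribute and an `update()` method.
    """
    config = json.loads(path.read_text())
    run.config.update(config)  # modify the run's config in place
    run.update()               # persist the change to the server
    return config


def restore_from_wandb(run_path: str, backup: Path) -> Dict[str, Any]:
    """Fetch the run by its 'entity/project/run_id' path and restore its config."""
    import wandb  # imported lazily so the helpers above work without wandb

    run = wandb.Api().run(run_path)
    return restore_config(run, backup)
```

In this sketch, save_config_backup would be called once before trainer.fit, and restore_from_wandb after wandb sync has finished, from a node with normal internet access. Since it only touches the run after training is over, it should not interfere with the run itself.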
Another solution would be to fix whatever creates this weird “images are partly submitted” problem.
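One thing that might sidestep the partial uploads, again only as a sketch: run fully offline on the cluster and sync once afterwards. This assumes the flaky connection is the cause, which is not confirmed. WANDB_MODE and WANDB_DIR are environment variables read by wandb; the helper name and the log_dir argument are hypothetical.

```python
import os
from typing import Optional


def use_offline_mode(log_dir: Optional[str] = None) -> None:
    """Tell wandb to write everything to disk and upload nothing during the run.

    Must be called before the WandbLogger (and thus wandb.init) is created.
    """
    os.environ["WANDB_MODE"] = "offline"
    if log_dir is not None:
        # Optional: choose where the local run directory is written.
        os.environ["WANDB_DIR"] = log_dir
```

After training, the offline run directory can be uploaded with wandb sync from a node with normal internet access. Whether the config survives that sync would still need to be checked on a test run.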
Any help is greatly appreciated! Thank you!
Best
-Manuel