With TB SummaryWriter, only system logs are synced; no add_scalar values show up

I have an open-source Stable Diffusion fine-tuner and use TensorBoard locally. I am trying to integrate wandb with existing code that mostly just calls writer.add_scalar(…). I set up my SummaryWriter and then call wandb.init, but I'm seeing odd behavior: most of the time only system metrics (GPU temperature, memory, etc.) are logged to wandb, and my calls to writer.add_scalar never get recorded.

Everything seems to be failing silently and I don't know why nothing gets recorded. The other day, testing on two machines, it worked from one but not the other, and it is also not working from Colab notebook instances or Docker container runs.

The runs on wandb.com are created, and the console output shows wandb starting up and links me to the run; the run URL works, etc. But only system metrics show up and none of the items I log with SummaryWriter, at least in the vast majority of cases.

At one point it was working fine, then it stopped working. I had thought it was an issue with passing a dict of dicts to config={main: args, opt_cfg: optimizer_cfg}, but it fails even when passing in dummy objects or simply config=args. At one point wandb.init was done before writer instantiation, and that was fixed, so I'm not sure when things went sideways. I mostly run locally, but many users use Colab, Vast, etc., and wandb is a significantly better solution for those cases.

Is there any log file or debugging I can use to troubleshoot this? Unfortunately it is just not working and doing so silently without any feedback.

Hi @panopstor, are you using wandb.init(project='my-project', sync_tensorboard=True) or are you using wandb.tensorboard.patch(root_logdir="<logging_directory>") to enable TensorBoard syncing?

When using sync_tensorboard=True we attempt to find the event files, but if the SDK can't find them you end up with runs similar to what you are seeing: system metrics logged but no model metrics. I would recommend switching to wandb.tensorboard.patch(root_logdir="<logging_directory>") so you can explicitly point to the TB files. Here are the docs for this.
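Since the symptom above comes from the event files not being found, it can help to confirm that they exist where you expect before pointing root_logdir at a folder. The following is a minimal sketch using only the Python standard library. The helper name find_event_files and the "logs" folder are hypothetical and not part of wandb or TensorBoard; the only assumption about TensorBoard is that event file names start with events.out.tfevents.

```python
from pathlib import Path


def find_event_files(log_dir):
    """Return all TensorBoard event files found under log_dir, sorted by path.

    Hypothetical helper for troubleshooting; not part of wandb or TensorBoard.
    """
    root = Path(log_dir)
    if not root.is_dir():
        # A missing folder means there is nothing for the sync to pick up.
        return []
    # Event file names start with "events.out.tfevents".
    return sorted(p for p in root.rglob("events.out.tfevents*") if p.is_file())


if __name__ == "__main__":
    # "logs" is a placeholder; use the directory you pass to SummaryWriter.
    for path in find_event_files("logs"):
        print(path, path.stat().st_size, "bytes")
```

If this returns an empty list for the folder you pass as root_logdir, the writer is probably writing somewhere else (or has not flushed yet), which would match the "system metrics only" runs.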

Also, you can sync the runs that didn’t upload model metrics by using the CLI command wandb sync --sync-tensorboard <path/to/tb/files>
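If you have several folders to re-sync, the same CLI command can be driven from a script. This is only a sketch: the function names build_sync_command and sync_tensorboard_dir are made up for illustration, and it assumes the wandb CLI is installed and on your PATH.

```python
import subprocess


def build_sync_command(tb_dir):
    """Build the argument list for `wandb sync --sync-tensorboard <tb_dir>`."""
    return ["wandb", "sync", "--sync-tensorboard", str(tb_dir)]


def sync_tensorboard_dir(tb_dir):
    """Run the sync command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_sync_command(tb_dir), check=True)
```

Calling sync_tensorboard_dir once per folder is equivalent to running the command by hand for each one.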

Let me know if this helps or if you still see the issue.

Thank you,

Hi, thanks for the help.

I got it working with the patch instead of init with sync, but now all my logged parameters are prefixed with the logdir, for example
when I'm calling
log_writer.add_scalar(tag="loss/log_step", scalar_value=loss_local, global_step=global_step)

and where “vg_sd15_wandb5_20230325-154049” must be picked up from the root_logdir I suppose. Is there any way to suppress this? It’s a lot of noise.

I'm still passing in project_name and run_name to init, which WandB respects, so the prefix on the parameters doesn't serve much purpose. I tried toggling the tensorboard_x, pytorch, and save args on patch(…). I pulled wandb off GitHub and it's not obvious where the prefix is being applied. Is that by design?

I'm using the latest tensorboard 2.12.0 and wandb 0.14.0. This was working a while back with just sync_tensorboard and then broke; the previous behavior was as desired.

Or maybe you can provide hints on why normal sync_tensorboard=True wouldn’t work? I’m currently trying to dig through the wandb code to see if I can figure it out… If there are any debug log flags I could send into wandb maybe that would help me. I’d like to know why it isn’t picking anything up on its own with normal sync.

Ah, I think I found the magic combination that works for my training script. Posting it for posterity in case anyone else has this issue and stumbles on this post. This is a raw torch trainer.

(e.g. log_folder = "logs/projectname20230325_124523", which contains the events.out.tfevents… file)

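        # Patch TensorBoard first, pointing wandb at the folder that holds the event files.
        wandb.tensorboard.patch(root_logdir=log_folder, pytorch=False, tensorboard_x=False, save=False)
        wandb_run = wandb.init(
            config={"main_cfg": vars(args), "optimizer_cfg": optimizer_config},
        )
        # The writer is created after wandb.init.
        log_writer = SummaryWriter(log_dir=log_folder...)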


tensorboard 2.12.0
wandb 0.14.0
