Unable to log each run when using the PyTorch Lightning integration

I’m able to log a training run with PyTorch Lightning + wandb in Google Colab, based on these instructions. Here’s a snippet of the code I’m running:

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="p", entity="e")
trainer = pl.Trainer(
    logger=wandb_logger,    # W&B integration
    ..
)
trainer.fit(model)

It outputs the link to the run, and I can see all of the stats, etc.

However, how can I retrain? If I try re-training with:

trainer = pl.Trainer(
    logger=wandb_logger,    # W&B integration
    ..
)
trainer.fit(model)

It doesn’t seem to log a new run. It doesn’t even seem to log the data to the existing run; the data is just completely lost.

If I try to create a new wandb logger before the re-training:

wandb_logger = WandbLogger(project="p", entity="e")
trainer = pl.Trainer(
    logger=wandb_logger,    # W&B integration
    ..
)
trainer.fit(model)

It times out after one minute with this error:

wandb: ERROR Error communicating with wandb process
wandb: ERROR For more info see: https://docs.wandb.ai/library/init#init-start-error
Problem at: /usr/local/lib/python3.7/dist-packages/pytorch_lightning/loggers/wandb.py 406 experiment
---------------------------------------------------------------------------
UsageError                                Traceback (most recent call last)
<ipython-input-44-6016437e3426> in <module>
----> 1 wandb_logger = WandbLogger(project="p", entity="e")

6 frames
/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_init.py in init(self)
    717                     backend.cleanup()
    718                     self.teardown()
--> 719                 raise UsageError(error_message)
    720             assert run_result and run_result.run
    721             if run_result.run.resumed:

UsageError: Error communicating with wandb process
For more info see: https://docs.wandb.ai/library/init#init-start-error

Am I using wandb + PyTorch Lightning the correct way? What is the expected lifecycle of the WandbLogger in relation to the pl.Trainer object?

Actually, never mind, it’s working now! I don’t know what I changed while debugging that fixed it; maybe I had forgotten to call .finish() on the first run.

Here’s the gist of the code I’m running:

import wandb
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="p", entity="e", log_model=True)
trainer = pl.Trainer(
    logger=wandb_logger,    # W&B integration
    ..
)
trainer.fit(model)
wandb.finish()

I’m able to run the above snippet repeatedly and it creates a new run each time.
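
In case it helps anyone else, here’s a slightly fuller, self-contained sketch of that lifecycle in a loop. The toy model, the random data, max_epochs=1, and the learning-rate values are made up purely for illustration, and project="p" / entity="e" are the same placeholders as above; the point is just that each iteration builds a fresh WandbLogger and calls wandb.finish() after trainer.fit(), so every iteration shows up as its own run:

import torch
import wandb
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.loggers import WandbLogger

# Toy LightningModule just so the sketch is self-contained.
class ToyModel(pl.LightningModule):
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)

# Random data, purely for illustration.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16
)

for lr in [1e-2, 1e-3]:  # made-up sweep values
    wandb_logger = WandbLogger(project="p", entity="e")  # fresh logger -> new run
    trainer = pl.Trainer(logger=wandb_logger, max_epochs=1)
    trainer.fit(ToyModel(lr), train_loader)
    wandb.finish()  # close this run so the next iteration starts a fresh one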
