No config file and system plots for offline runs

When uploading offline runs, no config is shown in the dashboard. Additionally, there are no system plots.

Is there any way to fix this issue?

Thank you very much for your help!
Cedric

Hi @vanillawhey, do you get any errors when calling `wandb sync`, or does it seem that the run is synced without any issues?

Also, would you mind sharing a link to one of your runs here and I can take a look?

One last thing to check is the `wandb-summary.json` file, which should be located in your offline run directory on your machine. Is it empty, or do you see all of the summary metrics you logged in there?
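These checks can be automated with a small stdlib-only sketch. Note this is illustrative, not part of wandb itself: `inspect_offline_run` is a hypothetical helper, and it searches recursively because the exact layout of an offline run directory (e.g. whether the files sit under a `files/` subdirectory) can differ between wandb versions.

```python
from pathlib import Path


def inspect_offline_run(run_dir):
    """Report whether wandb-summary.json and config.yaml exist and
    whether they are empty.

    Searches the run directory recursively, since the exact layout
    can vary between wandb versions.
    """
    run_dir = Path(run_dir)
    report = {}
    for name in ("wandb-summary.json", "config.yaml"):
        matches = list(run_dir.rglob(name))
        if not matches:
            report[name] = "missing"
        else:
            content = matches[0].read_text().strip()
            report[name] = "empty" if not content else content
    return report
```

Running it on an offline run directory should tell you at a glance whether the summary file was written and whether a config file exists at all.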

Thank you,
Nate

Hi @nathank,
thank you for taking the time to help us with our problem.
We get a warning when synchronizing, but uploading the files works.

```
wandb sync -e jgu-wandb -p peer-learning --id id_name Path_to_folder

wandb: WARNING Found {} directories containing tfevent files. If these represent multiple experiments, sync them individually or pass a list of paths.
Found 24 tfevent files in Path_to_folder
Syncing: https://wandb.ai/jgu-wandb/peer-learning/runs/id_name ...
```

The link to a run: link

The `wandb-summary.json` is written after the upload and contains only one line:

> '{"_wandb": {"runtime": 40591}}'

The `wandb-metadata.json` is more meaningful:

> {
>     "os": "Linux-4.18.0-348.12.2.el8_5.x86_64-x86_64-with-centos-8.5-Arctic_Sphynx",
>     "python": "3.7.4",
>     "heartbeatAt": "2022-04-27T17:42:33.863340",
>     "startedAt": "2022-04-27T17:42:32.829508",
>     "docker": null,
>     "cpu_count": 40,
>     "cuda": null,
>     "args": [
>         "--save-name",
>         ...
>     ],
>     "state": "running",
>     "program": "run_dictator_new.py",
>     "codePath": "run_dictator_new.py",
>     "git": {
>         ...
>     },
>     "email": null,
>     "root": "...",
>     "host": "...",
>     "username": "...",
>     "executable": "/cluster/easybuild/broadwell/software/Python/3.7.4-GCCcore-8.3.0/bin/python"
> }

Our workflow is to train on a cluster without direct internet access. After training, the data is copied via SSH to a machine with internet access and synchronized from there with wandb.
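For completeness, the sync step of this hand-off could be scripted roughly as below. This is a sketch with a hypothetical helper (`build_sync_commands`): it only discovers the copied `offline-run-*` directories and builds one `wandb sync` command per run (using the same `-e`/`-p` flags as our command above) without executing anything.

```python
from pathlib import Path


def build_sync_commands(wandb_dir, entity="jgu-wandb", project="peer-learning"):
    """Build one `wandb sync` command per offline run directory.

    Offline runs are stored as wandb/offline-run-<timestamp>-<id>.
    Returning the command lists instead of running them keeps this
    a dry run; pass each one to subprocess.run() to actually sync.
    """
    commands = []
    for run_dir in sorted(Path(wandb_dir).glob("offline-run-*")):
        commands.append(
            ["wandb", "sync", "-e", entity, "-p", project, str(run_dir)]
        )
    return commands
```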
I hope you can help us further,
Jannis

Hi @wwjbrugger, no problem, and thank you for the info!

This looks like it could be a bug related to syncing this run. Are you able to run this on a machine that has internet access, such as a Colab notebook, and see if you get the same results? Even a minimal example training for just one epoch would help.

Also, how are you passing in the config to W&B in your training script?

Thank you,
Nate

Hi Nathan,

Thanks for looking into this. We’re passing the config as a dict to the `wandb.init` method. The config is correctly stored and uploaded for online runs. For offline runs, it works neither on a local machine with internet access nor with the results from the computing cluster. We’ve realized that for those runs a `config.yaml` is never created, in contrast to online runs.

Best,
Cedric

@vanillawhey thank you for the update. Are you logging anything outside of TensorBoard, for example with `wandb.log()`?

Let’s try syncing with `wandb sync --no-sync-tensorboard <path/to/run>` and see if we can get the config and summary metrics to show up.

I still think this is a bug related to the combination of TensorBoard and offline mode. I’m currently trying to replicate it on my end.
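In the meantime, one possible stopgap would be to write the config into the run's files directory by hand before syncing. Treat this strictly as a sketch: `write_config_yaml` is a hypothetical helper, and it assumes the `config.yaml` layout that online runs produce (a `wandb_version: 1` header plus a `desc`/`value` mapping per key). I have not verced that `wandb sync` picks up a hand-written file, so please test it on a throwaway run first.

```python
from pathlib import Path


def write_config_yaml(run_files_dir, config):
    """Write a config.yaml in the layout wandb uses for online runs.

    Assumes `config` is a flat dict of scalars; nested values would
    need proper YAML serialization instead of this naive formatting.
    """
    lines = ["wandb_version: 1", ""]
    for key, value in config.items():
        lines.append(f"{key}:")
        lines.append("  desc: null")
        lines.append(f"  value: {value}")
    path = Path(run_files_dir) / "config.yaml"
    path.write_text("\n".join(lines) + "\n")
    return path
```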

Lastly, could you mention which version of W&B you are running?