No config file and system plots for offline runs

When we upload offline runs, no config is shown in the dashboard.
There are also no system plots.

Is there any way to fix this issue?

Thank you very much for your help!
Cedric

Hi @vanillawhey, do you get any errors when calling wandb sync or does it seem that the run is synced without any issues?

Also, would you mind sharing a link to one of your runs here and I can take a look?

One last thing to check is the wandb-summary.json that should be located in your offline run directory on your machine. Is this empty or do you see all of the summary metrics you logged in there?
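A quick way to run that check, as a minimal sketch: the helper below loads the summary file and lists every key that does not start with an underscore. The name `logged_summary_keys` is made up for illustration, and treating underscore-prefixed keys (such as `_wandb`) as internal is an assumption of this sketch. Point it at the wandb-summary.json of your own offline run.

```python
import json
from pathlib import Path


def logged_summary_keys(summary_path):
    """Return the user-logged keys found in a wandb-summary.json file.

    Keys starting with an underscore (such as "_wandb") are treated as
    internal bookkeeping and skipped. That filter is an assumption of this
    sketch, not a documented W&B rule.
    """
    text = Path(summary_path).read_text(encoding="utf-8").strip()
    if not text:
        return []  # an empty file means nothing was written at all
    summary = json.loads(text)
    return sorted(key for key in summary if not key.startswith("_"))


if __name__ == "__main__":
    import sys

    # Usage: python check_summary.py path/to/wandb-summary.json
    if len(sys.argv) == 2:
        keys = logged_summary_keys(sys.argv[1])
        print(keys if keys else "no user-logged summary metrics found")
```

An empty list means the file holds no metrics you logged yourself, which would point at the sync step rather than the dashboard.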

Thank you,
Nate

Hi @nathank,
thank you for taking the time to help us with our problem.
We get a warning when synchronizing, but uploading the files works.

wandb sync -e jgu-wandb -p peer-learning --id id_name Path_to_folder

wandb: WARNING Found {} directories containing tfevent files. If these represent multiple experiments, sync them individually or pass a list of paths.
Found 24 tfevent files in Path_to_folder
Syncing: https://wandb.ai/jgu-wandb/peer-learning/runs/id_name ...

The link to a run: link

The wandb-summary.json is written after the upload and contains only one line

> '{"_wandb": {"runtime": 40591}}'

The wandb-metadata.json is more meaningful:

> {
>     "os": "Linux-4.18.0-348.12.2.el8_5.x86_64-x86_64-with-centos-8.5-Arctic_Sphynx",
>     "python": "3.7.4",
>     "heartbeatAt": "2022-04-27T17:42:33.863340",
>     "startedAt": "2022-04-27T17:42:32.829508",
>     "docker": null,
>     "cpu_count": 40,
>     "cuda": null,
>     "args": [
>         "--save-name",
>         ...
>     ],
>     "state": "running",
>     "program": "run_dictator_new.py",
>     "codePath": "run_dictator_new.py",
>     "git": {
>         ...
>     },
>     "email": null,
>     "root": "...",
>     "host": "...",
>     "username": "...",
>     "executable": "/cluster/easybuild/broadwell/software/Python/3.7.4-GCCcore-8.3.0/bin/python"
> }

Our workflow is training on a cluster without direct internet access.
After the training, the data is copied to a computer via SSH and synchronized from there with wandb.
I hope you can help us further,
Jannis

Hi @wwjbrugger no problem and thank you for the info!

This looks like it could be a bug related to syncing this run. Are you able to run this on a machine that has internet access, such as a Colab, and see if you get the same results? Even a minimal example that trains for only 1 epoch would help.

Also, how are you passing in the config to W&B in your training script?

Thank you,
Nate

Hi Nathan,

Thanks for looking into this. We’re passing the config as a dict into the wandb.init method. The config is correctly stored and uploaded for online runs. For offline runs, it works neither on a local machine with internet access nor with the results from the computing cluster. We’ve noticed that, in contrast to the online runs, a config.yaml is never created for those runs.

Bests,
Cedric

@vanillawhey thank you for the update. Are you logging anything outside of TensorBoard? For example with wandb.log()?

Let’s try to sync with wandb sync --no-sync-tensorboard <path/to/run> and see if we can get the config and summary metrics to show up.

I still think this is a bug related to the combination of TensorBoard and offline mode. I’m currently trying to replicate it on my end.

Lastly, could you mention which version of W&B you are running?

Hi @vanillawhey, I wanted to check back and see if you had a chance to try this out?

Hi @nathank,

sorry for my late reply and thank you very much for your help. :slight_smile:
We’ve managed to get the config displayed in the dashboard.

It turns out we had specified the wrong path for the upload, namely the experiment folder.
The config is uploaded and displayed correctly when we specify the offline run folder.

Example
old : wandb sync …/experiment_name
new : wandb sync …/experiment_name/wandb/offline-run-20220515_002356-jdlxek9r
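
To avoid picking the wrong folder again, the offline run directories can be looked up instead of typed by hand. A minimal sketch, assuming the layout from the example above (a `wandb` subfolder inside the experiment folder that holds `offline-run-*` directories). The helper name `find_offline_runs` is hypothetical, and the script only prints the `wandb sync` commands rather than running them.

```python
from pathlib import Path


def find_offline_runs(experiment_dir):
    """Return the offline run directories below <experiment_dir>/wandb, sorted by name."""
    wandb_dir = Path(experiment_dir) / "wandb"
    return sorted(p for p in wandb_dir.glob("offline-run-*") if p.is_dir())


if __name__ == "__main__":
    import shlex
    import sys

    # Usage: python list_offline_runs.py path/to/experiment_name
    if len(sys.argv) == 2:
        for run_dir in find_offline_runs(sys.argv[1]):
            # Print one sync command per run; copy and run the ones you need.
            print("wandb sync " + shlex.quote(str(run_dir)))
```

Printing the commands first makes it easy to confirm that each path ends in an `offline-run-...` folder before anything is uploaded.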

However, we now get a new error:

wandb: ERROR Metric data exceeds maximum size of 10.4MB (12.4MB)
wandb: ERROR Summary data exceeds maximum size of 10.4MB. Dropping it.
done.

However, the configuration is displayed correctly on the website.
Thanks again for your help. If we can’t get the new error under control, we’ll ask in a new issue.

Bests,
Cedric and Jannis

@vanillawhey glad this was able to resolve the issue for you!

For the maximum upload size issue, one thing I would recommend is limiting the frequency of logging if you are logging any sort of histogram such as gradients with wandb.watch().
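
As a rough illustration of what limiting the frequency can look like in a training loop: the wrapper below forwards only every n-th call to the underlying logging function. `EveryNSteps` is a hypothetical helper, not part of the wandb API; in a real script you would pass `wandb.log` as `log_fn`. If the histograms come from `wandb.watch()`, its `log_freq` argument should serve the same purpose.

```python
class EveryNSteps:
    """Forward only every n-th call to the wrapped logging function."""

    def __init__(self, log_fn, every):
        if every < 1:
            raise ValueError("every must be at least 1")
        self.log_fn = log_fn
        self.every = every
        self.calls = 0

    def __call__(self, data):
        # Forward calls 0, n, 2n, ... and drop the ones in between.
        forward = self.calls % self.every == 0
        self.calls += 1
        if forward:
            self.log_fn(data)
        return forward


if __name__ == "__main__":
    # Stand-in for wandb.log so the sketch runs without a W&B account.
    received = []
    log_heavy = EveryNSteps(received.append, every=100)
    for step in range(1000):
        log_heavy({"step": step})
    print(len(received))  # prints 10
```

Cheap scalars can still go through the normal logging call on every step; the wrapper is only meant for the large payloads.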

If you find that you’re still struggling to stay under the limit, feel free to start a new issue or use the chat in the UI, and we can look into what specifically is causing these large summary files.

Thank you,
Nate
