No config file and system plots for offline runs

vanillawhey · April 28, 2022, 8:48am

When uploading offline runs, there is no config shown in the dashboard.
Additionally, there are also no system plots.

Is there any way to fix this issue?

Thank you very much for your help!
Cedric

nathank · May 2, 2022, 2:32pm

Hi @vanillawhey, do you get any errors when calling wandb sync or does it seem that the run is synced without any issues?

Also, would you mind sharing a link to one of your runs here and I can take a look?

One last thing to check is the wandb-summary.json that should be located in your offline run directory on your machine. Is this empty or do you see all of the summary metrics you logged in there?

Thank you,
Nate

wwjbrugger · May 2, 2022, 4:31pm

Hi @nathank,
thank you for taking the time to help us with our problem.
We get a warning when synchronizing, but uploading the files works.

wandb sync -e jgu-wandb -p peer-learning --id id_name Path_to_folder

wandb: WARNING Found {} directories containing tfevent files. If these represent multiple experiments, sync them individually or pass a list of paths.
Found 24 tfevent files in Path_to_folder
Syncing: https://wandb.ai/jgu-wandb/peer-learning/runs/id_name ...

The link to a run: link

The wandb-summary.json is written after the upload and contains only one line

> '{"_wandb": {"runtime": 40591}}'
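
The check on the summary file can also be scripted. This is a minimal sketch: the helper name `user_summary_keys` is hypothetical, and treating every key that starts with an underscore as internal bookkeeping is an assumption of the sketch, not documented W&B behavior.

```python
import json


def user_summary_keys(summary_text):
    """Return the user-logged keys in a wandb-summary.json payload.

    Keys starting with an underscore (such as "_wandb") are treated as
    internal bookkeeping; that naming rule is an assumption of this sketch.
    """
    summary = json.loads(summary_text)
    return sorted(key for key in summary if not key.startswith("_"))


# The one-line summary quoted above contains no user metrics at all.
print(user_summary_keys('{"_wandb": {"runtime": 40591}}'))  # prints []
```

An empty list here means that none of the logged metrics reached the summary file.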

The wandb-metadata.json is more meaningful:

> {
>     "os": "Linux-4.18.0-348.12.2.el8_5.x86_64-x86_64-with-centos-8.5-Arctic_Sphynx",
>     "python": "3.7.4",
>     "heartbeatAt": "2022-04-27T17:42:33.863340",
>     "startedAt": "2022-04-27T17:42:32.829508",
>     "docker": null,
>     "cpu_count": 40,
>     "cuda": null,
>     "args": [
>         "--save-name",
>         ...
>     ],
>     "state": "running",
>     "program": "run_dictator_new.py",
>     "codePath": "run_dictator_new.py",
>     "git": {
>         ...
>     },
>     "email": null,
>     "root": "...",
>     "host": "...",
>     "username": "...",
>     "executable": "/cluster/easybuild/broadwell/software/Python/3.7.4-GCCcore-8.3.0/bin/python"
> }

Our workflow is training on a cluster without direct internet access.
After the training, the data is copied to a computer via SSH and synchronized from there with wandb.
I hope you can help us further,
Jannis

nathank · May 9, 2022, 2:01pm

Hi @wwjbrugger no problem and thank you for the info!

This looks like it could be a bug related to syncing this run. Are you able to run this on a machine that has internet access, such as a Colab, and see if you get the same results? Even a minimal example that trains for only 1 epoch would do.

Also, how are you passing in the config to W&B in your training script?

Thank you,
Nate

vanillawhey · May 9, 2022, 3:36pm

Hi Nathan,

Thanks for looking into this. We’re passing the config as a dict into the wandb.init method. The config is correctly stored and uploaded for online runs. For offline runs, it works neither on a local machine with internet access nor with the results from the computing cluster. We’ve realized that for those runs, unlike for online runs, a config.yaml is never created.
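
For reference, a minimal offline reproduction of this setup could look like the sketch below. It assumes the `wandb` package is installed. The project name, the config values, and the helper name `run_offline_once` are placeholders and are not taken from the actual training script.

```python
EXAMPLE_CONFIG = {"learning_rate": 1e-3, "epochs": 1}  # placeholder values


def run_offline_once(config, project="offline-config-test"):
    """Start one offline run with a config dict, log a metric, and finish."""
    import wandb  # imported here so the module loads without wandb installed

    run = wandb.init(project=project, config=config, mode="offline")
    run.log({"loss": 0.0})  # dummy metric so the summary is not empty
    run_dir = run.dir  # directory holding this run's files
    run.finish()
    return run_dir


# Usage (by default this writes an offline run folder under ./wandb):
# print(run_offline_once(EXAMPLE_CONFIG))
```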

Bests,
Cedric

nathank · May 13, 2022, 11:38pm

@vanillawhey thank you for the update. Are you logging anything outside of TensorBoard? For example with wandb.log()?

Let’s try to sync with wandb sync --no-sync-tensorboard <path/to/run> and see if we can get the config and summary metrics to show up.

I still think this is a bug related to the combination of TensorBoard and offline mode. I’m currently trying to replicate it on my end.

Lastly, could you mention which version of W&B you are running?

nathank · May 19, 2022, 8:47pm

Hi @vanillawhey, I wanted to check back and see if you had a chance to try this out?

vanillawhey · May 23, 2022, 5:29am

Hi @nathank,

sorry for my late reply and thank you very much for your help.
We’ve managed to get the config displayed in the dashboard.

Unfortunately, we had specified the wrong path for the upload, i.e., the experiment folder.
The config is uploaded and displayed correctly when we specify the offline run folder.

Example
old : wandb sync …/experiment_name
new : wandb sync …/experiment_name/wandb/offline-run-20220515_002356-jdlxek9r
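
To avoid typing the run folder by hand, the offline run folders can be collected from the experiment folder. This is a minimal sketch that assumes the layout shown in the example (`<experiment>/wandb/offline-run-*`); the function names `find_offline_runs` and `sync_commands` are hypothetical helpers, not part of W&B.

```python
from pathlib import Path
import tempfile


def find_offline_runs(experiment_dir):
    """Return the offline-run-* folders under <experiment_dir>/wandb, sorted."""
    wandb_dir = Path(experiment_dir) / "wandb"
    return sorted(p for p in wandb_dir.glob("offline-run-*") if p.is_dir())


def sync_commands(experiment_dir):
    """Build one `wandb sync <run folder>` argument list per offline run."""
    return [["wandb", "sync", str(run)] for run in find_offline_runs(experiment_dir)]


# Demo on a throwaway directory that mimics the layout from this thread.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "experiment_name" / "wandb" / "offline-run-20220515_002356-jdlxek9r"
    run_dir.mkdir(parents=True)
    for cmd in sync_commands(Path(tmp) / "experiment_name"):
        print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run` to sync the runs one at a time.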

We now get a new error, though:

wandb: ERROR Metric data exceeds maximum size of 10.4MB (12.4MB)
wandb: ERROR Summary data exceeds maximum size of 10.4MB. Dropping it.
done.

However, the configuration is displayed correctly on the website.
Thanks again for your help. If we can’t get the new error under control, we’ll ask in a new issue.

Bests,
Cedric and Jannis

nathank · May 23, 2022, 7:29pm

@vanillawhey glad this was able to resolve the issue for you!

For the maximum upload size issue, one thing I would recommend is limiting the frequency of logging if you are logging any sort of histogram such as gradients with wandb.watch().
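
A minimal sketch of the idea, under assumptions: for `wandb.watch()`, the `log_freq` argument controls how often gradients are logged, and for histograms logged by hand a small wrapper can forward only every N-th step. The `EveryNSteps` class is a hypothetical helper, not part of W&B, and `sink` stands in for a function such as `wandb.log`.

```python
class EveryNSteps:
    """Forward a payload to `sink` only on steps divisible by `every`."""

    def __init__(self, sink, every):
        if every < 1:
            raise ValueError("every must be >= 1")
        self.sink = sink
        self.every = every

    def __call__(self, step, payload):
        if step % self.every == 0:
            self.sink(payload)
            return True
        return False


# Demo with a list as the sink; in training code the sink could be wandb.log.
logged = []
log_histogram = EveryNSteps(logged.append, every=100)
for step in range(1000):
    log_histogram(step, {"step": step})
print(len(logged))  # prints 10
```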

If you find that you’re still struggling to stay under the limit, feel free to start a new issue or use the chat in the UI, and we can look into what specifically may be causing these large summary file sizes.

Thank you,
Nate

system · July 22, 2022, 7:29pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
