My training process ended unsuccessfully because the upload to W&B failed.
I often use a script file to run multiple experiments at once. When one of them gets stuck, the others cannot run.
How can I skip past this?
Hi @yangze68, happy to help. From the attached image, the Network error (TransientError) points to potential packet loss caused by a network problem on the user's end. Even though a single experiment might fail, wandb will still execute subsequent runs, depending on how the experiments are set up.
Could you please share the debug.log and debug-internal.log files of the crashing run, so we can get a better sense of what else is happening with the run? These are located in the wandb/ folder of the project's working directory. Thank you.
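One way to set experiments up so that a failed or hung run does not block the rest is to launch each one as its own process with a timeout. The following is only a minimal sketch under assumptions: the function name `run_experiments` and the `train.py` commands in the comment are hypothetical, and `WANDB_MODE=offline` is used here so that no upload happens during training.

```python
import os
import subprocess


def run_experiments(commands, timeout_s=None, wandb_mode="offline"):
    """Run each command in its own process and keep going if one fails or hangs.

    Returns a list of (command, returncode) pairs; returncode is None when
    the command was stopped because it exceeded timeout_s.
    """
    # Copy the current environment and set WANDB_MODE for the child processes.
    env = dict(os.environ, WANDB_MODE=wandb_mode)
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, env=env, timeout=timeout_s)
            results.append((cmd, proc.returncode))
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout; move on to the next one.
            results.append((cmd, None))
    return results


# Hypothetical usage (train.py and its flags are placeholders):
# run_experiments(
#     [["python", "train.py", "--lr", "0.1"], ["python", "train.py", "--lr", "0.01"]],
#     timeout_s=6 * 3600,
# )
```

A non-zero or `None` return code marks the experiments that need a second look, while the remaining ones still get their turn.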
Hi @mohammadbakir, thanks for your reply. If I set the wandb mode to offline, I don't need to upload the files every time. But a new error occurred: when I use the command wandb sync --sync-all to upload the offline files, the upload speed is very slow. I think this problem is related to syncing with TensorBoard, as mentioned in the GitHub issue "[CLI] Slow uploads of offline runs" (Issue #1972, wandb/wandb).
How can I share the debug.log file with you? By e-mail or here?
Thanks again for your help
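As an alternative to a single `wandb sync --sync-all` call, the offline runs can be synced one directory at a time, so that a slow or failing upload is visible per run and does not hold up the others. This is only a sketch under assumptions: it assumes offline runs are stored as `offline-run-*` directories inside the `wandb/` folder, the helper names and the `runner` parameter are hypothetical, and it does not address whether TensorBoard syncing is the cause of the slowness.

```python
import subprocess
from pathlib import Path


def find_offline_runs(wandb_dir="wandb"):
    """Return offline run directories (assumed to be named offline-run-*)."""
    return sorted(p for p in Path(wandb_dir).glob("offline-run-*") if p.is_dir())


def sync_offline_runs(wandb_dir="wandb", timeout_s=600, runner=subprocess.run):
    """Call `wandb sync <run_dir>` for each offline run, one at a time.

    Returns {run_dir_name: returncode}; returncode is None if the sync
    exceeded timeout_s. `runner` exists only so the call can be replaced
    in tests.
    """
    outcomes = {}
    for run_dir in find_offline_runs(wandb_dir):
        try:
            proc = runner(["wandb", "sync", str(run_dir)], timeout=timeout_s)
            outcomes[run_dir.name] = proc.returncode
        except subprocess.TimeoutExpired:
            # Record the timeout and continue with the next run.
            outcomes[run_dir.name] = None
    return outcomes
```

Runs that come back with `None` or a non-zero code can be retried later or copied to a machine with a better connection.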
When I copy the cached folder to another machine, the upload succeeds, but it throws a warning.
Thank you for the update @yangze68. Please send them to support@wandb.ai and include my name in the subject line. I will perform some tests on the offline syncing of TensorBoard output to wandb and get back to you with my findings.
Hi @yangze68 , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!